aiacademy: Machine Learning Practical Concerns
Data Preprocessing
-
Transforming features
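- Min-max scaling: linearly scale each feature to the range [-1,1] or [0,1]
- Standardization: shift and scale each feature to zero mean and unit variance
- Robust scaling: scale each feature using statistics that are robust to outliers (e.g. the median and interquartile range)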
- Thresholding
- Applying log or exponential function
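The three scaling transforms listed above can be sketched with the Python standard library. This is a minimal illustration, not a reference implementation: the function names and the sample `feature` list are assumptions made for the example, and robust scaling is taken here to mean "subtract the median, divide by the interquartile range".

```python
import statistics

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # A constant feature has no spread; map every value to `low`.
        return [low for _ in values]
    return [low + (v - v_min) * (high - low) / span for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]

# Illustrative feature with one outlier (100.0).
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled_01 = min_max_scale(feature)
scaled_pm1 = min_max_scale(feature, low=-1.0, high=1.0)
z_scores = standardize(feature)
robust = robust_scale(feature)
```

With min-max scaling the single outlier squeezes the other four values into a narrow band near the lower bound, whereas the median and IQR are barely affected by it.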
-
Why scaling?
-
For KNN, scaling prevents the distance scores from being dominated by a few features
-
For linear regression, logistic regression, and SVM, scaling helps the optimization converge
-
For decision trees and random forests, scaling makes little difference
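The KNN point can be made concrete with a small distance computation. The two features (income in dollars, age in years), the three points, and the assumed feature ranges are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical samples: (income in dollars, age in years).
a = (50_000.0, 25.0)
b = (50_500.0, 60.0)   # similar income to a, very different age
c = (52_000.0, 26.0)   # different income from a, similar age

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# Unscaled: the income axis dominates, so b looks closer to a than c does.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Min-max scale each feature to [0, 1] using assumed (min, max) ranges.
ranges = ((0.0, 100_000.0), (0.0, 100.0))

def scale(p):
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(p, ranges))

# Scaled: both features contribute, and c becomes the nearer neighbour.
scaled_ab = euclidean(scale(a), scale(b))
scaled_ac = euclidean(scale(a), scale(c))
```

Before scaling, a 35-year age gap is almost invisible next to a 500-dollar income gap; after scaling, the nearest neighbour of `a` changes from `b` to `c`.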
-
Why log or exponential?
- Positively skewed -> log
- Values that were small are still small after the log, and values that were large are still large.
- Negatively skewed -> exponential
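A quick way to see the effect of the log on positively skewed data is to compare the skewness before and after the transform. The `incomes` list is made-up illustrative data, and `skewness` is a helper written for this sketch (the population skewness, i.e. the mean cubed z-score).

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])

# Illustrative positively skewed data: mostly small values, a few large ones.
incomes = [1, 2, 2, 3, 3, 3, 4, 5, 50, 400]
logged = [math.log(v) for v in incomes]

skew_before = skewness(incomes)
skew_after = skewness(logged)

# The log is monotonic, so the ordering of the values is unchanged:
# small values stay small and large values stay large.
order_preserved = sorted(logged) == logged
```

The log compresses the long right tail, so `skew_after` is smaller than `skew_before` while the ranking of the samples stays the same.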
-
Missing values
- If the type of the feature is categorical
- Replace the missing values with the most frequent value
-
If the type of the feature is numerical
- Replace the missing values with the mean or median
-
Some more advanced techniques
- Predict the missing values, usually by interpolation or extrapolation
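The imputation rules above could be sketched as follows. Missing values are represented by `None`, which is an assumption of this example, as are the function names. The interpolation helper only fills gaps that have a known value on both sides; it does not extrapolate.

```python
import statistics
from collections import Counter

def impute_categorical(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

def impute_numerical(values, strategy="median"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

colors = impute_categorical(["red", None, "blue", "red"])
ages = impute_numerical([20, None, 30, 100], strategy="median")
series = interpolate_linear([1.0, None, None, 4.0])
```

The median is often the safer default for numerical features because, unlike the mean, it is not pulled toward outliers such as the 100 in `ages`.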
-
Imbalanced classification
-
Most classification algorithms perform best when each class has a similar number of samples
- Common techniques for imbalanced datasets
- Under-sample the majority class
- Over-sample the minority class
- A combination of both under-sampling and over-sampling
- Some more advanced techniques
- “Generate” simulated instances from the minority class
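Random under-sampling and over-sampling can be sketched in a few lines. This example covers only the two basic resampling techniques, not the generation of synthetic instances; the helper names and the 8-vs-2 toy dataset are assumptions for illustration.

```python
import random
from collections import Counter

def group_by_class(samples, labels):
    """Collect the samples belonging to each class label."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups

def random_over_sample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    out_x, out_y = [], []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y

def random_under_sample(samples, labels, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    out_x, out_y = [], []
    for y, xs in groups.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y

samples = list(range(10))
labels = [0] * 8 + [1] * 2   # 8 majority samples vs. 2 minority samples

_, over_labels = random_over_sample(samples, labels)
_, under_labels = random_under_sample(samples, labels)
over_counts = Counter(over_labels)
under_counts = Counter(under_labels)
```

Resampling should be applied to the training portion only, so that the validation and test data keep the original class distribution.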
-
Data Preprocessing Summary
-
Consider transforming features
-
Fill missing values
-
Consider generating synthetic instances
-
Hyper-parameters
-
Selecting hyper-parameters
-
Examples of hyper-parameters
- K in KNN
- C in regularized linear classification
- step size in gradient descent
-
Strategies
- Grid search
- A fancy name for exhaustive search
- Manually specify a subset of the hyper-parameter space and a step size
- Try all parameter combinations and check their performance
- Random search
- Surprisingly good performance
- Among the tunable parameters, only a few are important
- Bayesian optimization
- Iteratively picking hyper-parameters for experiments
- Picking strategy: tradeoff between
-
Exploration (hyper-parameters for which the outcome is most uncertain), and
-
Exploitation (hyper-parameters which are expected to have a good outcome)
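Grid search and random search can be compared on a toy problem. The `validation_error` function below is entirely hypothetical: it stands in for "train a model with these hyper-parameters and measure its validation error", and it is built so that only `c` matters much, mirroring the point that few hyper-parameters are important. Bayesian optimization is not sketched here.

```python
import itertools
import random

def validation_error(k, c):
    """Hypothetical validation error (lower is better); `c` matters far more than `k`."""
    return (c - 0.3) ** 2 + 0.001 * (k - 5) ** 2

def grid_search(k_values, c_values):
    """Evaluate every combination on the grid and return the best one."""
    candidates = itertools.product(k_values, c_values)
    return min(candidates, key=lambda p: validation_error(*p))

def random_search(n_trials, seed=0):
    """Sample combinations at random from assumed ranges and return the best one."""
    rng = random.Random(seed)
    trials = [(rng.randint(1, 9), rng.uniform(0.0, 1.0)) for _ in range(n_trials)]
    return min(trials, key=lambda p: validation_error(*p))

# Both strategies get the same budget of 9 evaluations.
best_grid = grid_search(k_values=[1, 5, 9], c_values=[0.0, 0.5, 1.0])
best_random = random_search(n_trials=9)
```

With the same budget, the 3 x 3 grid tries only three distinct values of the important parameter `c`, while random search tries nine. That is one common explanation for why random search often does well when only a few hyper-parameters matter.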
-
Random search is surprisingly good
-
Validating selected hyper-parameters
-
Summary of hyper-parameter tuning
- Grid search, random search, and Bayesian optimization
- Training, validation, and test dataset
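The three-way split named in the summary can be sketched as below, assuming a simple random shuffle is appropriate (no time ordering or grouping in the data). The function name and the 60/20/20 default proportions are illustrative choices.

```python
import random

def train_val_test_split(samples, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then cut the data into three disjoint parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test_part = shuffled[:n_test]
    val_part = shuffled[n_test:n_test + n_val]
    train_part = shuffled[n_test + n_val:]
    return train_part, val_part, test_part

train_set, val_set, test_set = train_val_test_split(range(100))
```

The training set fits the model, the validation set is used to compare hyper-parameter choices, and the test set is touched only once at the end to estimate the performance of the selected configuration.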
Multi-class classification
-
Some models can handle multi-class classification naturally
- KNN
- Decision trees
- Neural networks
-
Leverage binary classifiers
- One-vs.-rest (aka one-vs.-all)
- One-vs.-one
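One-vs.-rest can be sketched with any binary classifier that produces a score. The example below uses a plain perceptron as the binary learner, which is a choice made for this illustration only; the toy 2-D points and the function names are likewise assumptions.

```python
def train_perceptron(points, targets, epochs=200):
    """Train a binary perceptron; targets are +1 or -1. Returns (weights, bias)."""
    weights, bias = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(points, targets):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if t * score <= 0:
                # Misclassified (or on the boundary): move the boundary toward x.
                weights = [w + t * xi for w, xi in zip(weights, x)]
                bias += t
    return weights, bias

def train_one_vs_rest(points, labels):
    """Train one binary classifier per class: that class (+1) vs. all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if y == cls else -1 for y in labels]
        models[cls] = train_perceptron(points, targets)
    return models

def predict_one_vs_rest(models, x):
    """Pick the class whose binary classifier gives the highest score for x."""
    def score(model):
        weights, bias = model
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return max(models, key=lambda cls: score(models[cls]))

# Toy data: three well-separated clusters in 2-D.
points = [(-5, 0), (-4, 1), (-5, -1),   # class "a"
          (5, 0), (4, 1), (5, -1),      # class "b"
          (0, 5), (1, 4), (-1, 5)]      # class "c"
labels = ["a"] * 3 + ["b"] * 3 + ["c"] * 3

models = train_one_vs_rest(points, labels)
predictions = [predict_one_vs_rest(models, p) for p in points]
```

One-vs.-rest trains one classifier per class, so three here. One-vs.-one would instead train one classifier per pair of classes and predict by majority vote among them.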
Model Selection
-
Bias vs variance
- The test error comes from
-
Bias: the error from the difference between the true model and the learned model
-
Variance: the error from sensitivity to small fluctuations in the training data
-
Noise: the error inherent in the data itself
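For squared-error loss, the three sources of test error listed above add up exactly. The notation below is an assumption of this sketch, not taken from the slides: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², f̂ is the model learned from a random training set, and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The noise term does not depend on the model, so model selection can only trade bias against variance.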
-
Model Complexity vs error
-
Training size vs error
-
Data size vs model complexity
-
Model selection summary
- Domain knowledge is important
- If you know the relationship between x and y is linear, why choose a quadratic model?
-
Model complexity is important
- Applying simple models to massive data tends to underfit
- Select an algorithm that is complex enough to fit the training data well
-
Data size is important
-
Applying complex models to small data tends to overfit
-
Collect enough data to prevent overfitting