aiacademy: Machine Learning Practical Concerns
Data Preprocessing

Transforming features
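 Min-max scaling: linearly scale each feature to the range [-1,1] or [0,1]
 Standardization: shift and scale each feature to zero mean and unit variance
 Robust scaling: scale features using statistics that are insensitive to outliers, such as the median and interquartile range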
 Thresholding
 Applying log or exponential function
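
The three scaling methods above can be sketched in plain Python. This is a minimal illustration using only the standard library; the function names and the sample feature are hypothetical, and the median/IQR form of robust scaling is one common choice rather than the only one.

```python
import statistics

def minmax_scale(values, low=0.0, high=1.0):
    """Linearly map values so that the minimum becomes low and the maximum becomes high."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        return [low for _ in values]  # constant feature: nothing to spread out
    return [low + (v - vmin) * (high - low) / span for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range (IQR)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical feature with one large outlier (1000.0).
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]

scaled_01 = minmax_scale(feature)              # range [0, 1]
scaled_pm1 = minmax_scale(feature, -1.0, 1.0)  # range [-1, 1]
standardized = standardize(feature)
robust = robust_scale(feature)
```

With min-max scaling the outlier squeezes every other value into a tiny interval near 0, while robust scaling keeps the non-outlier values spread over a range of roughly one unit.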

Why scaling?

For KNN, scaling prevents the distance scores from being dominated by a few features

For linear regression, logistic regression, and SVM, scaling helps the optimization converge

For decision trees and random forests, scaling makes little difference
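
To make the KNN point concrete, here is a small sketch in which the nearest neighbor changes once the features are scaled. The features (age, income), the sample points, and the assumed feature ranges are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rescale(point, ranges):
    """Min-max scale each coordinate to [0, 1] using the given (low, high) ranges."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))

# Hypothetical features: (age in years, income in dollars).
query = (30, 50_000)
same_age = (30, 52_000)      # identical age, slightly different income
same_income = (70, 50_500)   # very different age, almost identical income
candidates = [same_age, same_income]

# Raw distances are dominated by income because its numeric scale is much larger.
raw_nearest = min(candidates, key=lambda p: euclidean(query, p))

# Assumed feature ranges, used only for this illustration.
ranges = [(0, 100), (0, 100_000)]
scaled_query = rescale(query, ranges)
scaled_nearest = min(
    candidates, key=lambda p: euclidean(scaled_query, rescale(p, ranges))
)
```

Before scaling, the nearest neighbor is the person who is 40 years older; after scaling, it is the person of the same age.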


Why log or exponential?
 Positively skewed -> log
 Log preserves order: values that were small stay small and values that were large stay large, but the long right tail is compressed.
 Negatively skewed -> exponential
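
A quick way to see the effect is to compare skewness before and after the transform. The sketch below uses a simple population skewness (mean cubed z-score) and two small hypothetical samples; it only illustrates the direction of the change.

```python
import math
import statistics

def skewness(values):
    """Population skewness: mean cubed z-score (positive means a long right tail)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])

# Hypothetical positively skewed sample (long right tail), already sorted.
right_skewed = [1, 2, 3, 5, 8, 13, 100, 1000]
logged = [math.log(v) for v in right_skewed]

# Hypothetical negatively skewed sample (long left tail), already sorted.
left_skewed = [0.0, 2.0, 2.5, 2.8, 3.0]
exped = [math.exp(v) for v in left_skewed]

skew_right_before, skew_right_after = skewness(right_skewed), skewness(logged)
skew_left_before, skew_left_after = skewness(left_skewed), skewness(exped)
```

Both transforms are monotonic, so the ordering of the values is unchanged; only the spacing between them changes.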

Missing values
 If the type of the feature is categorical
 Replace the missing values with the most frequent value

If the type of the feature is numerical
 Replace the missing values with the mean or median

Some more advanced techniques
 Predict the missing values, usually by interpolation and extrapolation
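
The imputation strategies above can be sketched as follows. Missing values are represented by None, which is an assumption of this example, and the interpolation helper assumes the values are ordered and evenly spaced (for instance a time series).

```python
import statistics

def impute_categorical(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]

def impute_numerical(values, strategy="median"):
    """Replace None with the median (default) or the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

def interpolate_linear(values):
    """Fill interior gaps by linear interpolation between the nearest observed neighbors.

    Leading and trailing gaps are left as None; filling them would need extrapolation.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

colors = impute_categorical(["red", None, "blue", "red"])
incomes = impute_numerical([1.0, None, 3.0, 100.0])   # median is robust to the 100.0
series = interpolate_linear([10.0, None, None, 16.0, None])
```

The median is often preferred over the mean when the feature contains outliers, as in the second example.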

Imbalanced classification

Most classification algorithms perform optimally when the number of samples in each class is similar
 Common techniques for imbalanced datasets
 Undersample the majority class
 Oversample the minority class
 A combination of both undersampling and oversampling
 Some more advanced techniques
 “Generate” simulated instances from the minority class
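
The resampling techniques above can be sketched in a few lines. The helper names and the toy data are hypothetical. The synthesize function interpolates between two random minority samples; it is a simplified relative of SMOTE (which interpolates toward nearest neighbors), not SMOTE itself.

```python
import random

def random_undersample(majority, minority, rng):
    """Keep only as many majority samples as there are minority samples."""
    return rng.sample(majority, len(minority)), list(minority)

def random_oversample(majority, minority, rng):
    """Duplicate random minority samples until both classes have the same size."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority), list(minority) + extra

def synthesize(minority, n_new, rng):
    """Create new points on the line segment between two random minority samples."""
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        t = rng.random()
        new_points.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return new_points

rng = random.Random(0)  # fixed seed so the sketch is reproducible
majority = [(float(i), 0.0) for i in range(20)]
minority = [(0.0, 5.0), (1.0, 6.0), (2.0, 5.5), (1.5, 7.0)]

under_major, under_minor = random_undersample(majority, minority, rng)
over_major, over_minor = random_oversample(majority, minority, rng)
synthetic = synthesize(minority, n_new=16, rng=rng)
```

Resampling should be applied to the training data only, so that validation and test data keep the original class distribution.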


Data Preprocessing Summary

Consider transforming features

Fill missing values

Consider generating synthetic instances

Hyperparameters

Selecting hyperparameters

Examples of hyperparameters
 K in KNN
 C in regularized linear classification
 step size in gradient descent

Strategies
 Grid search
 A fancy name for exhaustive search
 Manually specify a subset of the hyperparameter space and a step size
 Try all parameter combinations and check their performance
 Random search
 Surprisingly good performance
 Among the tunable parameters, only a few are important
 Bayesian optimization
 Iteratively picking hyperparameters for experiments
 Picking strategy: tradeoff between

Exploration (hyperparameters for which the outcome is most uncertain), and

Exploitation (hyperparameters which are expected to have a good outcome)
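
Grid search and random search can be compared with a small sketch. The validation_score function below is a hypothetical stand-in for "train a model and measure validation performance"; it is built so that one hyperparameter matters a lot and the other barely matters. Both searches work on a log10 scale and use the same budget of 15 trials. Bayesian optimization is not shown.

```python
import itertools
import random

def validation_score(log_c, log_step):
    """Hypothetical score: depends strongly on log_c and only weakly on log_step."""
    return 1.0 - 0.1 * log_c ** 2 - 0.001 * (log_step + 2) ** 2

# Grid search: hand-picked values per hyperparameter, every combination is tried.
grid_log_c = [-2, -1, 0, 1, 2]
grid_log_step = [-3, -2, -1]
grid_trials = list(itertools.product(grid_log_c, grid_log_step))
best_grid = max(grid_trials, key=lambda t: validation_score(*t))

# Random search: the same number of trials, sampled uniformly from the same ranges.
rng = random.Random(0)
random_trials = [
    (rng.uniform(-2, 2), rng.uniform(-3, -1)) for _ in range(len(grid_trials))
]
best_random = max(random_trials, key=lambda t: validation_score(*t))

# How many different values of the important hyperparameter did each search try?
distinct_c_grid = len({c for c, _ in grid_trials})      # 5
distinct_c_random = len({c for c, _ in random_trials})
```

With the same budget, the grid only ever tries five values of the important hyperparameter, while random search tries a different value in every trial. This is one intuition for why random search tends to work well when only a few hyperparameters matter.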



Random search is surprisingly good

Validating selected hyperparameters

Summary of hyperparameter tuning
 Grid search, random search, and Bayesian optimization
 Training, validation, and test dataset
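
The role of the three datasets can be sketched with a tiny KNN written from scratch. The synthetic two-cluster data, the 60/20/20 split, and the candidate values of K are assumptions made for this example.

```python
import math
import random
from collections import Counter

def knn_predict(train, point, k):
    """Predict the majority label among the k training points nearest to point."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of rows in data that KNN (fit on train) labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)

# Synthetic data: two Gaussian clusters, one per class.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)]
data += [((rng.gauss(4, 1), rng.gauss(4, 1)), 1) for _ in range(100)]
rng.shuffle(data)

# 60% training, 20% validation, 20% test.
train, validation, test = data[:120], data[120:160], data[160:]

# Choose K using the validation set only.
candidates = [1, 3, 5, 7]
best_k = max(candidates, key=lambda k: accuracy(train, validation, k))

# Touch the test set once, at the end, to estimate performance on unseen data.
test_accuracy = accuracy(train, test, best_k)
```

Because K was chosen to do well on the validation set, the validation accuracy is an optimistic estimate; the test set gives the unbiased one.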
Multiclass classification

Some models can handle multiclass classification naturally
 KNN
 Decision trees
 Neural networks

Leverage binary classifiers
 One-vs.-rest (aka one-vs.-all)
 One-vs.-one
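
One-vs.-rest can be sketched as follows: train one binary scorer per class (that class against all others) and predict the class whose scorer is most confident. The centroid-based binary scorer here is a toy stand-in chosen to keep the example short; any binary classifier that outputs a score could take its place. For k classes, one-vs.-rest needs k binary classifiers, while one-vs.-one needs one per pair of classes, k(k-1)/2 in total.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def fit_binary(positives, negatives):
    """Toy binary scorer: higher when a point is closer to the positive centroid."""
    pos_c, neg_c = centroid(positives), centroid(negatives)
    return lambda p: math.dist(p, neg_c) - math.dist(p, pos_c)

def fit_one_vs_rest(data):
    """data maps each label to its points; returns one binary scorer per label."""
    scorers = {}
    for label, points in data.items():
        rest = [p for other, pts in data.items() if other != label for p in pts]
        scorers[label] = fit_binary(points, rest)
    return scorers

def predict(scorers, point):
    """Pick the label whose one-vs.-rest scorer gives the highest score."""
    return max(scorers, key=lambda label: scorers[label](point))

# Hypothetical three-class data set.
data = {
    "a": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "b": [(10.0, 0.0), (9.0, 0.0), (10.0, 1.0)],
    "c": [(0.0, 10.0), (1.0, 10.0), (0.0, 9.0)],
}
scorers = fit_one_vs_rest(data)               # 3 binary scorers
one_vs_one_pairs = list(combinations(data, 2))  # 3 * 2 / 2 = 3 pairs
```

For example, predict(scorers, (0.5, 0.5)) returns "a" on this data.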
Model Selection

Bias vs variance
 The test error comes from

Bias: the error from the difference between the true model and the learned model

Variance: the error from sensitivity to small fluctuations in the training data

Noise: the error inherent in the data itself
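
For squared-error loss, these three sources add up exactly. The notation below is an assumption of this note, not taken from the slides: f is the true function, \hat{f} is the model learned from a random training set, and the expectation is over training sets and noise.

```latex
y = f(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \operatorname{Var}(\varepsilon) = \sigma^2

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

The noise term does not depend on the model, so it sets a floor on the test error that no choice of model can remove.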


Model Complexity vs error

Training size vs error

Data size vs model complexity

Model selection summary
 Domain knowledge is important
 If you know the relationship between x and y is linear, why choose a quadratic model?

Model complexity is important
 Applying simple models to massive data tends to underfit
 Select an algorithm that is complex enough to fit the training data well

Data size is important

Applying complex models to small data tends to overfit

Collect data that is large enough to prevent overfitting
