
Data Preprocessing

  • Transforming features

    • Min-max scaling: linearly scale each feature to the range [-1,1] or [0,1] (see the sketch after this list)
    • Standardize: scale each feature to N(0,1)
    • Robust scaling: scale each feature using statistics that are robust to outliers (e.g., the median and interquartile range)
    • Thresholding
    • Applying log or exponential function
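
A concrete sketch of these options with scikit-learn (the toy data is made up for illustration):

```python
# A sketch of the transformations above on toy data; all of these
# classes live in sklearn.preprocessing.
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   RobustScaler, Binarizer)

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 10000.0]])  # the second feature contains an outlier

print(MinMaxScaler(feature_range=(0, 1)).fit_transform(X))  # linear map to [0, 1]
print(StandardScaler().fit_transform(X))                    # zero mean, unit variance
print(RobustScaler().fit_transform(X))                      # median/IQR, resists the outlier
print(Binarizer(threshold=2.0).fit_transform(X))            # thresholding to 0/1
```
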
  • Why scaling?

    • For KNN, scaling prevents the distance computation from being dominated by a few large-scale features (see the sketch below)

    • For linear regression, logistic regression, SVM, scaling helps the optimization

    • For decision trees and random forests, scaling makes little difference, since splits depend only on the ordering of feature values
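
A toy illustration of the KNN point, with two hypothetical features on very different scales:

```python
# Toy KNN distance example: height (m) and weight (kg) on raw scales,
# then standardized. The feature meanings are hypothetical.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.5, 60.0],    # query point
              [1.9, 61.0],    # very different height, similar weight
              [1.5, 75.0]])   # same height, different weight

dist = lambda u, v: np.linalg.norm(u - v)
# Raw units: weight dominates, so the large height difference barely registers.
print(dist(X[0], X[1]), dist(X[0], X[2]))     # ~1.08 vs 15.0

Xs = StandardScaler().fit_transform(X)
# After standardizing, both features contribute on the same footing.
print(dist(Xs[0], Xs[1]), dist(Xs[0], Xs[2]))  # ~2.13 vs ~2.19
```
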

  • Why log or exponential?

    • Positively skewed —> log (see the sketch below)
      • The log compresses the long right tail: values that were small stay small, and values that were large stay large but far less extreme, so the order is preserved while the skew shrinks.
    • Negatively skewed —> exponential
      • The exponential works the same way in the other direction, stretching the right side while preserving the order of values.

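A minimal sketch of the skew transforms on made-up data:

```python
# log1p on positively skewed toy data: the long right tail is compressed,
# but the ordering of values is preserved.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 1000.0])
print(np.log1p(x))   # [0.69 1.10 1.39 1.79 6.91] -- order kept, tail pulled in

# For negatively skewed data, np.exp (or a power) stretches values in the
# same order-preserving way, in the other direction.
```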

  • Missing values

    • If the type of the feature is categorical
      • Replace the missing values with the most frequent value
    • If the type of the feature is numerical

      • Replace the missing values with the mean or median (see the sketch after this list)
    • Some more advanced techniques

      • Predict the missing values from the other features, e.g., by interpolation or extrapolation
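
A minimal sketch of the basic fill-in strategies, using scikit-learn's SimpleImputer on toy columns:

```python
# Mode imputation for a categorical column, median for a numerical one.
import numpy as np
from sklearn.impute import SimpleImputer

X_num = np.array([[1.0], [np.nan], [5.0]])
print(SimpleImputer(strategy="median").fit_transform(X_num))      # fills with 3.0

X_cat = np.array([["red"], ["red"], ["blue"], [None]], dtype=object)
print(SimpleImputer(missing_values=None,
                    strategy="most_frequent").fit_transform(X_cat))  # fills "red"
```
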
  • Imbalanced classification

    • Most classification algorithms perform optimally when the number of samples in each class is similar

    • Common techniques for imbalanced dataset
      • Under-sample the majority class
      • Over-sample the minority class
      • A combination of both under-sampling and over-sampling
    • Some more advanced techniques
      • “Generate” synthetic instances of the minority class, e.g., SMOTE (see the sketch after this list)
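
A minimal resampling sketch using plain random over-sampling via sklearn.utils.resample; the class sizes are invented, and SMOTE-style generation would need a separate library such as imbalanced-learn:

```python
# Over-sample the minority class up to the majority size, with replacement.
# Passing replace=False with a smaller n_samples would under-sample instead.
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X_maj = rng.randn(100, 2)          # 100 majority-class samples (label 0)
X_min = rng.randn(10, 2) + 3.0     # 10 minority-class samples (label 1)

X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(X_bal.shape, np.bincount(y_bal))   # (200, 2) [100 100]
```
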
  • Data Preprocessing Summary

    • Consider transforming features

    • Fill missing values

    • Consider generating synthetic instances

Hyper-parameters

  • Selecting hyper-parameters

    • Examples of hyper-parameters

      • K in KNN
      • C in regularized linear classifiers (e.g., SVM, logistic regression)
      • step size in gradient descent
    • Strategies

      • Grid search
        • A fancy name for exhaustive search
        • Manually specify a subset of the hyper-parameter space and the step sizes
        • Try all parameter combinations and check their performance (see the sketch after this list)
      • Random search
        • Surprisingly good performance
          • Among the tunable hyper-parameters, only a few are typically important
      • Bayesian optimization
        • Iteratively picking hyper-parameters for experiments
        • Picking strategy: tradeoff between
          • Exploration (hyper-parameters for which the outcome is most uncertain), and

          • Exploitation (hyper-parameters which are expected to have a good outcome)
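A short sketch of grid search versus random search in scikit-learn; the SVM and its parameter ranges are arbitrary choices for illustration. (Bayesian optimization is not in scikit-learn itself; libraries such as Optuna or scikit-optimize provide it.)

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively try every combination on the grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Random search: sample 20 configurations from continuous distributions.
rand = RandomizedSearchCV(SVC(), {"C": loguniform(1e-2, 1e2),
                                  "gamma": loguniform(1e-3, 1e1)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_)
```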

  • Random search is surprisingly good


  • Validating selected hyper-parameters (see the sketch below)

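A minimal sketch of the validation protocol, assuming an arbitrary dataset and candidate values of K:

```python
# Tune K on a validation set, then report once on a held-out test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trval, y_trval,
                                                  test_size=0.25, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:   # model selection happens on the validation set
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

final = KNeighborsClassifier(n_neighbors=best_k).fit(X_trval, y_trval)
print(best_k, final.score(X_test, y_test))   # the test set is used exactly once
```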

  • Summary of hyper-parameter tuning

    • Grid search, random search, and Bayesian optimization
    • Training, validation, and test datasets

Multi-class classification

  • Some models can handle multi-class classification naturally

    • KNN
    • Decision trees
    • Neural networks
  • Others leverage binary classifiers, via one-vs-rest or one-vs-one reductions (see the sketch below)
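A minimal sketch of the one-vs-rest reduction in scikit-learn (the base model is an arbitrary choice):

```python
# One-vs-rest: fit one binary classifier per class; at prediction time,
# pick the class whose classifier is most confident.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                 # 3 classes
clf = OneVsRestClassifier(LinearSVC()).fit(X, y)  # trains 3 binary SVMs
print(len(clf.estimators_), clf.score(X, y))
# sklearn.multiclass.OneVsOneClassifier implements one-vs-one the same way.
```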

Model Selection

  • Bias vs variance

    • The test error comes from
      • Bias: the error from the difference between the true model and the learning model

      • Variance: the error from sensitivity to small fluctuations in the training data

      • Noise: the irreducible error from the data itself


  • Model complexity vs error (see the sketch below)

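A rough sketch of that curve on synthetic data, using polynomial degree as the complexity knob:

```python
# Training error keeps falling as the degree grows, while cross-validated
# error eventually rises again: overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=50)

for degree in [1, 3, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    train_err = 1 - model.fit(X, y).score(X, y)              # 1 - R^2 on train
    val_err = 1 - cross_val_score(model, X, y, cv=5).mean()  # 1 - R^2 on held-out folds
    print(degree, round(train_err, 3), round(val_err, 3))
```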

  • Training size vs error (see the sketch below)

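A sketch of the training-size effect with scikit-learn's learning_curve; the dataset and model are arbitrary choices:

```python
# Learning curve: as the training set grows, training and validation
# scores converge.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(n, round(tr, 3), round(va, 3))
```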

  • Data size vs model complexity

    • Larger datasets can support more complex models without overfitting; small datasets call for simpler models

  • Model selection summary

    • Domain knowledge is important
      • if you know the relationship between x and y is linear, why choose a quadratic model?
    • Model complexity is important

      • Applying simple models to massive data tends to underfit
      • Select an algorithm that is complex enough to fit the training data well
    • Data size is important

      • Applying complex models to small data tends to overfit

      • Collect data that is large enough to prevent overfitting