
Data Preprocessing

  • Transforming features

    • Min-max scaling: linearly scale each feature to the range [-1,1] or [0,1]
    • Standardize: scale each feature to N(0,1)
    • Robust scaling: scale each feature using statistics that are robust to outliers (e.g., median and interquartile range)
    • Thresholding
    • Applying a log or exponential function (a quick sketch of these transforms follows this list)
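
A minimal sketch of these transforms using scikit-learn; the toy matrix is made up for illustration, with an outlier in the second feature to show what robust scaling buys:

```python
import numpy as np
from sklearn.preprocessing import (Binarizer, MinMaxScaler,
                                   RobustScaler, StandardScaler)

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 10000.0]])  # second feature contains an outlier

print(MinMaxScaler().fit_transform(X))            # each feature -> [0, 1]
print(StandardScaler().fit_transform(X))          # each feature -> mean 0, std 1
print(RobustScaler().fit_transform(X))            # median/IQR, robust to the outlier
print(Binarizer(threshold=2.0).fit_transform(X))  # thresholding to 0/1
```
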
  • Why scaling?

    • For KNN, scaling prevents the distance computations from being dominated by a few large-scale features

    • For linear regression, logistic regression, SVM, scaling helps the optimization

    • For decision trees and random forests, scaling makes little difference, since splits depend only on the ordering of feature values

  • Why log or exponential?

    • Positively skewed → log
      • Log is monotonic: values that were small stay small and values that were large stay large, but the long right tail is compressed.
    • Negatively skewed → exponential
      • Likewise monotonic: the ordering of values is preserved while the bunched-up left tail is stretched out (see the sketch below).
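
A small numpy sketch of why the direction of the skew picks the transform (the arrays are made up for illustration):

```python
import numpy as np

# Positively skewed: a long right tail. Log compresses the tail
# while preserving the order of the values.
pos_skew = np.array([1., 2., 3., 10., 100., 1000.])
print(np.log(pos_skew))   # [0.    0.69  1.1   2.3   4.61  6.91]

# Negatively skewed: a long left tail. Exponentiating spreads the
# bunched-up large values apart, again preserving the order.
neg_skew = np.array([0.1, 3.0, 3.5, 3.8, 3.9, 4.0])
print(np.exp(neg_skew))   # [ 1.11 20.09 33.12 44.7  49.4  54.6 ]
```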


  • Missing values

    • If the type of the feature is categorical
      • Replace the missing values with the most frequent value
    • If the type of the feature is numerical

      • Replace the missing values with the mean or median (both simple strategies are sketched below)
    • Some more advanced techniques

      • Predict the missing values, usually by interpolation and extrapolation
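
A sketch of the two simple strategies with scikit-learn's SimpleImputer; the toy columns are made up, and missing entries are encoded as np.nan:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numerical feature: fill missing values with the mean (or strategy="median").
X_num = np.array([[1.0], [2.0], [np.nan], [4.0]])
print(SimpleImputer(strategy="mean").fit_transform(X_num))

# Categorical feature: fill missing values with the most frequent value.
X_cat = np.array([["red"], ["red"], [np.nan], ["blue"]], dtype=object)
print(SimpleImputer(strategy="most_frequent").fit_transform(X_cat))
```
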
  • Imbalanced classification

    • Most classification algorithms perform optimally when the number of samples in each class is similar

    • Common techniques for imbalanced dataset
      • Under-sample the majority class
      • Over-sample the minority class
      • A combination of both under-sampling and over-sampling
    • Some more advanced techniques
      • “Generate” synthetic instances from the minority class, e.g., with SMOTE (see the sketch below)
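
A minimal sketch of random over-sampling with sklearn.utils.resample; the data is synthetic, and generating simulated minority instances (e.g., SMOTE from the imbalanced-learn package) would replace the resample step:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance

# Over-sample the minority class (with replacement) up to the majority size.
X_min_up, y_min_up = resample(X[y == 1], y[y == 1], replace=True,
                              n_samples=90, random_state=0)

X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.hstack([y[y == 0], y_min_up])
print(np.bincount(y_bal))  # [90 90]
```

Under-sampling is the mirror image: draw 10 majority samples without replacement instead.
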
  • Data Preprocessing Summary

    • Consider transforming features

    • Fill missing values

    • Consider generating synthetic instances


  • Selecting hyper-parameters

    • Example of hyper-parameters

      • K in KNN
      • C in regularized linear classifiers (e.g., logistic regression, SVM)
      • step size in gradient descent
    • Strategies

      • Grid search
        • A fancy name for exhaustive search
        • Manually specify a subset of the hyper-parameter space and a step size
        • Try all parameter combinations and check their performance
      • Random search
        • Surprisingly good performance
          • Among the tunable hyper-parameters, only a few are important
      • Bayesian optimization
        • Iteratively picking hyper-parameters for experiments
        • Picking strategy: tradeoff between
          • Exploration (hyper-parameters for which the outcome is most uncertain), and

          • Exploitation (hyper-parameters which are expected to have a good outcome)

  • Random search is surprisingly good in practice (see the sketch below)
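
A sketch of both search strategies with scikit-learn, tuning C for a linear SVM; the dataset and search ranges are illustrative:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: exhaustively try every value on a manually chosen grid.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random search: sample parameter settings from a distribution over the space.
rand = RandomizedSearchCV(SVC(kernel="linear"), {"C": loguniform(1e-3, 1e3)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```

Bayesian optimization is not built into scikit-learn; libraries such as Optuna, Hyperopt, and scikit-optimize implement the explore/exploit picking strategy described above.
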


  • Validating selected hyper-parameters


  • Summary of hyper-parameter tuning

    • Grid search, random search, and Bayesian optimization
    • Training, validation, and test datasets (a sketch of this protocol follows)
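
A minimal sketch of that protocol: hold out a validation set to pick the hyper-parameter, and touch the test set only once at the end (the 60/20/20 split and the K grid are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)

# Select K on the validation set only.
best_k = max([1, 3, 5, 7, 9],
             key=lambda k: KNeighborsClassifier(n_neighbors=k)
                           .fit(X_train, y_train).score(X_val, y_val))

# Report final performance once, on the untouched test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final.score(X_test, y_test))
```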

Multi-class classification

  • Some models can handle multi-class classification naturally

    • KNN
    • Decision trees
    • Neural networks
  • Leverage binary classifiers, e.g., with one-vs-rest or one-vs-one strategies (see the sketch below)
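
A sketch of both reduction strategies with scikit-learn, wrapping a binary logistic regression (the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One-vs-rest: one binary classifier per class (class vs. everything else).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: one binary classifier per pair of classes; predictions vote.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovr.score(X, y), ovo.score(X, y))
```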

Model Selection

  • Bias vs variance

    • The test error comes from
      • Bias: the error from the difference between the true model and the learned model

      • Variance: the error from sensitivity to small fluctuations in the training data

      • Noise: the error from the data per se.
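
For squared error these three terms add up exactly; the standard decomposition of the expected test error at a point x, with true function f, learned predictor f̂, and noise variance σ², is:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```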


  • Model complexity vs. error

    • Training error decreases monotonically as complexity grows, while test error first falls and then rises once the model overfits


  • Training size vs. error

    • With more training data, training and test error converge; a persistent gap between them signals high variance



  • Data size vs. model complexity

    • Larger datasets can support more complex models without overfitting (both curves are sketched below)
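
Both curves are easy to trace empirically; a sketch with scikit-learn's validation_curve (score vs. complexity, using tree depth as the complexity knob) and learning_curve (score vs. training size), on an illustrative dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Accuracy vs. model complexity (tree depth).
train_sc, val_sc = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=[1, 2, 4, 8, 16], cv=5)
print(train_sc.mean(axis=1), val_sc.mean(axis=1))

# Accuracy vs. training-set size, at a fixed complexity.
sizes, train_sc, val_sc = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
print(sizes, train_sc.mean(axis=1), val_sc.mean(axis=1))
```
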


  • Model selection summary

    • Domain knowledge is important
      • If you know the relationship between x and y is linear, why choose a quadratic model?
    • Model complexity is important

      • Applying simple models to massive data tends to underfit
      • Select an algorithm that is complex enough to fit the training data well
    • Data size is important

      • Applying complex models to small data tends to overfit

      • Collect data that is large enough to prevent overfitting