1 minute read

Tags: ,

資料轉換

  • 為什麼要做資料轉換?
  • 常見的資料轉換方式
  • 對數轉換(Log Transformation)
  • Box-Cox Transformation
  • 標準化(Standardization)
  • 要使用哪㇐種資料轉換方式?

### 為什麼要做資料轉換?

  • to make it more closely the assumptions of a statistical inference procedure,
  • to make it easier to visualize (appearance of graphs),
  • to improve interpretability,
  • to make descriptors that have been measured in different units comparable,
  • to make the relationships among variables linear,
  • to modify the weightsof the variables or objects (e.g. give the same length (or norm) to all object vectors)
  • to codecategorical variables into dummy binary variables

對數轉換

資料數值都必須是正的

   # 自己練習一個
   numbers = np.arange(50)
   plt.scatter(numbers, np.log(numbers))
   plt.show()

對數轉換: How to handle Negative Data Values?

  • Solution 1 : Translate, then Transform
    • log(x + min(x))
     #logx <- function(x){
     #  log(x +1 - min(x))
     #}
     def logx(x):
         a = np.log(x + 1 - min(x))
         return a
          
     #x <- runif(80, min = -5 , max = 5)
     x = np.random.uniform(-5, 5, 80)
          
     #x <- c(x, rnorm(20, mean = 20, sd = 10))
     x = np.concatenate((x, np.random.normal(loc = 20.0, scale = 10.0, size = 20)))
          
     #par(mfrow = c(1,3))
     #hist(x, main = "x~runif")
     plt.hist(x)
     plt.show()
          
     #plot(x, logx(x), main = "x vs logx")
     plt.scatter(x, logx(x))
     plt.show()
          
     #hist(logx(x), main = "logx")
     plt.hist(logx(x))
     plt.show()
    
  • Solution 2 : Missing Values

    • A criticismof the previous method is that some practicing statisticians don’t like to add an arbitrary constant to the data.

    • They argue that a better wayto handle negative values is to use missing values for the logarithm of a nonpositivenumber.

Box-Cox Transformations

Imgur

Standardization

Imgur

標準化指令: python 透過 sklearn

  • API
  • sklearn.preprocessing.scale
 import pandas as pd
 from sklearn.preprocessing import scale
 
 cellraw = pd.read_csv('./Data/trad_alpha103.txt', header = 0, index_col = 0, sep = '\t')
 cellxdata = scale(cellraw.iloc[:, 1:19], axis = 1)

重抽法則

資料不平衡

無母樹統計

平滑技巧

探索是資料分析

  • Coursera: Exploratory Data Analysis

幹! 好多