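Course materials and information are from Professor 吳漢銘.
- Descriptive statistics
- Probability distributions
- Parameter estimation
- Hypothesis testing
- Nonparametric statistics
- Smoothing methods
- Missing value handling
- Data transformation
- Resampling methods
- Exploratory data analysis
Learning with R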
- The Merriam-Webster dictionary defines statistics as “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.”
Machine learning refers to algorithms that can learn from data without relying on rules-based programming.
Statistical Modelling is the formalization of relationships between variables in the form of mathematical equations.
Data Science
Types of Data Scales
Nominal (categorical, discrete): gender, ethnicity, religion, mode of transport, music genre, ... (qualitative).
Ordinal: proficiency level, level of agreement, satisfaction level, education level.
Interval: distances between values are meaningful, but the zero point is not (e.g., temperature in Fahrenheit: 80 degrees is not twice as hot as 40 degrees).
Ratio (continuous data): distances are meaningful and the zero point is meaningful: annual income, years of service, height, ... (quantitative).
Describing data: central tendency and dispersion
- Central tendency:
- average
- mode
- median
- Dispersion:
- Quartile
- range
- interquartile range, IQR
- percentile
- Standard deviation
- variance
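The measures listed above can be computed in a few lines. The notes work in R; the sketch below uses Python's standard `statistics` module purely for illustration, and the sample values are hypothetical. Quartile definitions differ between tools (Python's default here is the "exclusive" method, which is not R's default), so the quartiles may not match R output exactly.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical sample

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
q1, q2, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
data_range = max(data) - min(data)
iqr = q3 - q1                                 # interquartile range
variance = statistics.variance(data)          # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)              # sample standard deviation

print(mean, median, mode, data_range, iqr, variance, std_dev)
```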
Describing data: skewness
Describing data: coefficient of kurtosis
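As a rough illustration, both coefficients can be written in terms of central moments. This is a minimal Python sketch using the simple moment-based definitions (dividing by n); statistical packages often apply small-sample corrections, so their output can differ slightly. The data sets are hypothetical.

```python
def central_moment(xs, order):
    """Average of (x - mean)^order over the data (divides by n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** order for x in xs) / len(xs)

def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2). Zero for symmetric data."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Moment-based kurtosis: m4 / m2^2. A normal distribution has value 3."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # long tail to the right

print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```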
Descriptive statistics: Topic 2
- Distance and similarity measures
- Correlation coefficients: Pearson’s r, Spearman’s rho, Kendall’s tau
- The high-dimension, low-sample-size (HDLSS) problem
Distance and Similarity Measure
Correlation coefficients (Pearson, Kendall, Spearman)
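To make the difference between these measures concrete, here is a small Python sketch written from the textbook definitions. The helper names (`pearson`, `ranks`, `spearman`, `kendall_tau`) and the data are made up for illustration, and the Kendall version is tau-a, which assumes no ties. Pearson measures linear association, while Spearman and Kendall only depend on the ordering.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs."""
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotone in x, but not linear

euclidean = math.dist(x, y)                        # Euclidean distance
manhattan = sum(abs(a - b) for a, b in zip(x, y))  # Manhattan distance
print(pearson(x, y), spearman(x, y), kendall_tau(x, y), euclidean, manhattan)
```

Because y is a monotone function of x, the two rank-based coefficients equal 1 while Pearson's coefficient stays below 1.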
High-dimensional data (HDD)
- Three types of high-dimensional data:
- p is large but smaller than n ;
- p is large and larger than n : the high-dimension low sample size data (HDLSS); and
- the data are functions of a continuous variable d : the functional data.
- In high dimensions, the space becomes emptier as the dimension increases: the rank r of the sample covariance matrix S satisfies r ≤ min{p, n}, so when p > n, S cannot be of full rank.
HDLSS (High Dimension, Low Sample Size)
- Examples:
- Face recognition (images): there are many thousands of variables (pixels), while the number of training samples defining a class (person) is usually small (often fewer than 10).
- Microarray experiments: it is unusual for there to be more than 50 repeats (data points) for several thousand variables (genes).
- The covariance matrix will be singular, and therefore cannot be inverted. In these cases we need to find some method of estimating a full rank covariance matrix to calculate an inverse.
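The singularity can be checked directly on a toy example. The sketch below is only an illustration in plain Python: it builds the sample covariance matrix for hypothetical data with n = 3 observations and p = 5 variables and computes its rank exactly with fractions. The helper names are invented for this example.

```python
from fractions import Fraction

def covariance_matrix(rows):
    """Sample covariance (p x p) of n observations with p variables each."""
    n, p = len(rows), len(rows[0])
    means = [Fraction(sum(r[j] for r in rows), n) for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def matrix_rank(matrix):
    """Rank by Gaussian elimination; exact because entries are Fractions."""
    m = [row[:] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical HDLSS data: n = 3 observations (rows), p = 5 variables (columns).
data = [
    [1, 2, 0, 4, 3],
    [2, 0, 1, 1, 5],
    [0, 3, 4, 2, 1],
]
S = covariance_matrix(data)
rank_S = matrix_rank(S)
print(len(S), rank_S)  # a 5 x 5 matrix whose rank is far below 5
```

Since the rank is below p, the 5 x 5 matrix S has no inverse, which is why a regularized or shrinkage estimate is needed in this setting.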
Penalized / Regularized / Shrinkage
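The professor told us to Google this one ourselves~ XD
Main takeaway: when the data are HDLSS, check whether the covariance estimate and the machine learning method being used need to be adjusted.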
A random experiment is a process by which we observe something uncertain. After the experiment, the result of the random experiment is known.
Outcome: an outcome is a result of a random experiment.
Sample space S: the set of all possible outcomes.
- Example 1: when tossing two coins, with heads (H) and tails (T), the sample space is S = {HH, HT, TH, TT}.
Event E: an event is a subset of the sample space.
- Example 2: In the context of an experiment, we may define the sample space of observing a person as S = {sick, healthy, dead}. The following are all events: {sick}, {healthy}, {dead}, {sick, healthy}, {sick, dead}, {healthy, dead}, {sick, healthy, dead}, and the empty set ∅.
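The two examples can be enumerated mechanically. This is a small Python sketch (the helper `all_events` is a made-up name): the sample space for two coins is the set of all H/T pairs, and the events of a finite sample space are all of its subsets, including the empty set.

```python
from itertools import chain, combinations, product

# Sample space for tossing two coins: every ordered pair of H and T.
coin_space = {"".join(pair) for pair in product("HT", repeat=2)}

def all_events(sample_space):
    """Every subset of a finite sample space, from the empty set up to S itself."""
    items = sorted(sample_space)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1)
    )
    return [set(s) for s in subsets]

health_events = all_events({"sick", "healthy", "dead"})
print(sorted(coin_space))
print(len(health_events))  # 2 ** 3 subsets
```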
Trial: a single performance of an experiment whose outcome is in S.
- Example 3: in a random experiment of tossing 4 coins, each toss of a coin is one trial.
Probability: the probability of event E, P(E), is the value approached by the relative frequency of occurrences of E in a long series of replications of a random experiment (the frequentist view).
Random variable: a function that assigns a real number to each outcome in the sample space.
Probability distribution: a mathematical function that describes how likely each possible result of a random experiment (i.e., each element of the sample space) is to occur.
- Example: let the random variable X denote the result of tossing a fair coin, with X = 1 for heads and X = 0 for tails. Then the probability distribution of X is P(X = 1) = 0.5, P(X = 0) = 0.5.
Probability Mass Function
In probability and statistics, a probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.
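As an illustration of a PMF, the number of heads in 4 tosses of a fair coin follows a binomial distribution. The Python sketch below (function name chosen for this example) evaluates P(X = k) from the standard formula C(n, k) p^k (1 − p)^(n − k).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Number of heads in 4 tosses of a fair coin.
pmf = [binomial_pmf(k, n=4, p=0.5) for k in range(5)]
print(pmf)       # one probability per possible value 0, 1, 2, 3, 4
print(sum(pmf))  # the probabilities of all values add up to 1
```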
Probability Density Function
In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.
Probability Distribution
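I give up! Reading, writing notes, and doing the hands-on work all at once takes too much time!!! Hands-on practice comes first from now on! Bye~ XDDD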
- Normal distribution: for a single real-valued quantity that grows linearly. X ~ N(μ, σ^2)
- Log-normal distribution: for a single positive real-valued quantity that grows exponentially. log(X) ~ N(μ, σ^2)
- Discrete uniform distribution: for a finite set of values (e.g., the outcome of a fair die). X ~ Unif({a, b})
- Binomial distribution: for the number of “positive occurrences” (e.g., yes votes, etc.) given a fixed total number of independent occurrences. X ~ B(n, p)
- Negative binomial distribution: for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes (r) occurs. X ~ NB(r, p)
- Chi-squared distribution: the distribution of a sum of squared standard normal variables; useful, e.g., for inference regarding the sample variance of normally distributed samples. X ~ χ^2(d)
CDF (cumulative distribution function)
Quantiles (q)
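The CDF and the quantile function are inverses of each other, which a quick check makes visible. The notes practice this in R (where the corresponding functions for the normal distribution are `pnorm`, `qnorm`, and `dnorm`); the sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

p_below = z.cdf(1.96)          # CDF: P(Z <= 1.96), roughly 0.975
q_975 = z.inv_cdf(0.975)       # quantile: the value with 97.5% of the mass below it
density_at_0 = z.pdf(0)        # PDF: a density value, not a probability

# Applying the quantile function to a CDF value recovers the original point.
round_trip = z.inv_cdf(z.cdf(0.5))
print(p_below, q_975, density_at_0, round_trip)
```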
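Very important!!! wiki
- Common distributions (binomial distribution, normal distribution)
- Practice the code directly here: GitHub
Visualization~ a really great website: https://seeing-theory.brown.edu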
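- The Law of Large Numbers (LLN)
- When random samples are drawn from a population with finite mean μ, the sample mean gets closer to the population mean μ as the sample size n increases.
- Central Limit Theorem (CLT)
- When simple random samples of size n are drawn from a population with mean μ and standard deviation σ, the sampling distribution of the sample mean is approximately normal, with mean μ and standard deviation σ/√n, when n is large enough.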
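Both theorems are easy to see in a simulation. The following Python sketch uses rolls of a fair die (population mean 3.5) as a hypothetical population; the seed, sample sizes, and number of repetitions are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is reproducible

def sample_mean(n):
    """Mean of n rolls of a fair die (population mean 3.5)."""
    return statistics.fmean(rng.randint(1, 6) for _ in range(n))

# LLN: the sample mean settles near 3.5 as n grows.
small = sample_mean(10)
large = sample_mean(100_000)

# CLT: means of many samples of size n cluster around 3.5
# with spread close to sigma / sqrt(n).
n = 30
means = [sample_mean(n) for _ in range(5_000)]
sigma = statistics.pstdev(range(1, 7))  # population sd of a fair die
clt_sd = statistics.stdev(means)

print(small, large)
print(statistics.fmean(means), clt_sd, sigma / n ** 0.5)
```

A histogram of `means` would look roughly bell-shaped even though a single die roll is uniformly distributed.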
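Parameter estimation
Use sample statistics and their sampling distributions to estimate population parameters, in order to understand the characteristics of the population.
- Point estimation (method of moments, maximum likelihood, least squares)
- Evaluation criteria: unbiasedness, efficiency, consistency, minimum-variance unbiasedness, sufficiency
- Interval estimation
Common probability distributions: hands-on practice
code: GitHub
The Likelihood Function
Maximum Likelihood Estimation (MLE)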
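A minimal sketch of the idea, in Python, for the simplest case: estimating the heads probability p of a coin from hypothetical toss data. The log-likelihood is evaluated on a grid of candidate values and the maximizer is compared with the known closed-form answer for this model (the sample proportion). The grid search is only for illustration; real problems usually use a numerical optimizer.

```python
import math

tosses = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical data: 1 = heads, 0 = tails

def log_likelihood(p, data):
    """Log-likelihood of heads probability p for independent coin tosses."""
    heads = sum(data)
    tails = len(data) - heads
    return heads * math.log(p) + tails * math.log(1 - p)

# Evaluate the log-likelihood on a grid of candidate values of p in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat_grid = max(grid, key=lambda p: log_likelihood(p, tosses))

# For this model the maximizer has a closed form: the sample proportion.
p_hat_closed = sum(tosses) / len(tosses)

print(p_hat_grid, p_hat_closed)
```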
Interval Estimation
The likelihood function: hands-on practice
code: GitHub
Bayes’ Theorem
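Bayes' theorem states that P(A | B) = P(B | A) P(A) / P(B). A worked example helps: the Python sketch below uses a diagnostic test with made-up numbers (1% prevalence, 95% sensitivity, 5% false positive rate) and computes the probability of disease given a positive result.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) by Bayes' theorem."""
    # P(positive) = P(pos | disease) P(disease) + P(pos | healthy) P(healthy)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical numbers chosen only for illustration.
p_disease_given_positive = posterior(
    prior=0.01, sensitivity=0.95, false_positive_rate=0.05
)
print(p_disease_given_positive)
```

Even with an accurate test, the posterior stays well below 50% here because the disease is rare in this example.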
Hypothesis testing & analysis of variance
Hypothesis Testing
Type I & Type II Errors
P Values
T test
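As a sketch of what a one-sample t test computes, the Python code below calculates the t statistic and degrees of freedom by hand for hypothetical measurements and compares the statistic with the two-sided 5% critical value for 6 degrees of freedom (2.447, taken from a standard t table). In R, `t.test` reports the statistic and the p-value directly; the Python standard library has no t distribution, so only the statistic is computed here.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t statistic and degrees of freedom for H0: population mean = mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.fmean(sample) - mu0) / standard_error
    return t, n - 1

sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2, 5.5]  # hypothetical measurements
t_stat, df = one_sample_t(sample, mu0=5.0)

critical_value = 2.447  # two-sided 5% critical value of t with 6 df (from a table)
reject_h0 = abs(t_stat) > critical_value
print(t_stat, df, reject_h0)
```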
Exercise 4-1
One-way ANOVA
- use when the sample sizes are unequal
- use when the sample sizes are equal but small
Exercise 4-2
- Test for Normality
- Chi-square Test
Formal Tests for Normality
- Histogram
- Q-Q Plot
- If the data are normally distributed, the quantiles should fall on the 45-degree line.
Which Normality Test Should I Use?
- Kolmogorov-Smirnov test:
- It is more sensitive near the center of the distribution than at the tails;
- For data sets with n > 50.
- The Anderson-Darling test:
- A-D test is a modification of the K-S test and gives more weight to the tails of the density than does the K-S test.
- It is generally preferable to the K-S test.
- Shapiro-Wilk test:
- Doesn’t work well if several values in the data set are the same.
- Works best for data sets with n < 50, but can be used with larger data sets.
- W/S test (range(x)/sd(x)):
- simple, but effective.
- Jarque-Bera test (jarque.test{moments}):
- tests for skewness and kurtosis, very effective.
- D’Agostino test (agostino.test{moments}):
- powerful omnibus (skewness, kurtosis, centrality) test.
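To show how a skewness-and-kurtosis test works, here is a hand-written Python sketch of the Jarque-Bera statistic, JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis. This is not the `moments` package implementation, and the data are hypothetical. Under normality JB is approximately chi-squared with 2 degrees of freedom (5% critical value 5.991), but the approximation is asymptotic and unreliable for small samples.

```python
def jarque_bera(xs):
    """Jarque-Bera statistic from moment-based skewness and kurtosis."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

roughly_symmetric = [1, 2, 3, 4, 5]
heavily_skewed = [1] * 20 + [50]  # one extreme value far from the rest

critical_value = 5.991  # 5% critical value of chi-squared with 2 df (from a table)
print(jarque_bera(roughly_symmetric), jarque_bera(heavily_skewed))
```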
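Chi-square test: chisq.test
Test of goodness of fit: tests whether the data follow a given set of proportions or a given probability distribution.
Test of homogeneity: tests whether the proportions are the same across several different groups.
Test of independence: tests whether two categorical variables are independent of each other.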
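For the goodness-of-fit case, the statistic is the sum of (observed − expected)² / expected over the categories. The Python sketch below works through it by hand for hypothetical counts from 120 rolls of a die, testing whether the die is fair; `chisq.test` in R performs the same calculation and also reports the p-value. The critical value 11.070 is the 5% point of the chi-squared distribution with 5 degrees of freedom, taken from a table.

```python
observed = [18, 22, 16, 25, 19, 20]     # hypothetical counts for faces 1 to 6
total = sum(observed)
expected = [total / 6] * 6              # a fair die: equal expected counts

# Chi-square statistic: sum over categories of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

critical_value = 11.070                 # chi-squared, 5 df, 5% level (from a table)
reject_fair_die = chi_sq > critical_value
print(chi_sq, df, reject_fair_die)
```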
Exercise 5-2
- Missing Data
- Missingness Mechanism
- Missing by Design
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)
The missing values may give clues to systematic aspects of the problem.
How to deal with missing values:
Using a global constant to fill in the value can misguide the mining process (e.g., giving a score of 0 for a missed exam; image signal = foreground − background).
Use the attribute mean or median for all samples belonging to the same class as the given tuple.
Missing value imputation (most popular)
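The class-wise mean approach can be sketched in a few lines of Python. The records, field names, and the helper `impute_by_group` are all hypothetical; a missing value is represented by `None` and is replaced by the mean of the observed values in the same group.

```python
import statistics

records = [  # hypothetical data; None marks a missing score
    {"group": "A", "score": 80.0},
    {"group": "A", "score": None},
    {"group": "A", "score": 90.0},
    {"group": "B", "score": 60.0},
    {"group": "B", "score": None},
]

def impute_by_group(rows, key, field):
    """Replace missing values of `field` with the mean of its group.

    Assumes every group has at least one observed value.
    """
    observed = {}
    for row in rows:
        if row[field] is not None:
            observed.setdefault(row[key], []).append(row[field])
    means = {group: statistics.fmean(vals) for group, vals in observed.items()}
    return [
        {**row, field: means[row[key]] if row[field] is None else row[field]}
        for row in rows
    ]

imputed = impute_by_group(records, key="group", field="score")
print([row["score"] for row in imputed])
```

The function returns new records and leaves the original list unchanged, so the imputed and raw data can be compared afterwards.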
Missingness Mechanism
- The presence of missing data can
- affect the properties of the estimates (e.g., means, percentages, percentiles, variances, ratios, regression parameters, etc.).
- affect inferences. (e.g., the properties of tests and confidence intervals. )
- The missingness mechanism (Little and Rubin, 1987)
- The way in which the probability of an item missing depends on other observed or non-observed variables as well as on its own value.
- It is helpful to classify missing values on the basis of the stochastic mechanism that produces them.
Missing by Design / Missing Completely at Random
Missing by Design
- Some participants are excluded from the analysis because they are not part of the population under investigation.
- (i) refused to answer; (ii) answered “don’t know”; (iii) had a valid skip; or (iv) was skipped by an enumerator error.
Missing Completely at Random (MCAR)
- Missingness is independent of both the unobserved values themselves and the observed data.
- Example: miscoding or forgetting to log an answer.
- Imputation methods rely on the missingness being of the MCAR type.
- R Packages for Dealing With Missing Values: VIM, MICE
- Visualizing the Pattern of Missing Data
- Traditional Approaches to Handling Missing Data
- Imputation Methods: KNN
- Which Imputation Method?
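For the KNN item above, the basic idea is to fill a missing entry using the most similar complete observations. The following is a simplified Python sketch under stated assumptions, not the algorithm of any particular R package: distances are Euclidean over the observed columns only, neighbours are taken from complete rows, and the fill value is the plain average of the k neighbours. The data and the function name `knn_impute` are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the column mean of the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    filled = []
    for row in rows:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other):
            # Euclidean distance using only the columns observed in `row`.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled.append([
            v if v is not None
            else sum(nb[j] for nb in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return filled

rows = [  # hypothetical data; the last row has a missing third value
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [5.0, 6.0, 9.0],
    [1.05, 2.0, None],
]
filled = knn_impute(rows, k=2)
print(filled[3])
```

In practice the columns are usually scaled first so that no single variable dominates the distance.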