Probability and Statistics

July 31, 2019 6 minute read

Tags: aiacademy, statistics

Course materials and information come from instructor 吳漢銘.

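Probability and Statistics Course Outline

Descriptive statistics
Probability distributions
Parameter estimation
Hypothesis testing
Nonparametric statistics
Smoothing methods
Missing-value handling
Data transformation
Resampling methods
Exploratory data analysis

Learning with R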

What Is Statistics?

The Merriam-Webster dictionary defines statistics as “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.”

Statistical Models, Data Mining, and Machine Learning

Machine Learning is an algorithm that can learn from data without relying on rules-based programming.
Statistical Modelling is the formalization of relationships between variables in the form of mathematical equations.

Data Science

Type of Data Scales

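Nominal (categorical, discrete): gender, ethnicity, religion, mode of transport, music genre, … (qualitative).
Ordinal: proficiency level, level of agreement, level of satisfaction, education level.
Interval: distances between values are meaningful, but the zero point is not (for example, temperature in degrees Fahrenheit: you cannot say that 80 degrees is twice as hot as 40 degrees).
Ratio (continuous data): distances are meaningful and the zero point is meaningful: annual income, years of service, height, … (quantitative).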

Describing Data: Central Tendency and Dispersion

Central tendency:
- average
- mode
- median
Dispersion:
- Quartile
- range
- interquartile range, IQR
- percentile
- Standard deviation
  - $std$
- variance
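
As a quick illustration of the measures listed above, here is a minimal sketch using only Python's standard `statistics` module. The sample values are made up, and Python is my own choice here (the course itself uses R, whose `quantile()` defaults to a different interpolation rule, so quartiles may not match exactly).

```python
import statistics

# Hypothetical sample, made up purely for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles ("exclusive" method)
iqr = q3 - q1                                 # interquartile range
variance = statistics.variance(data)          # sample variance (divides by n - 1)
std = statistics.stdev(data)                  # sample standard deviation
pop_std = statistics.pstdev(data)             # population version (divides by n)

print(mean, median, mode, data_range, iqr, variance, std, pop_std)
```

Note the difference between `stdev` (sample) and `pstdev` (population); mixing them up is an easy mistake.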

Describing Data: Skewness

A great page for a refresher
A great page for a refresher 2


Describing Data: Coefficient of Kurtosis
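
To make skewness and kurtosis concrete, here is a small sketch based on the moment definitions (third and fourth central moments divided by powers of the second). The function names and the two toy samples are my own; other software may apply small-sample corrections, so values can differ slightly.

```python
def central_moment(xs, k):
    """k-th central moment, dividing by n (population form)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    # > 0: long right tail, < 0: long left tail, 0: symmetric
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    # Kurtosis minus 3, so a normal distribution scores about 0
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]          # made-up symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]   # made-up sample with one large value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```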

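Real-World Examples:

Exploring Taiwan's recent salary landscape with tax big data

Click here to take a look

Salary platform

Click here to take a look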

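Descriptive Statistics: Topic 2

Distance and similarity measures
Correlation coefficients: Pearson's r, Spearman's rho, Kendall's tau
The high-dimension, low-sample-size data problem (HDLSS problem)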

Distance and Similarity Measure

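
Since the original figure is not available here, the following is a minimal sketch of three commonly used measures (Euclidean distance, Manhattan distance, cosine similarity). Which measures the lecture actually covered is an assumption on my part, and the vectors are made up.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]  # hypothetical observations
b = [4.0, 6.0, 3.0]

print(euclidean(a, b), manhattan(a, b), cosine_similarity(a, b))
```

Distances grow with dissimilarity, while cosine similarity is 1 for vectors pointing in the same direction regardless of their length.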

High-dimensional data (HDD)

Three types of high-dimensional data:
- p is large but smaller than n ;
- p is large and larger than n : the high-dimension low sample size data (HDLSS); and
- the data are functions of a continuous variable d : the functional data.
In high dimension, the space becomes emptier as the dimension increases: when p > n , the rank r of the covariance matrix S satisfies r ≤ min{p, n} .

HDLSS(High Dimension, Low Sample Size)

Examples:
- Face recognition (images): we have many thousands of variables (pixels), but the number of training samples defining a class (person) is usually small (usually less than 10).
- Microarray experiments: it is unusual for there to be more than 50 repeats (data points) for several thousand variables (genes).
The covariance matrix will be singular, and therefore cannot be inverted. In these cases we need to find some method of estimating a full rank covariance matrix to calculate an inverse.
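
To see the singularity problem in miniature, here is a sketch with 3 samples and 3 variables (so p is not smaller than n). The data are made up, and the fix shown, shrinking the off-diagonal entries toward zero, is just one simple illustrative shrinkage estimator that I picked; it is not necessarily the method the lecture had in mind.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divides by n - 1) of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def shrink_to_diagonal(s, lam):
    """(1 - lam) * S + lam * diag(S): keep variances, damp covariances."""
    p = len(s)
    return [[s[i][j] if i == j else (1 - lam) * s[i][j] for j in range(p)]
            for i in range(p)]

rows = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0]]  # n = 3, p = 3
S = covariance_matrix(rows)
S_shrunk = shrink_to_diagonal(S, lam=0.2)  # lam chosen arbitrarily

print(det3(S))         # essentially 0: S is singular and cannot be inverted
print(det3(S_shrunk))  # positive: the shrunk estimate is invertible
```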

Penalized / Regularized / Shrinkage

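In high-dimensional settings, these techniques help make high-dimensional data digestible.

The instructor told us to go Google these ourselves~ XD

Main takeaway: when the data you face are HDLSS, pay attention to how the covariance is estimated and to whether the machine learning method needs to be adjusted.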

Probability Distributions:

Common Statistical Terms

A random experiment is a process by which we observe something uncertain. After the experiment, the result of the random experiment is known.
Outcome: An outcome is a result of a random experiment.
Sample space S : the set of all possible outcomes.
- Example 1: tossing two coins, each showing heads (H) or tails (T), gives the sample space S = {HH, HT, TH, TT}.
Event E : an event is a subset of the sample space.
- Example 2: In the context of an experiment, we may define the sample space of observing a person as S = {sick, healthy, dead} . The following are all events: {sick} , {healthy} , {dead} , {sick, healthy} , {sick, dead} , {healthy, dead} , {sick, healthy, dead} , and the empty set {} (none of the above).
Trial: a single performance of an experiment whose outcome is in S .
- Example 3: in the random experiment of tossing 4 coins, each single coin toss is one trial.
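
The examples above can be enumerated mechanically. This is a small sketch using `itertools`; the helper name `all_events` is my own.

```python
from itertools import combinations, product

# Example 1: sample space for tossing two coins
sample_space = ["".join(t) for t in product("HT", repeat=2)]
print(sample_space)  # ['HH', 'HT', 'TH', 'TT']

def all_events(space):
    """Every subset of the sample space, including the empty set."""
    return [set(c) for r in range(len(space) + 1)
            for c in combinations(space, r)]

# Example 2: a 3-outcome sample space has 2**3 = 8 events
health_events = all_events(["sick", "healthy", "dead"])
print(len(health_events))

# An event is just a subset, e.g. "at least one head"
at_least_one_head = {o for o in sample_space if "H" in o}
print(at_least_one_head)
```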

Probability and Random Variables

Probability: the probability of event E, P(E), is the value approached by the relative frequency of occurrences of E in a long series of replications of a random experiment. (The frequentist view)
Random variable: a function that assigns a real number to each outcome in the sample space.
Probability distribution: a mathematical function that describes how likely each possible result of a random experiment (that is, each element of the sample space) is to occur.
- Example: let the random variable X denote the result of tossing one fair coin, with X = 1 for heads and X = 0 for tails. Then the probability distribution of X is: P( X =1) = 0.5, P( X =0) = 0.5.
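
The frequentist view can be checked with a tiny simulation: flip a simulated fair coin many times and watch the relative frequency of heads settle near 0.5. The seed and the numbers of flips are arbitrary choices of mine.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible

def relative_frequency(n_flips):
    """Fraction of heads (X = 1) in n_flips tosses of a fair coin."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips

freqs = {n: relative_frequency(n) for n in (10, 1_000, 100_000)}
for n, f in freqs.items():
    print(n, f)
```

With only 10 flips the frequency can be far from 0.5; with 100,000 it should be very close.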

Probability Mass Function

In probability and statistics, a probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.
PMF:Wiki
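
As a concrete PMF, here is a sketch for the number of heads when tossing 4 fair coins (a binomial distribution with n = 4, p = 0.5). The function name is my own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

pmf = [binomial_pmf(k, 4, 0.5) for k in range(5)]
print(pmf)       # [0.0625, 0.25, 0.375, 0.25, 0.0625]
print(sum(pmf))  # a PMF sums to 1 over all possible values
```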

Probability Density Function (機率密度函數)

In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.
PDF:Wiki

Probability Distribution

I give up! Taking notes while reading and also doing the exercises takes way too much time!!! Hands-on practice comes first from here on! Bye~ XDDD

Applications of Common Probability Distributions

Normal distribution, for a single real-valued quantity that grows linearly (e.g. errors, offsets) (X ~ N(μ, σ^2))
Log-normal distribution, for a single positive real-valued quantity that grows exponentially (e.g. prices, incomes, populations) (log(X) ~ N(μ, σ^2))
Discrete uniform distribution, for a finite set of values (e.g. the outcome of a fair die) (X ~ Unif{a, …, b})
Binomial distribution, for the number of “positive occurrences” (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences. (X ~ B(n, p))
Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures ( k ) before a given number of successes ( r ) occurs. (X ~ NB(r, p))
Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples. (X ~ χ^2(d))

CDF (Cumulative Distribution Function)

Quantiles (q)

Very important!!! wiki
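
The CDF and the quantile function are inverses of each other, which is easiest to see with the standard normal distribution. A minimal sketch using `statistics.NormalDist` from the Python standard library (in R the counterparts would be `pnorm` and `qnorm`):

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)  # standard normal

p = z.cdf(1.96)       # P(Z <= 1.96), about 0.975
q = z.inv_cdf(0.975)  # the 0.975 quantile, about 1.96

print(p, q)
print(z.inv_cdf(z.cdf(0.3)))  # round trip returns (approximately) 0.3
```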

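Common Probability Distributions

Common distributions (binomial distribution, normal distribution)
Approximating binomial probabilities with normal probabilities
Practice code for this part: Github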
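
The linked practice code is not reproduced here, so this is my own minimal sketch of the normal approximation to the binomial, with a continuity correction. The values n = 100, p = 0.5 and the cutoff 55 are arbitrary.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.5

# Exact binomial probability P(X <= 55)
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(0, 56))

# Normal approximation with mean np and sd sqrt(np(1 - p));
# the + 0.5 is the continuity correction for a discrete variable
approx = NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(55 + 0.5)

print(exact, approx)
```

The approximation works best when both np and n(1 - p) are reasonably large.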

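Law of Large Numbers (LLN) and Central Limit Theorem (CLT)

Visualization~ a great website: https://seeing-theory.brown.edu

The Law of Large Numbers (LLN)
- When sampling randomly from a population with finite mean μ, the sample mean gets closer to the population mean μ as the sample size n increases.
Central Limit Theorem (CLT)
- When simple random samples of size n are drawn from a population with mean μ and standard deviation σ, the sampling distribution of the sample mean is approximately normal, with mean μ and standard deviation σ/√n, once n is large enough.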
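
Both statements can be checked by simulation. The sketch below draws from an exponential population with mean 1 and standard deviation 1 (a clearly non-normal choice of mine); the seed, sample sizes, and number of replications are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(0)

def sample_mean(n):
    """Mean of n draws from an Exponential(1) population (mu = 1, sigma = 1)."""
    return fmean([rng.expovariate(1.0) for _ in range(n)])

# LLN: the sample mean approaches mu = 1 as n grows
lln_means = {n: sample_mean(n) for n in (10, 1_000, 100_000)}
print(lln_means)

# CLT: means of many samples of size 50 cluster around mu
# with spread close to sigma / sqrt(n)
clt_means = [sample_mean(50) for _ in range(5_000)]
center = fmean(clt_means)
spread = stdev(clt_means)
theory_spread = 1.0 / sqrt(50)
print(center, spread, theory_spread)
```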

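Parameter Estimation

Use sample statistics and their sampling distributions to estimate population parameters, in order to understand the characteristics of the population.
- Point estimation (method of moments, maximum likelihood, least squares)
  - Evaluation criteria: unbiasedness, efficiency, consistency, minimum-variance unbiasedness, sufficiency
- Interval estimation

Common Probability Distributions: Hands-On Practice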

code: Github

The Likelihood Function

Maximum Likelihood Estimation (MLE)
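
As a small worked case, here is a sketch of maximum likelihood for a Bernoulli parameter p. The data (7 successes in 10 trials) are made up, and the brute-force grid search is only for illustration; for this model the MLE has the closed form "sample proportion".

```python
from math import log

data = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]  # hypothetical: 7 successes, 3 failures

def log_likelihood(p, xs):
    """Log of the likelihood L(p) = product of p^x * (1 - p)^(1 - x)."""
    return sum(log(p) if x == 1 else log(1 - p) for x in xs)

# Evaluate the log-likelihood on a grid of candidate p values
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, data))

print(p_hat)                  # maximizer found on the grid
print(sum(data) / len(data))  # closed-form MLE: the sample proportion
```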

Interval Estimation
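
A minimal sketch of an interval estimate for a population mean, assuming a large-sample normal (z) interval: mean ± z · s/√n. The data are made up; for a small sample like this one a t quantile would be the more appropriate choice, so treat this only as an illustration of the mechanics.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(xs, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    half_width = z * stdev(xs) / sqrt(len(xs))
    m = mean(xs)
    return m - half_width, m + half_width

sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7]  # hypothetical measurements
lower, upper = mean_confidence_interval(sample)
print(lower, upper)
```

A higher confidence level gives a wider interval; a larger sample gives a narrower one.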

The Likelihood Function: Hands-On Practice

code: Github

Bayes' Theorem
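
Bayes' theorem states P(A|B) = P(B|A) P(A) / P(B). Here is a sketch with the classic diagnostic-test setting; the prevalence, sensitivity, and false-positive rate are invented numbers, not from the course.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # P(positive) by the law of total probability
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(p)  # roughly 0.16: most positives are still false positives
```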

Hypothesis Testing & Analysis of Variance

Hypothesis Testing

Type I & Type II Errors


P Values

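
One way to connect p-values with Type I errors is to simulate many experiments in which the null hypothesis is actually true and count how often p < α. The sketch below uses a two-sided z-test with known σ, which is my own simplification; the seed and sizes are arbitrary.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

rng = random.Random(1)
std_normal = NormalDist()

def z_test_p_value(xs, mu0, sigma):
    """Two-sided p-value for H0: mean == mu0, with known sigma."""
    z = (fmean(xs) - mu0) / (sigma / sqrt(len(xs)))
    return 2 * (1 - std_normal.cdf(abs(z)))

alpha = 0.05
# Every sample really comes from N(0, 1), so H0 (mean = 0) is true
p_values = [
    z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(30)], 0.0, 1.0)
    for _ in range(2_000)
]
type_i_rate = sum(p < alpha for p in p_values) / len(p_values)
print(type_i_rate)  # close to alpha: the rate of false rejections
```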

T test

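
A minimal sketch of the one-sample t statistic, t = (x̄ − μ0) / (s / √n) with n − 1 degrees of freedom. The data and μ0 are made up. The Python standard library has no t distribution, so instead of a p-value the sketch compares against the tabulated two-sided 5% critical value for 7 degrees of freedom (about 2.365).

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(xs, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(xs)
    t = (mean(xs) - mu0) / (stdev(xs) / sqrt(n))
    return t, n - 1

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
t_stat, df = one_sample_t(sample, mu0=4)

critical = 2.365  # t table value for df = 7, two-sided alpha = 0.05
reject = abs(t_stat) > critical
print(t_stat, df, reject)
```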

Exercise 4-1

github

One-Way ANOVA

ANOVA Table

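
The entries of a one-way ANOVA table can be computed by hand. This sketch uses three tiny made-up groups chosen so the arithmetic is easy to follow; the function name and dictionary keys are my own.

```python
from statistics import fmean

def one_way_anova(groups):
    """Sums of squares, degrees of freedom, mean squares, and F."""
    all_values = [x for g in groups for x in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "SS_between": ss_between, "df_between": df_between,
        "SS_within": ss_within, "df_within": df_within,
        "MS_between": ms_between, "MS_within": ms_within,
        "F": ms_between / ms_within,
    }

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical measurements
table = one_way_anova(groups)
for name, value in table.items():
    print(name, value)
```

A large F means the variation between group means is large relative to the variation within groups.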

Welch ANOVA

Welch's ANOVA does not assume equal variances across groups.
Use it when the sample sizes are unequal.
Use it when the sample sizes are equal but small.

Exercise 4-2

Small round blue cells!!!

github

Nonparametric Statistics

Test for Normality
Chi-square Test

Formal Tests for Normality

Histogram

Q-Q Plot
- If the data are normally distributed, the quantiles fall approximately on the 45-degree line.

Which Normality Test Should I Use?

Kolmogorov-Smirnov test:
- It is more sensitive near the center of the distribution than at the tails;
- For data sets n > 50.
The Anderson-Darling test:
- A-D test is a modification of the K-S test and gives more weight to the tails of the density than does the K-S test.
- It is generally preferable to the K-S test.
Shapiro-Wilk test:
- Doesn’t work well if several values in the data set are the same.
- Works best for data sets with n < 50, but can be used with larger data sets.
W/S test (range(x)/sd(x)):
- simple, but effective.
Jarque-Bera test (jarque.test {moments}):
- tests for skewness and kurtosis, very effective.
D’Agostino test (agostino.test {moments}):
- powerful omnibus (skewness, kurtosis, centrality) test.
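
To show what a skewness-and-kurtosis-based test looks like, here is a hand-rolled sketch of the Jarque-Bera statistic, JB = n/6 · (S² + (K − 3)²/4), compared against a chi-squared distribution with 2 degrees of freedom. This is my own illustration in Python, not the `jarque.test` function from R's moments package, and the p-value is asymptotic, so it is rough for small samples.

```python
import random
from math import exp

def jarque_bera(xs):
    """Return (JB statistic, asymptotic p-value)."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # For a chi-squared distribution with 2 df, P(X > x) = exp(-x / 2)
    return jb, exp(-jb / 2)

rng = random.Random(7)
skewed = [rng.expovariate(1.0) for _ in range(500)]  # clearly non-normal data
jb, p_value = jarque_bera(skewed)
print(jb, p_value)  # large statistic, tiny p-value: normality is rejected
```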

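Chi-Square Test: chisq.test

Test of goodness of fit: tests whether the data follow a given set of proportions or a given probability distribution.
Test of homogeneity: tests whether the proportions are consistent across several different groups.
Test of independence: tests whether two categorical variables are independent of each other.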
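
As an illustration of the goodness-of-fit case, this sketch computes the statistic χ² = Σ (O − E)² / E by hand. The counts and proportions are made up, and I deliberately use three categories so that the degrees of freedom are 2, where the p-value has the simple closed form exp(−χ²/2). In R, `chisq.test(observed, p = proportions)` would be the usual route.

```python
from math import exp

observed = [44, 36, 20]        # hypothetical counts in three categories
proportions = [0.5, 0.3, 0.2]  # proportions claimed by the null hypothesis

total = sum(observed)
expected = [total * p for p in proportions]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Closed-form survival function, valid only because df == 2 here
p_value = exp(-chi2 / 2)
print(chi2, df, p_value)
```

With a p-value this large, the data give no reason to doubt the claimed proportions.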

Exercise 5-2

github

Handling Missing Values

Missing Data
Missingness Mechanism
- Missing by Design
- Missing Completely at Random (MCAR)
- Missing at Random (MAR)
- Missing Not at Random (MNAR)

Dealing with Missing Values

The missing values may give clues to systematic aspects of the problem.
How to deal with missing values:
- Using a global constant to fill in the value can misguide the mining process (for example: giving a score of 0 for a missed exam; image signal = foreground - background).
- Use the attribute mean or median for all samples belonging to the same class as the given tuple.
- Missing value imputation (most popular)
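
The class-wise mean strategy from the list above can be sketched in a few lines. The records, field names, and function name are all made up; missing values are represented as `None`, and the sketch assumes every class has at least one observed value.

```python
from statistics import fmean

# Hypothetical records; None marks a missing score
records = [
    {"class": "A", "score": 80.0},
    {"class": "A", "score": None},
    {"class": "A", "score": 90.0},
    {"class": "B", "score": 60.0},
    {"class": "B", "score": None},
]

def impute_class_mean(rows, key, group):
    """Fill missing `key` values with the mean of the same `group`."""
    means = {}
    for g in {r[group] for r in rows}:
        observed = [r[key] for r in rows
                    if r[group] == g and r[key] is not None]
        means[g] = fmean(observed)
    # Return new dicts so the original records stay untouched
    return [dict(r, **{key: means[r[group]]}) if r[key] is None else dict(r)
            for r in rows]

imputed = impute_class_mean(records, key="score", group="class")
for row in imputed:
    print(row)
```

Swapping `fmean` for `statistics.median` gives the median variant mentioned in the same bullet.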

Missingness Mechanism

The presence of missing data can
- affect the properties of the estimates (e.g. means, percentages, percentiles, variances, ratios, regression parameters, etc.).
- affect inferences (e.g. the properties of tests and confidence intervals).
The missingness mechanism (Little and Rubin, 1987)
- The way in which the probability of an item missing depends on other observed or non-observed variables as well as on its own value.
It is helpful to classify missing values on the basis of the stochastic mechanism that produces them.

Missing by Design / Missing Completely at Random

Missing by Design
- Some participants are excluded from the analysis because they are not part of the population under investigation.
- Missingness codes: (i) refused to answer; (ii) answered don’t know; (iii) had a valid skip; or (iv) was skipped by an enumerator error.
Missing Completely at Random (MCAR)
- Missingness is independent of both the unobserved values themselves and the observed data.
- Example: miscoding, or forgetting to log an answer.
- Imputation methods rely on the missingness being of the MCAR type.

Handling Missing Values: Outline

R Packages for Dealing With Missing Values: VIM, MICE
Visualizing the Pattern of Missing Data
Traditional Approaches to Handling Missing Data
Imputation Methods: KNN
Which Imputation Method?

Exercise 7-1

github


阿葛廷
