Handling Skewed Data
Error Metrics for Skewed Classes
Precision / Recall
- Precision (Of all patients where we predicted y = 1, what fraction actually has cancer?)
  Precision = True positives / (# predicted positive) = True positives / (True positives + False positives)
- Recall (Of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)
  Recall = True positives / (# actual positive) = True positives / (True positives + False negatives)
A classifier is only a good classifier if it has both high precision and high recall: if an algorithm achieves high precision and high recall, we can be confident it is doing well, even when the classes are very skewed.
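A minimal sketch of how these two numbers could be computed in Python (the helper name precision_recall and the toy arrays are illustrative, not from the course):

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (y = 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy example: 1 = has cancer, 0 = does not.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 0, 0, 1])
print(precision_recall(y_true, y_pred))  # (0.666..., 0.666...)
```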
Trading Off Precision and Recall
- Logistic regression: 0 ≤ hθ(x) ≤ 1
- Predict 1 if hθ(x) ≥ 0.5
- Predict 0 if hθ(x) < 0.5
Higher precision, lower recall (raise the threshold): only tell a patient they have cancer when we are quite certain, to avoid causing needless anxiety.
Higher recall, lower precision (lower the threshold): tell a patient whenever cancer is even a possibility, so we do not miss the chance for closer observation or treatment.
Either trade-off can be justified; it depends on which point of view you take!
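A small sketch of this trade-off, assuming we already have hypothesis outputs hθ(x) (the scores and labels below are made up for illustration): raising the threshold pushes precision up and recall down, and lowering it does the opposite.

```python
import numpy as np

# Hypothetical hypothesis outputs hθ(x) and true labels, for illustration only.
h = np.array([0.95, 0.80, 0.65, 0.40, 0.30, 0.10])
y = np.array([1, 1, 0, 1, 0, 0])

# Raising the threshold -> higher precision, lower recall; lowering it -> the opposite.
for threshold in (0.3, 0.5, 0.7, 0.9):
    y_pred = (h >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y == 1))
    fp = np.sum((y_pred == 1) & (y == 0))
    fn = np.sum((y_pred == 0) & (y == 1))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```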
Precision / Recall curve
- More generally: predict 1 if hθ(x) ≥ threshold
- As the threshold varies, we trace out a precision/recall curve, and this curve can take a different shape for each classifier.
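One way to see where the curve and its shape come from: sweep the threshold across the scores hθ(x) and record a (recall, precision) point at each setting. A sketch, again with made-up scores and labels:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores hθ(x) and ground-truth labels, for illustration only.
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05])
labels = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])

precisions, recalls = [], []
for t in np.unique(scores)[::-1]:        # sweep thresholds from high to low
    pred = (scores >= t).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    precisions.append(tp / (tp + fp))
    recalls.append(tp / (tp + fn))

plt.plot(recalls, precisions, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision/Recall curve (threshold swept over hθ(x))")
plt.show()
```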
F1 Score (F score)
- How to compare precision/recall numbers?
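The answer this heading points to is the F1 score: F1 = 2PR / (P + R), the harmonic mean of precision and recall. Unlike the plain average (P + R) / 2, it is only high when both precision and recall are high. A small sketch comparing the two summaries (the precision/recall pairs below are made up for illustration):

```python
def f1_score(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Comparing algorithms by their (precision, recall) pairs.
algorithms = {"Algorithm 1": (0.5, 0.4),
              "Algorithm 2": (0.7, 0.1),
              "Algorithm 3": (0.02, 1.0)}   # e.g. a classifier that predicts y = 1 all the time

for name, (p, r) in algorithms.items():
    print(f"{name}: average={(p + r) / 2:.3f}  F1={f1_score(p, r):.3f}")
```

The third classifier has the highest average but the lowest F1, which is why averaging precision and recall is a poor way to compare algorithms.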