Building a Spam Classifier
Prioritizing What to Work On : article
Machine Learning System Design
-
Building a spam classifier as an example:
-
Supervised learning.
x = features of email. y = spam (1) or not spam (0).

Features x: Choose 100 words indicative of spam/not spam.
ex: deal, buy, discount, now, andrew, ...

Note: In practice, take the most frequently occurring n words (10,000 to 50,000) in the training set, rather than manually picking 100 words.

      | 0 |   andrew
      | 1 |   buy
      | 1 |   deal
x =   | 0 |   discount
      | . |   .
      | . |   .
      | 1 |   now
      | . |   .
      | . |   .

xj = 1 if word j appears in the email
     0 otherwise

--------------------------------------------
From: cheeapsales@buystufffromme.com
To: ang@cs.stanford.edu
Deal of the week! Buy now!
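The feature construction above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the helper names `tokenize`, `build_vocabulary`, and `featurize` are made up for this example, and the tokenizer (lowercase, letters only) is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(emails, n):
    """Return the n most frequent words in the training emails, sorted alphabetically."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return sorted(word for word, _ in counts.most_common(n))

def featurize(email, vocabulary):
    """Return x where x[j] = 1 if vocabulary[j] appears in the email, else 0."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]

# Hand-picked vocabulary from the notes, truncated to five words.
vocabulary = ["andrew", "buy", "deal", "discount", "now"]
email = "Deal of the week! Buy now!"
x = featurize(email, vocabulary)
print(x)  # [0, 1, 1, 0, 1]
```

In practice `build_vocabulary(training_emails, n)` would replace the hand-picked list, with n somewhere in the 10,000 to 50,000 range mentioned in the note.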
-
Ways to improve the accuracy of this classifier:
- Collect lots of data
- Develop sophisticated features (ex: using email header data in spam emails)
- Develop algorithms to process your input in different ways (ex: recognizing misspellings in spam)
It is difficult to tell in advance which of these options will be most helpful.
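As one illustration of the third option, a preprocessing step could map deliberate misspellings back to ordinary words before the features are computed. The sketch below is an assumption-laden example: the substitution table and the sample words are hypothetical, and a real system would need a more careful approach.

```python
# Hypothetical table of characters that spammers swap in for letters.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "$": "s", "@": "a"}
)

def normalize_word(word):
    """Undo simple character substitutions in tokens that contain at least one letter."""
    if not any(ch.isalpha() for ch in word):
        return word  # leave pure numbers such as "2024" untouched
    return word.lower().translate(SUBSTITUTIONS)

print([normalize_word(w) for w in ["m0rtgage", "med1cine", "w4tches", "2024"]])
# ['mortgage', 'medicine', 'watches', '2024']
```

Whether a step like this actually helps is exactly the kind of question that cannot be answered up front; it has to be measured on the cross-validation set.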
-
Error Analysis : article
Recommended approach
- Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
- Plot learning curves to decide if more data, more features, etc. are likely to help.
- Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in what type of examples it is making errors on.
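The error analysis step can be as simple as tallying the misclassified cross-validation examples by a category assigned by hand. A minimal sketch, assuming each example has already been labeled manually; the records and category names below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical cross-validation records:
# (true label, predicted label, category assigned by hand)
cv_results = [
    (1, 0, "pharma"),
    (1, 0, "phishing"),
    (1, 1, "pharma"),
    (0, 0, "not spam"),
    (1, 0, "phishing"),
    (0, 1, "not spam"),
    (1, 0, "replica"),
]

def tally_errors(results):
    """Count misclassified examples per hand-assigned category."""
    return Counter(
        category for y_true, y_pred, category in results if y_true != y_pred
    )

errors = tally_errors(cv_results)
for category, count in errors.most_common():
    print(f"{category}: {count}")
```

The category with the most errors is a natural candidate for the next round of feature work, which is the "systematic trend" the step above asks you to look for.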
Error Analysis