building a spam classifier

July 25, 2019 1 minute read

Do the right things, don't do the wrong things, and do them in earnest.

One seafood a day ~~~~

Prioritizing What to Work On : article

Machine learning System Design

Building a spam classifier as an example:

Supervised learning.

 x = features of email.
 y = spam(1) or not spam(0)
 Features x: Choose 100 words indicative of spam/not spam
      
 # Note: 
 In practice, take the n most frequently occurring words (10,000 to 50,000)
 in the training set, rather than manually picking 100 words.

 ex:
    deal, buy, discount, now, andrew, ...

         | 0 | andrew
         | 1 | buy
         | 1 | deal
    x =  | 0 | discount
         | . | .
         | . | .
         | 1 | now
         | . | .
         | . | .          ,  xj = { 1 if word j appears in email
                                  { 0 otherwise
  --------------------------------------------
   From: cheeapsales@buystufffromme.com
   To: ang@cs.stanford.edu

   Deal of the week! Buy now!
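
The encoding above can be sketched in Python. This is a minimal sketch, not the course's implementation: the helper names `tokenize`, `build_vocabulary`, and `email_to_features` are hypothetical, and tokenization is simplified to lowercase alphabetic runs. `build_vocabulary` illustrates the note about taking the n most frequent words from the training set; the hand-picked five-word list is used for the worked example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(emails, n):
    """Return the n most frequent words in the training emails, sorted alphabetically."""
    counts = Counter(word for email in emails for word in tokenize(email))
    return sorted(word for word, _ in counts.most_common(n))


def email_to_features(email, vocabulary):
    """x_j = 1 if word j of the vocabulary appears in the email, 0 otherwise."""
    words = set(tokenize(email))
    return [1 if word in words else 0 for word in vocabulary]


# Hand-picked word list from the example above (for illustration only).
vocabulary = ["andrew", "buy", "deal", "discount", "now"]
email = "Deal of the week! Buy now!"
x = email_to_features(email, vocabulary)
print(x)  # [0, 1, 1, 0, 1]
```

The printed vector matches the column shown above: "buy", "deal", and "now" appear in the email, while "andrew" and "discount" do not.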

Improving the accuracy of this classifier:
- Collect lots of data
- Develop sophisticated features (ex: using email header data in spam emails)
- Develop algorithms to process your input in different ways (ex: recognizing misspellings in spam)
It is difficult to tell which of these options will be most helpful.

Error Analysis : article

Recommended approach

Start with a simple algorithm that you can implement quickly. Implement it and test it on your cross-validation data.
Plot learning curves to decide if more data, more features, etc. are likely to help.
Error analysis: Manually examine the examples (in the cross-validation set) that your algorithm made errors on. See if you spot any systematic trend in the types of examples it is making errors on.
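
The error-analysis step can be as simple as tallying misclassified cross-validation examples by a category you assign while reading them. The sketch below assumes a hypothetical record layout (`label`, `predicted`, `category`) and made-up category names; it is one possible way to organize the count, not a prescribed method.

```python
from collections import Counter


def tally_errors(examples):
    """Count misclassified cross-validation examples per manually assigned category."""
    errors = [ex for ex in examples if ex["predicted"] != ex["label"]]
    return Counter(ex["category"] for ex in errors)


# Hypothetical cross-validation results; categories are assigned by hand
# while reading each misclassified email.
cv_examples = [
    {"label": 1, "predicted": 0, "category": "pharma"},
    {"label": 1, "predicted": 0, "category": "phishing"},
    {"label": 1, "predicted": 1, "category": "pharma"},
    {"label": 0, "predicted": 0, "category": "personal"},
    {"label": 1, "predicted": 0, "category": "phishing"},
]

# Print categories from most to least frequent among the errors.
for category, count in tally_errors(cv_examples).most_common():
    print(f"{category}: {count}")
```

A category that dominates the tally is a candidate for targeted features; a category with few errors is probably not worth the effort yet.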

VERY IMPORTANT: get error results as a single numerical value. Otherwise it is difficult to assess your algorithm’s performance.
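
As a rough illustration of a single-number metric, the sketch below computes the cross-validation error rate (the fraction of misclassified examples) for two variants of a classifier. The labels, predictions, and the names `error_rate`, `pred_a`, and `pred_b` are all hypothetical.

```python
def error_rate(labels, predictions):
    """Fraction of cross-validation examples that were misclassified."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / len(labels)


# Hypothetical cross-validation labels and the predictions of two variants.
y_cv = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
pred_a = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]  # variant A: 3 mistakes
pred_b = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]  # variant B: 1 mistake

print(error_rate(y_cv, pred_a))  # 0.3
print(error_rate(y_cv, pred_b))  # 0.1
```

With one number per variant, deciding whether a change helped becomes a direct comparison rather than a judgment call.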

阿葛廷
