CSE 473 - Introduction to Artificial Intelligence
Autumn 2003 - Prof. Henry Kautz

Assignment 8: Spam Filtering by Machine Learning

Your task is to build a system that learns to classify email messages as spam or non-spam using statistical learning techniques.  You will train your system on a training set of your own email that you have manually classified.  Then, you will measure the performance of the system on a test set of your own data, separate from the training data.

You must work on this project in teams.  The size of the team determines the project requirements as follows:

Any team can earn extra credit by doing more than the stated requirements for a team of its size.

Technical Details

The most straightforward implementation of either kind of classifier considers each message to be the set of words contained in the body of the message, ignoring all headers, punctuation, capitalization, attachments, and repeated words.  Each word is a possible feature of the message: for a naive Bayes classifier, a word corresponds to an observable Boolean variable; for a neural net classifier, a word corresponds to a 0/1 input.  The output of the classifier is a real number: for the naive Bayes classifier, the probability that the input message is spam; for the neural net classifier, the output of the single output unit (which should approach 1 when the message is spam).
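For concreteness, here is a minimal sketch of the naive Bayes decision rule in Python, assuming the per-word conditional probability tables have already been estimated from the training set (the names p_word_spam, p_word_ham, and p_spam are illustrative, not prescribed).  For simplicity, only words present in the message contribute a factor.  Multiplying raw probabilities this way can underflow; the section on small probabilities below shows the standard fix.

    # Hypothetical tables, both defined over the full training dictionary:
    #   p_word_spam[w] = P(w appears | spam)
    #   p_word_ham[w]  = P(w appears | non-spam)
    # p_spam is the prior: the fraction of training messages that are spam.

    def prob_spam(words, p_word_spam, p_word_ham, p_spam):
        """Return P(Spam | message) for a message given as a set of words."""
        like_spam, like_ham = p_spam, 1.0 - p_spam
        for w in words:
            if w in p_word_spam:          # skip words never seen in training
                like_spam *= p_word_spam[w]
                like_ham *= p_word_ham[w]
        return like_spam / (like_spam + like_ham)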

You could design your implementation so that it reads email files directly, but it may be more convenient to write a separate program that first processes your email into a simpler form: for example, a separate file for each message, where each message is just a sequence of N zeros and ones, where N is the size of the dictionary of all the words that appear in your training set.  (For testing you can just ignore words that were never encountered during training.)  Or, you could process the email in several passes: the first breaks messages into separate files, the second converts each file into a sequence of lowercase words separated by a single space or newline, the third creates the dictionary, the fourth replaces each word by its index in the dictionary, and the final pass actually builds the classifier.  The programming language Perl is extremely handy for this kind of text-bashing.
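To make the multi-pass idea concrete, here is a sketch of the middle passes in Python (Perl would serve just as well); the function names and the one-message-per-file layout are assumptions for illustration, not requirements:

    import os
    import re

    def words_of(path):
        """Return the set of lowercase words in one message body."""
        with open(path, errors="ignore") as f:
            return set(re.findall(r"[a-z]+", f.read().lower()))

    def build_dictionary(message_dir):
        """Collect every word appearing in any training message."""
        vocab = set()
        for name in os.listdir(message_dir):
            vocab |= words_of(os.path.join(message_dir, name))
        return sorted(vocab)   # a word's dictionary index is its position here

    def to_bits(path, dictionary):
        """Encode one message as a sequence of N zeros and ones."""
        present = words_of(path)
        return [1 if w in present else 0 for w in dictionary]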

As noted above, possible extensions to this model could distinguish lowercase and uppercase words, could consider the number of times a word is used in a message, and so on.
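For instance, the word-count extension changes only the encoding step.  A sketch, reusing the hypothetical dictionary from the preprocessing sketch above:

    import re
    from collections import Counter

    def to_counts(path, dictionary):
        """Count-based variant of to_bits: entry i is how often word i occurs."""
        with open(path, errors="ignore") as f:
            counts = Counter(re.findall(r"[a-z]+", f.read().lower()))
        return [counts[w] for w in dictionary]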

You should organize your data into 4 sets: spam messages for training, non-spam messages for training, spam messages for testing, and non-spam messages for testing.

You should have at least 200 spam and 200 non-spam messages for training (the more, the better).  It is critically important that you do not train your system on your test data!  While the proportions of spam and non-spam messages need not be equal, the ratio of spam to non-spam in the training set should be the same as the ratio of spam to non-spam in the test set.
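One simple way to guarantee matching ratios is to split each class with the same training fraction, as in this sketch (the directory names and the 80/20 split are assumptions for illustration):

    import os
    import random

    def split(files, train_fraction=0.8):
        """Randomly split one class's messages into training and test lists."""
        files = list(files)
        random.shuffle(files)
        cut = int(len(files) * train_fraction)
        return files[:cut], files[cut:]

    # Splitting spam and non-spam with the same fraction keeps the
    # spam/non-spam ratio the same in the training and test sets.
    spam_train, spam_test = split(os.listdir("spam"))
    ham_train,  ham_test  = split(os.listdir("nonspam"))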

The Problem of Small Probabilities

Care must be taken in handling small and zero probabilities in a Bayesian classifier for two reasons:

  1. Zeros in the conditional probability tables can make the classifier fragile.  Suppose that a particular word is never seen in any non-spam message in the training data (but it has been seen in some spam, and is therefore in the dictionary).  If the probability of that word conditioned on non-spam is set to zero, then any message containing that word will be classified as spam with probability 100%.  While this may be desirable if the word in question is "Viagra", it is probably not a good idea for the word "money" (especially if you get a message with a job offer from Bill Gates that says "money is no object").  The solution is to set the conditional probability for an unseen word to a very small, non-zero value - smaller than the smallest value assigned to any seen word, that is, smaller than the value assigned to a word seen only once in the training data.  A simple rule of thumb is to use 1/4 of the smallest probability assigned to any seen word, as in the sketch following this list.  (For more mathematically rigorous treatments, look for the "sparse data problem" in a good statistics textbook or on the web.)
  2. Multiplying together small probabilities can lead to numeric underflow.  For example, suppose you are calculating
            P(w1,...,wk|Spam) = P(w1|Spam)...P(wk|Spam)
    where each P(wi|Spam) = 0.01 and k = 100.  This gives a value of 10**(-200).  The problem can be solved by representing probabilities by their logarithms.  Using logs has the further advantage that multiplication is replaced by addition.  Note that using logs requires you to adopt one of the solutions to problem 1, since you cannot take the log of zero; the sketch below combines the two fixes.
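Here is a sketch combining both fixes - conditional probability tables smoothed by the 1/4 rule of thumb, and classification carried out in log space (all names are illustrative):

    import math

    def make_table(word_counts, num_messages):
        """P(word | class) from per-class counts, where word_counts[w] is the
        number of training messages of this class containing w.  Also returns
        a floor probability (1/4 of the smallest seen probability) to use for
        dictionary words never seen in this class."""
        probs = {w: c / num_messages for w, c in word_counts.items() if c > 0}
        floor = min(probs.values()) / 4.0
        return probs, floor

    def log_score(words, probs, floor, log_prior, dictionary):
        """log P(class) + sum of log P(w | class) over dictionary words present."""
        s = log_prior
        for w in words:
            if w in dictionary:           # ignore words never seen in training
                s += math.log(probs.get(w, floor))
        return s

A message is then classified as spam when its spam score exceeds its non-spam score; the difference of the two scores is the log of the posterior odds P(Spam|message) / P(non-spam|message).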

Other Sources of Data

Several sets of spam and non-spam email have been collected in public machine learning databases.  Here is a copy of one of them, lingspam_public.tar.gz.  The email messages here have been "semi-digested", by having words converted to lowercase, multiple spaces and newlines deleted, and so on.  You might find it useful to use this data while building and testing your system, before you have your own data set ready to go.

Deadlines

By Monday Nov 24 you must have formed a team.  One person from the team should send email to Jeffrey <jbigham@cs.washington.edu> letting him know who is on your team.  Jeffrey will assign your team a shared workspace on the instructional servers.  If you have been unable to join a team by that day, send email to Jeffrey and he will assign any remaining people to new or existing teams.

By 10am Friday Dec 5 you should turn in your results using the turn-in procedure described below.  You should turn in both your program code and a write-up of about 4 pages that clearly describes:

Turn-in instructions

Log in to one of the undergraduate instructional Unix machines, such as attu, and then submit your work with a command such as:

turnin -c cse473 yourfile1.c yourfile2.c writeup.txt

You can submit as many files as you want on the command line, but each submission overwrites any previous ones - so make sure to resubmit everything if you decide to turn in files multiple times.  For more information, check the man page for turnin.