CSE 473 - Introduction to Artificial Intelligence
Autumn 2003 - Prof. Henry Kautz
Assignment 8: Spam Filtering by Machine Learning
Your task is to build a system that learns to classify email messages as
spam or non-spam using statistical learning techniques. You will train
your system on a training set of your own email that you have manually
classified. Then, you will measure the performance of the system on a
test set of your own data, separate from the training data.
You must work on this project in teams. The size of the team
determines the project requirements as follows:
- A team of 2 people must
implement and test a naive Bayes classifier OR a single-unit feed-forward
neural net classifier.
- A team of 3 people must
implement and test a naive Bayes classifier AND a single-unit neural net
classifier.
- A team of 4 people must
implement and test a naive Bayes classifier AND a single-unit neural net
classifier AND one of the following extensions:
- A 2-layer neural net
classifier with a hidden layer of 20 units. Note that this will
require you to implement the back-propagation algorithm for
training. Compare performance to the single-unit classifier.
(It might or might not be better!)
- Use hand-crafted
features in addition to the words in the message as input. For
example, a feature might be whether the sender of the message uses a free
email account from hotmail, yahoo, etc. Compare the performance of
your system with and without the hand-crafted features.
- Determine if
considering word capitalization, punctuation, and/or the number of times
a word is used in a message to enlarge the set of features improves
performance.
- Determine how
sensitive your system is to the training data by testing on data that was
collected by a person different from the one who collected the training
data. Compare the accuracy against the case where the
training and test data were collected by the same person.
- Determine how the
amount of training data affects the accuracy of the system. For
example, train on 25 pieces of spam, and then test; then train on 50
pieces, and test; and so on. Do you reach a point at which
additional training data offers little or no improvement in accuracy?
- Determine how the type
I and type II errors change as the threshold for deciding that a piece of
email is spam is raised in 5% increments from probability 50% to 95%.
- Other interesting
extensions or experiments you can think of - just discuss what you would
like to try with me in advance.
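As an illustration of the threshold experiment above, the sketch below sweeps the spam-decision threshold from 50% to 95% in 5% increments and reports both kinds of error at each setting. It is written in Python; the probabilities, labels, and function names are made up for illustration.

```python
# Sketch of the threshold-sweep experiment. Assumes you already have,
# for each test message, the classifier's estimated probability that
# it is spam, plus the true label. The data below is invented.

def error_rates(probs_and_labels, threshold):
    """Count misclassifications at a given spam-probability threshold.

    Returns (missed_spam_rate, false_alarm_rate): the fraction of spam
    messages classified as non-spam, and the fraction of non-spam
    messages classified as spam.
    """
    spam = [p for p, y in probs_and_labels if y == "spam"]
    ham = [p for p, y in probs_and_labels if y == "ham"]
    missed = sum(1 for p in spam if p <= threshold) / len(spam)
    false_alarm = sum(1 for p in ham if p > threshold) / len(ham)
    return missed, false_alarm

# Hypothetical classifier outputs: (P(spam | message), true label).
results = [(0.97, "spam"), (0.62, "spam"), (0.40, "spam"),
           (0.05, "ham"), (0.55, "ham"), (0.20, "ham")]

for t in [0.50 + 0.05 * i for i in range(10)]:   # 0.50, 0.55, ..., 0.95
    miss, fa = error_rates(results, t)
    print(f"threshold {t:.2f}: missed spam {miss:.2f}, false alarms {fa:.2f}")
```

Raising the threshold makes the filter more conservative: fewer legitimate messages are flagged, at the cost of letting more spam through.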
Any team can earn extra credit by doing more than the actual requirements
for a team of the given size.
Technical Details
The most straightforward implementation of either kind of classifier
considers each message to be the set of words contained in the body of the
message, ignoring all headers, punctuation, capitalization, attachments, and
repeated words. Each word is a possible feature of the message: for a
naive Bayes classifier, a word corresponds to an observable Boolean
variable; for a neural net classifier, a word corresponds to a 0/1
input. The output of the classifier is a real number: for the naive
Bayes classifier the probability that the input message is spam, or for the
neural net classifier the output of the single output unit (which should approach
1 when the message is spam).
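To make the single-unit case concrete, here is a minimal sketch of such a classifier, assuming a sigmoid output unit trained with the delta rule; the toy dictionary, dataset, and all function names are illustrative, not part of the assignment.

```python
import math

# Minimal sketch of the single-unit classifier described above: each
# dictionary word is a 0/1 input, and the single output unit squashes
# a weighted sum through a sigmoid so that the output approaches 1 for
# spam. The tiny dataset below is invented for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, n_features, rate=0.5, epochs=200):
    """Delta-rule training: examples are (bit_vector, target) pairs."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, target in examples:
            out = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = target - out
            b += rate * err
            for i in range(n_features):
                if x[i]:
                    w[i] += rate * err
    return w, b

def classify(w, b, x):
    """Output of the single unit: estimated P(spam) for input bits x."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Toy dictionary of 3 words; messages as word-presence bit vectors.
train_set = [([1, 1, 0], 1), ([1, 0, 0], 1),   # spam
             ([0, 0, 1], 0), ([0, 1, 1], 0)]   # non-spam
w, b = train(train_set, 3)
print(classify(w, b, [1, 0, 0]))   # should be well above 0.5
```

In your real system the bit vectors would have one position per dictionary word, but the training loop is the same.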
You could design your implementation so that it reads email files directly,
but it may be more convenient to write a separate program that processes your
email to put it in a simpler form first: for example, a separate file for each
message, where each message is just a sequence of N zero's and one's, where N
is the size of the dictionary of all the possible words that appear in your
training set. (For testing you can just ignore words that were never
encountered during training.) Or, you could process the email in several
passes, where the first one breaks messages into separate files, the second
converts each file into a sequence of lowercase words separated by a single
space or newline, the third creates the dictionary, and the fourth replaces
words by the index of the word in the dictionary, and the final pass actually
builds the classifier. The programming language PERL is extremely handy
for performing such text-bashing.
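Perl is one option; the same text-bashing can also be sketched in a few lines of Python. The snippet below (all names and messages are invented for illustration) collapses the passes into one script: lowercase and split each body, build a dictionary from the training messages, then encode a message as a 0/1 presence vector. File handling is omitted; each "message" here is just a string.

```python
import re

def words_of(body):
    """Lowercase words of a message body, punctuation stripped."""
    return set(re.findall(r"[a-z']+", body.lower()))

def build_dictionary(messages):
    """Map each word seen in the training messages to an index."""
    vocab = sorted(set().union(*(words_of(m) for m in messages)))
    return {w: i for i, w in enumerate(vocab)}

def encode(body, dictionary):
    """0/1 feature vector; words never seen in training are ignored."""
    bits = [0] * len(dictionary)
    for w in words_of(body):
        if w in dictionary:          # skip words not in the dictionary
            bits[dictionary[w]] = 1
    return bits

train_msgs = ["Buy cheap meds NOW", "Meeting moved to Tuesday"]
d = build_dictionary(train_msgs)
print(encode("cheap meds arrived tuesday", d))
```

Note that `encode` silently drops words absent from the training dictionary, which is exactly the test-time behavior described above.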
As noted above, possible extensions to this model could distinguish
lowercase and uppercase words, could consider the number of times a word is
used in a message, and so on.
You should organize your data into 4 sets:
- A training set of non-spam
messages. This should be 75% of the non-spam messages you collected.
(You can use data collected by just one member of the team or by all the
members.)
- A training set of spam
messages. This should be 75% of the spam messages you collected.
- A test set of non-spam
messages. This should be 25% of the non-spam messages you
collected.
- A test set of spam
messages. This should be 25% of the spam messages you collected.
You should have at least 200 spam and 200 non-spam messages for training
(the more, the better). It is critically important that you do not train
your system on your test data! The proportion of spam to non-spam
messages need not be 50/50, but it should be the same in the training
set as in the test set.
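The 75%/25% split of each class can be done in a few lines. A hypothetical sketch in Python (the seed and message names are placeholders for your own data):

```python
import random

# Illustrative 75/25 split of one class of messages into training and
# test sets. Shuffling first avoids accidentally putting all of the
# oldest mail into the training set.

def split(messages, train_fraction=0.75, seed=0):
    msgs = list(messages)
    random.Random(seed).shuffle(msgs)   # fixed seed => reproducible split
    cut = int(len(msgs) * train_fraction)
    return msgs[:cut], msgs[cut:]

spam_train, spam_test = split([f"spam{i}" for i in range(200)])
ham_train, ham_test = split([f"ham{i}" for i in range(200)])
print(len(spam_train), len(spam_test))   # 150 50
```

Splitting each class separately guarantees the spam/non-spam ratio is the same in the training and test sets.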
The Problem of Small Probabilities
Care must be taken in handling small and zero probabilities in a Bayesian
classifier for two reasons:
- Zeros in the conditional
probability tables can make the classifier fragile. Suppose that a
particular word is never seen in any non-spam in the training data (but it
has been seen in some spam, and therefore is in the dictionary).
Then, if the probability of that word conditioned on non-spam is set to
zero, then any message containing that word will be classified as spam
with probability 100%. While this may be desirable if the word in
question is "Viagra", it is probably not a good idea for the
word "money" (especially if you get a message with a job offer
from Bill Gates that says "money is no object"). The
solution to this problem is to set the conditional probability for an
unseen word to a very small, non-zero probability. The value should
be smaller than the smallest value assigned to any word - that is, smaller
than the value assigned to a word that is seen only once in the training
data. A simple rule of thumb is to set the probability to 1/4 of the
smallest probability assigned to any seen word. (For more
mathematically rigorous ways of solving this problem look for solutions to
the "sparse data problem" in a good statistics textbook or on
the web.)
- Multiplying together small
probabilities can lead to numeric underflow. For example, suppose
you are calculating
P(w1,...,wk|Spam) =
P(w1|Spam)...P(wk|Spam)
where each P(wi|Spam)=0.01 and k=100. This gives a value of
10**(-200). This problem can be solved by representing probabilities
by their logarithms. Using logs has the further advantage that
multiplication is replaced by addition. Note that using logs
requires you to adopt one of the solutions to problem 1, since you cannot
take the log of zero.
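Both fixes can be sketched together. The snippet below, with made-up counts and a toy dictionary, applies the 1/4-of-smallest rule to unseen words and scores messages with sums of log probabilities rather than products:

```python
import math

def make_cond_probs(word_counts, n_class_msgs, dictionary):
    """Estimate P(word | class) for every dictionary word, smoothed."""
    probs = {w: word_counts.get(w, 0) / n_class_msgs for w in dictionary}
    smallest_seen = min(p for p in probs.values() if p > 0)
    floor = smallest_seen / 4        # rule-of-thumb floor for zeros
    return {w: (p if p > 0 else floor) for w, p in probs.items()}

def log_score(message_words, cond_probs, log_prior):
    """log P(class) + sum of log P(word | class) over present words."""
    return log_prior + sum(math.log(cond_probs[w])
                           for w in message_words if w in cond_probs)

# Invented counts: out of 100 spam messages, 40 contained "money", etc.
dictionary = {"money", "meeting", "viagra"}
spam_p = make_cond_probs({"money": 40, "viagra": 25}, 100, dictionary)
ham_p = make_cond_probs({"money": 10, "meeting": 60}, 100, dictionary)

msg = {"money", "viagra"}
s = log_score(msg, spam_p, math.log(0.5))   # assuming equal priors
h = log_score(msg, ham_p, math.log(0.5))
p_spam = math.exp(s) / (math.exp(s) + math.exp(h))
print(p_spam)   # about 0.976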
Other Sources of Data
Several sets of spam and non-spam email have been collected in public
machine learning databases. Here is a copy of one of them, lingspam_public.tar.gz. The email
messages here have been "semi-digested", by having words converted to
lowercase, multiple spaces and newlines deleted, and so on. You might
find it useful to use this data while building and testing your system, before
you have your own data set ready to go.
Deadlines
By Monday Nov 24 you must have formed a team. One person
from the team should send email to Jeffrey <jbigham@cs.washington.edu>
letting him know who is on your team. Jeffrey will assign a shared
workspace on the instructional servers to your team. If you have been
unable to join a team by that day send email to Jeffrey and he will assign
any
remaining people to new or existing teams.
By 10am Friday Dec 5 you should turn in your results using the
turn-in procedure described below. You should turn in both your program
code and a write-up of about 4 pages that clearly describes:
- What you implemented,
including any extensions
- A high-level description of
the data structures and algorithms you used, and significant design
decisions
- How you evaluated your
system: How many messages were used for training? For
testing? Were they gathered by one person or several?
- Measurements of the
performance of your system. At a minimum this should include:
- Type I error:
What percentage of test spam messages were incorrectly classified as
non-spam? For naive Bayes, classify a message as spam if its
probability of being spam is > 50%.
- Type II error: What
percentage of test non-spam messages were incorrectly classified as spam?
- A brief discussion of
insights you gained.
Turn-in instructions
Login to one of the undergradaute instructional unix machine such as attu,
and then submit your work with a command such as:
turnin -c cse473 yourfile1.c yourfile2.c
writeup.txt
You can submit as many files as you want on the command line, but each
submission overwrites any previous ones - so you need to make sure to submit
everything again if you decide to submit multiple times. For more information
check the man page for turnin.