CSE 473 - Introduction to Artificial Intelligence
Autumn 2003 - Prof. Henry Kautz
Assignment 8: Spam Filtering by Machine Learning
Your task is to build a system that learns to classify email messages as
spam or non-spam using statistical learning techniques. You will train
your system on a training set of your own email that you have manually
classified. Then, you will measure the performance of the system on a
test set of your own data, separate from the training data.
You must work on this project in teams. The size of the team
determines the project requirements as follows:
- A team of 2 people must
implement and test a naive Bayes classifier OR a single-unit feed-forward
neural net classifier.
- A team of 3 people must
implement and test a naive Bayes classifier AND a single-unit neural net
classifier.
- A team of 4 people must
implement and test a naive Bayes classifier AND a single-unit neural net
classifier AND one of the following extensions:
- A 2-layer neural net
classifier with a hidden layer of 20 units. Note that this will
require you to implement the back-propagation algorithm for
training. Compare performance to the single-unit classifier.
(It might or might not be better!)
- Use hand-crafted
features in addition to the words in the message as input. For
example, a feature might be whether the sender of the message uses a free
email account from hotmail, yahoo, etc. Compare the performance of
your system with and without the hand-crafted features.
- Determine if
considering word capitalization, punctuation, and/or the number of times
a word is used in a message to enlarge the set of features improves
performance.
- Determine how
sensitive your system is to the training data by testing on data that was
collected by a person different from the one who collected the training
data. Compare the accuracy against the case where the
training and test data were collected by the same person.
- Determine how the
amount of training data affects the accuracy of the system. For
example, train on 25 pieces of spam, and then test; then train on 50
pieces, and test; and so on. Do you reach a point at which
additional training data offers little or no improvement in accuracy?
- Determine how the type
I and type II errors change as the threshold for deciding that a piece of
email is spam is raised in 5% increments from probability 50% to 95%.
- Other interesting
extensions or experiments you can think of - just discuss what you would
like to try with me in advance.
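As an illustration of the threshold experiment above, the sketch below sweeps the spam-decision threshold from 50% to 95% in 5% increments and reports both kinds of error at each setting. It is written in Python; the probabilities, labels, and function names are made up for illustration.

```python
# Sketch of the threshold-sweep experiment. Assumes you already have,
# for each test message, the classifier's estimated probability that
# it is spam, plus the true label. The data below is invented.

def error_rates(probs_and_labels, threshold):
    """Count misclassifications at a given spam-probability threshold.

    Returns (missed_spam_rate, false_alarm_rate): the fraction of spam
    messages classified as non-spam, and the fraction of non-spam
    messages classified as spam.
    """
    spam = [p for p, y in probs_and_labels if y == "spam"]
    ham = [p for p, y in probs_and_labels if y == "ham"]
    missed = sum(1 for p in spam if p <= threshold) / len(spam)
    false_alarm = sum(1 for p in ham if p > threshold) / len(ham)
    return missed, false_alarm

# Hypothetical classifier outputs: (P(spam | message), true label).
results = [(0.97, "spam"), (0.62, "spam"), (0.40, "spam"),
           (0.05, "ham"), (0.55, "ham"), (0.20, "ham")]

for t in [0.50 + 0.05 * i for i in range(10)]:   # 0.50, 0.55, ..., 0.95
    miss, fa = error_rates(results, t)
    print(f"threshold {t:.2f}: missed spam {miss:.2f}, false alarms {fa:.2f}")
```

Raising the threshold makes the filter more conservative: fewer legitimate messages are flagged, at the cost of letting more spam through.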
Any team can earn extra credit by doing more than the actual requirements
for a team of the given size.
Technical Details
The most straightforward implementation of either kind of classifier
considers each message to be the set of words contained in the body of the
message, ignoring all headers, punctuation, capitalization, attachments, and
repeated words. Each word is a possible feature of the message: for a
naive Bayes classifier, a word corresponds to an observable Boolean
variable; for a neural net classifier, a word corresponds to a 0/1
input. The output of the classifier is a real number: for the naive
Bayes classifier the probability that the input message is spam, or for the
neural net classifier the output of the single output unit (which should approach
1 when the message is spam).
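To make the single-unit case concrete, here is a minimal sketch of such a classifier, assuming a sigmoid output unit trained with the delta rule; the toy dictionary, dataset, and all function names are illustrative, not part of the assignment.

```python
import math

# Minimal sketch of the single-unit classifier described above: each
# dictionary word is a 0/1 input, and the single output unit squashes
# a weighted sum through a sigmoid so that the output approaches 1 for
# spam. The tiny dataset below is invented for illustration.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, n_features, rate=0.5, epochs=200):
    """Delta-rule training: examples are (bit_vector, target) pairs."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, target in examples:
            out = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = target - out
            b += rate * err
            for i in range(n_features):
                if x[i]:
                    w[i] += rate * err
    return w, b

def classify(w, b, x):
    """Output of the single unit: estimated P(spam) for input bits x."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

# Toy dictionary of 3 words; messages as word-presence bit vectors.
train_set = [([1, 1, 0], 1), ([1, 0, 0], 1),   # spam
             ([0, 0, 1], 0), ([0, 1, 1], 0)]   # non-spam
w, b = train(train_set, 3)
print(classify(w, b, [1, 0, 0]))   # should be well above 0.5
```

In your real system the bit vectors would have one position per dictionary word, but the training loop is the same.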
You could design your implementation so that it reads email files directly,
but it may be more convenient to write a separate program that processes your
email to put it in a simpler form first: for example, a separate file for each
message, where each message is just a sequence of N zero's and one's, where N
is the size of the dictionary of all the possible words that appear in your
training set. (For testing you can just ignore words that were never
encountered during training.) Or, you could process the email in several
passes, where the first one breaks messages into separate files, the second
converts each file into a sequence of lowercase words separated by a single
space or newline, the third creates the dictionary, and the fourth replaces
words by the index of the word in the dictionary, and the final pass actually
builds the classifier. The programming language PERL is extremely handy
for performing such text-bashing.
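Perl is one option; the same text-bashing can also be sketched in a few lines of Python. The snippet below (all names and messages are invented for illustration) collapses the passes into one script: lowercase and split each body, build a dictionary from the training messages, then encode a message as a 0/1 presence vector. File handling is omitted; each "message" here is just a string.

```python
import re

def words_of(body):
    """Lowercase words of a message body, punctuation stripped."""
    return set(re.findall(r"[a-z']+", body.lower()))

def build_dictionary(messages):
    """Map each word seen in the training messages to an index."""
    vocab = sorted(set().union(*(words_of(m) for m in messages)))
    return {w: i for i, w in enumerate(vocab)}

def encode(body, dictionary):
    """0/1 feature vector; words never seen in training are ignored."""
    bits = [0] * len(dictionary)
    for w in words_of(body):
        if w in dictionary:          # skip words not in the dictionary
            bits[dictionary[w]] = 1
    return bits

train_msgs = ["Buy cheap meds NOW", "Meeting moved to Tuesday"]
d = build_dictionary(train_msgs)
print(encode("cheap meds arrived tuesday", d))
```

Note that `encode` silently drops words absent from the training dictionary, which is exactly the test-time behavior described above.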
As noted above, possible extensions to this model could distinguish
lowercase and uppercase words, could consider the number of times a word is
used in a message, and so on.
You should organize your data into 4 sets:
- A training set of non-spam
messages. This should be 75% of the non-spam messages you collected.
(You can use data collected by just one member of the team or by all the
members.)
- A training set of spam
messages. This should be 75% of the spam messages you collected.
- A test set of non-spam
messages. This should be 25% of the non-spam messages you
collected.
- A test set of spam
messages. This should be 25% of the spam messages you collected.
You should have at least 200 spam and 200 non-spam messages for training
(the more, the better). It is critically important that you do not train
your system on your test data! The proportion of spam to non-spam
messages need not be 50/50, but it should be the same in the training
set as in the test set.
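The 75%/25% split of each class can be done in a few lines. A hypothetical sketch in Python (the seed and message names are placeholders for your own data):

```python
import random

# Illustrative 75/25 split of one class of messages into training and
# test sets. Shuffling first avoids accidentally putting all of the
# oldest mail into the training set.

def split(messages, train_fraction=0.75, seed=0):
    msgs = list(messages)
    random.Random(seed).shuffle(msgs)   # fixed seed => reproducible split
    cut = int(len(msgs) * train_fraction)
    return msgs[:cut], msgs[cut:]

spam_train, spam_test = split([f"spam{i}" for i in range(200)])
ham_train, ham_test = split([f"ham{i}" for i in range(200)])
print(len(spam_train), len(spam_test))   # 150 50
```

Splitting each class separately guarantees the spam/non-spam ratio is the same in the training and test sets.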
The Problem of Small Probabilities
Care must be taken in handling small and zero probabilities in a Bayesian
classifier for two reasons:
- Zeros in the conditional
probability tables can make the classifier fragile. Suppose that a
particular word is never seen in any non-spam in the training data (but it
has been seen in some spam, and therefore is in the dictionary).
Then, if the probability of that word conditioned on non-spam is set to
zero, then any message containing that word will be classified as spam
with probability 100%. While this may be desirable if the word in
question is "Viagra", it is probably not a good idea for the
word "money" (especially if you get a message with a job offer
from Bill Gates that says "money is no object"). The
solution to this problem is to set the conditional probability for an
unseen word to a very small, non-zero probability. The value should
be smaller than the smallest value assigned to any word - that is, smaller
than the value assigned to a word that is seen only once in the training
data. A simple rule of thumb is to set the probability to 1/4 of the
smallest probability assigned to any seen word. (For more
mathematically rigorous ways of solving this problem look for solutions to
the "sparse data problem" in a good statistics textbook or on
the web.)
- Multiplying together small
probabilities can lead to numeric underflow. For example, suppose
you are calculating
P(w1,...,wk|Spam) =
P(w1|Spam)...P(wk|Spam)
where each P(wi|Spam)=0.01 and k=100. This gives a value of
10**(-200). This problem can be solved by representing probabilities
by their logarithms. Using logs has the further advantage that
multiplication is replaced by addition. Note that using logs
requires you to adopt one of the solutions to problem 1, since you cannot
take the log of zero.
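Both fixes can be sketched together. The snippet below, with made-up counts and a toy dictionary, applies the 1/4-of-smallest rule to unseen words and scores messages with sums of log probabilities rather than products:

```python
import math

def make_cond_probs(word_counts, n_class_msgs, dictionary):
    """Estimate P(word | class) for every dictionary word, smoothed."""
    probs = {w: word_counts.get(w, 0) / n_class_msgs for w in dictionary}
    smallest_seen = min(p for p in probs.values() if p > 0)
    floor = smallest_seen / 4        # rule-of-thumb floor for zeros
    return {w: (p if p > 0 else floor) for w, p in probs.items()}

def log_score(message_words, cond_probs, log_prior):
    """log P(class) + sum of log P(word | class) over present words."""
    return log_prior + sum(math.log(cond_probs[w])
                           for w in message_words if w in cond_probs)

# Invented counts: out of 100 spam messages, 40 contained "money", etc.
dictionary = {"money", "meeting", "viagra"}
spam_p = make_cond_probs({"money": 40, "viagra": 25}, 100, dictionary)
ham_p = make_cond_probs({"money": 10, "meeting": 60}, 100, dictionary)

msg = {"money", "viagra"}
s = log_score(msg, spam_p, math.log(0.5))   # assuming equal priors
h = log_score(msg, ham_p, math.log(0.5))
p_spam = math.exp(s) / (math.exp(s) + math.exp(h))
print(p_spam)   # about 0.976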
Other Sources of Data
Several sets of spam and non-spam email have been collected in public
machine learning databases. Here is a copy of one of them, lingspam_public.tar.gz. The email
messages here have been "semi-digested", by having words converted to
lowercase, multiple spaces and newlines deleted, and so on. You might
find it useful to use this data while building and testing your system, before
you have your own data set ready to go.
Deadlines
By Monday Nov 24 you must have formed a team. One person
from the team should send email to Jeffrey <jbigham@cs.washington.edu>
letting him know who is on your team. Jeffrey will assign a shared
workspace on the instructional servers to your team. If you have been
unable to join a team by that day send email to Jeffrey and he will assign
any
remaining people to new or existing teams.
By 10am Friday Dec 5 you should turn in your results using the
turn-in procedure described below. You should turn in both your program
code and a write-up of about 4 pages that clearly describes:
- What you implemented,
including any extensions
- A high-level description of
the data structures and algorithms you used, and significant design
decisions
- How you evaluated your
system: How many messages were used for training? For
testing? Were they gathered by one person or several?
- Measurements of the
performance of your system. At a minimum this should include:
- Type I error:
What percentage of test spam messages were incorrectly classified as
non-spam? For naive Bayes, classify a message as spam if its
probability of being spam is > 50%.
- Type II error: What
percentage of test non-spam messages were incorrectly classified as spam?
- A brief discussion of
insights you gained.
Turn-in instructions
Login to one of the undergradaute instructional unix machine such as attu,
and then submit your work with a command such as:
turnin -c cse473 yourfile1.c yourfile2.c
writeup.txt
You can submit as many files as you want on the command line, but each
submission overwrites any previous ones - so you need to make sure to submit
everything again if you decide to submit multiple times. For more information
check the man page for turnin.