Classifying Spam using Machine Learning
(Based on a web page by Andy Menz)
Spam
If you've ever used electronic mail,
chances are good that you've encountered the malevolent creature known only as
spam. Whether titled "Free Timeshares!!!" or "Hot XXX
Action", you can usually tell without even looking at the message that
it's spam. The problem is, how do you train a computer to know whether a
message is spam or not? At first glance, it may seem simple - just search
the message for certain words - like "XXX" or "FREE MONEY"
- and delete those messages. Unfortunately, spammers are more resourceful
than that: once they understand your filter, they will find ways around
it. For instance, what was once "XXX" and caught by a simple
filter may morph into "--XxXX--". Many early filters are no
longer effective because spam is constantly changing. So, to counter it,
we need a filter that is constantly changing. Here we enter the fields of
text recognition and machine learning.
Text Classification
Text classification is a field that focuses on teaching machines how to
classify documents into classes. Your favorite search engine can do this
fairly well. Type in "eggplant" and a powerful machine learning
algorithm scans millions of documents and returns only those pertaining to eggplants.
Now, what if we apply this technology to classifying spam emails? It
turns out that many researchers have had a great deal of luck using machine
learning (ML) algorithms to detect spam. ML algorithms are interesting
because they can change the way they classify based on their input. So,
if a classifier stops working well after a period of time (because the form of
spam has changed), one merely needs to rebuild the classifier using more recent
emails and the ML will output a new classifier that's much more
effective. In this way, the filter can never be outdated, and no matter
how hard they try, spammers won't be able to get their wares past our dutiful
filter.
Feature Selection
Before the classifier is trained, we first need data to train it with. It
turns out that a classifier will work much better if we take the time to
analyze a collection of typical emails and determine which features
(words) will help the most in classification. These features are combined
into a feature vector for each message, which can then be used to train a
classifier. The program FeatureFinder (see below for details) takes a
corpus of over 3000 emails (about 16% spam) and applies a variety of feature
selection methods to find the best features for building a classifier. Some of
the parameters that can be experimented with:
Feature Vector Size - the number of features to use when training the classifier
Feature Vector Type - Boolean, TF (term frequency), or TF-IDF (term frequency-inverse document frequency); see the sketch below
Stop Terms - words that can be ignored, like "a", "as", "the", etc.
Word Stemming - removing suffixes, e.g. "building" and "builder" both become "build"
For more information on how all these work, see Menz's report below.
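To make the feature vector types concrete, here is a small Java sketch (not part of FeatureFinder) that computes Boolean, TF, and TF-IDF values for one message against a fixed vocabulary. The vocabulary, stop terms, and messages are made-up placeholders, and the smoothed IDF formula is just one common choice.

import java.util.*;

// Minimal sketch (not FeatureFinder itself): given a small, made-up vocabulary,
// compute Boolean, TF, and TF-IDF feature values for a single message.
public class FeatureVectorSketch {

    // Hypothetical stop terms and vocabulary, chosen for illustration only.
    static final Set<String> STOP_TERMS = new HashSet<>(Arrays.asList("a", "as", "the"));
    static final List<String> VOCABULARY = Arrays.asList("free", "money", "meeting", "xxx");

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "free money free xxx",
                "the meeting is at noon",
                "free meeting agenda");
        String message = "free free money as the xxx";

        List<String> terms = tokenize(message);
        for (String feature : VOCABULARY) {
            long tf = terms.stream().filter(feature::equals).count();    // term frequency
            int bool = tf > 0 ? 1 : 0;                                    // Boolean feature
            long docsWithTerm = corpus.stream()
                    .filter(doc -> tokenize(doc).contains(feature)).count();
            // Smoothed IDF so an unseen term does not cause a division by zero.
            double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
            double tfidf = tf * idf;                                      // TF-IDF feature
            System.out.printf("%-8s boolean=%d tf=%d tfidf=%.3f%n", feature, bool, tf, tfidf);
        }
    }

    // Lower-case the text, split on whitespace, and drop stop terms.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty() && !STOP_TERMS.contains(t)) out.add(t);
        }
        return out;
    }
}

Running it prints one line per vocabulary word, which is essentially one slot of the feature vector for that message under each weighting scheme.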
Classification
Once the features have been
selected, it's time to create a classifier. We will use classifiers
available in the Weka toolkit (see link below).
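As a rough illustration rather than the exact procedure from Menz's report, the sketch below loads an ARFF file produced by FeatureFinder, trains a Naive Bayes classifier, and estimates its accuracy with 10-fold cross-validation. The file name spam.arff is a placeholder, and the import weka.classifiers.bayes.NaiveBayes matches newer Weka releases; in Weka 3.2.3 the class may live directly under weka.classifiers.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

// Sketch: train and evaluate a spam classifier from an ARFF file created by FeatureFinder.
public class TrainSpamClassifier {
    public static void main(String[] args) throws Exception {
        // "spam.arff" is a placeholder name for a FeatureFinder output file.
        Instances data = new Instances(new BufferedReader(new FileReader("spam.arff")));
        data.setClassIndex(data.numAttributes() - 1);   // last attribute is the spam/ham label

        NaiveBayes classifier = new NaiveBayes();
        classifier.buildClassifier(data);

        // 10-fold cross-validation gives a quick estimate of how well the filter generalizes.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Other Weka classifiers (decision trees, support vector machines, and so on) can be swapped in by changing the single constructor call.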
Code
FeatureFinder.java - Used to create .arff input files for Weka classifiers. You will need to compile it first. (A sample of the ARFF format appears after this list.)
SpamDiagnostic.exe - Used to analyze diagnostic files from FeatureFinder.java. Lets you see which features were used in the feature vector.
Weka version 3.2.3 - A powerful machine learning toolkit. Download and extract it to a directory.
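For orientation only, here is a tiny, made-up example of the ARFF format that FeatureFinder's output follows; the attribute names and values are illustrative, not actual output of the program.

@relation spam

@attribute free numeric
@attribute money numeric
@attribute meeting numeric
@attribute class {spam, ham}

@data
2,1,0,spam
0,0,3,ham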
Data
The Ling-Spam Dataset - Link to the input dataset.
Documentation
Andy Menz's Report - Report on experiments using Weka for spam filtering. (MS Word .doc format)
FeatureFinder Documentation - Explains how to run FeatureFinder. (MS Word .doc format)
SpamDiagnostic Documentation - Explains how to run SpamDiagnostic. (MS Word .doc format)