Classifying Spam using Machine Learning
(Based on a web page by Andy Menz)
If you've ever used electronic mail, chances are good that you've encountered the malevolent creature known only as spam. Whether titled "Free Timeshares!!!" or "Hot XXX Action", you can usually tell without even opening the message that it's spam. The problem is, how do you train a computer to know whether a message is spam or not? At first glance, it may seem simple: just search the message for certain words, like "XXX" or "FREE MONEY", and delete any message that contains them. Unfortunately, spammers are smarter than that, and once they understand your filter, they will find ways around it. For instance, what was once "XXX" and caught by a simple filter may morph into "--XxXX--". Many early filters are no longer effective because spam is constantly changing. So, to counter it, we need a filter that keeps changing too. Here we enter the fields of text classification and machine learning.
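The weakness of a fixed word list is easy to see in a few lines of Java. The sketch below is only an illustration: the class name KeywordFilter and its blocked-phrase list are made up for this example, and the phrases are the ones quoted above.

```java
class KeywordFilter {
    // Hypothetical blocked-phrase list, using the phrases quoted in the text.
    static final String[] BLOCKED = {"XXX", "FREE MONEY"};

    // Flags a message as spam if it contains any blocked phrase verbatim.
    static boolean isSpam(String message) {
        for (String phrase : BLOCKED) {
            if (message.contains(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isSpam("Hot XXX Action"));      // prints true
        System.out.println(isSpam("Hot --XxXX-- Action")); // prints false: the disguised form slips past
    }
}
```

Every new disguise would need another hand-written rule, which is why a fixed list cannot keep up.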
Text classification is a field that focuses on teaching machines to sort documents into classes. Your favorite search engine does something similar fairly well. Type in "eggplant" and it sifts through millions of documents and returns those most relevant to eggplants. Now, what if we apply this technology to classifying spam emails? It turns out that many researchers have had a great deal of success using machine learning (ML) algorithms to detect spam. ML algorithms are interesting because the way they classify depends on the data they are trained on. So, if a classifier stops working well after a period of time (because the form of spam has changed), one merely needs to rebuild it using more recent emails, and the ML algorithm will output a new classifier that is better matched to current spam. In this way, the filter can be kept up to date, and spammers will find it much harder to get their wares past it.
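To make the idea of rebuilding a classifier concrete, here is a minimal sketch of a trainable filter. It uses a small Naive Bayes model with add-one smoothing, which is an assumption made for this illustration and not necessarily the classifier used in the experiments. The class TinyNaiveBayes and the toy messages are invented for the example.

```java
import java.util.*;

class TinyNaiveBayes {
    // word -> {count in legitimate mail, count in spam}
    private final Map<String, int[]> counts = new HashMap<>();
    private final int[] totalWords = new int[2];
    private final int[] docs = new int[2];

    // Builds a model from scratch using the given labeled messages.
    static TinyNaiveBayes train(List<String> messages, List<Boolean> spamLabels) {
        TinyNaiveBayes model = new TinyNaiveBayes();
        for (int i = 0; i < messages.size(); i++) {
            int c = spamLabels.get(i) ? 1 : 0;
            model.docs[c]++;
            for (String w : messages.get(i).toLowerCase().split("\\s+")) {
                model.counts.computeIfAbsent(w, k -> new int[2])[c]++;
                model.totalWords[c]++;
            }
        }
        return model;
    }

    // Compares smoothed log-probabilities of the two classes.
    boolean isSpam(String message) {
        double[] score = new double[2];
        int vocab = counts.size();
        for (int c = 0; c < 2; c++) {
            score[c] = Math.log((docs[c] + 1.0) / (docs[0] + docs[1] + 2.0));
            for (String w : message.toLowerCase().split("\\s+")) {
                int[] n = counts.get(w);
                if (n == null) continue; // words never seen in training carry no evidence
                score[c] += Math.log((n[c] + 1.0) / (totalWords[c] + vocab));
            }
        }
        return score[1] > score[0];
    }

    public static void main(String[] args) {
        List<String> mail = new ArrayList<>(Arrays.asList(
            "free money now", "free xxx action", "meeting at noon", "lunch at noon tomorrow"));
        List<Boolean> labels = new ArrayList<>(Arrays.asList(true, true, false, false));

        TinyNaiveBayes oldModel = train(mail, labels);
        System.out.println(oldModel.isSpam("fr33 m0ney tomorrow")); // prints false

        // Spam has changed form: add recent examples and rebuild.
        mail.addAll(Arrays.asList("fr33 m0ney now", "fr33 m0ney offer"));
        labels.addAll(Arrays.asList(true, true));
        TinyNaiveBayes newModel = train(mail, labels);
        System.out.println(newModel.isSpam("fr33 m0ney tomorrow")); // prints true
    }
}
```

The code that classifies never changes. Only the training data does, which is what lets the filter follow the spammers.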
Before the classifier can be trained, we first need data to train it with. It turns out that a classifier works much better if we take the time to analyze a collection of typical emails and determine which features (words) help the most in classification. These features are combined into a feature vector for each message, and the vectors are then used to train a classifier. The program FeatureFinder (see below for details) takes a corpus of over 3000 emails (about 16% spam) and applies a variety of feature selection methods to find the best features for building a classifier. Some of the parameters that can be experimented with:
Feature Vector Size - the number of features to use when training the classifier
Feature Vector Type - Boolean, TF, or TF-IDF
Stop Terms - words that can be ignored like "a", "as", "the", etc.
Word Stemming - removing suffixes so that related words share one form, e.g. "building" and "builder" both become "build"
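The three feature vector types can be illustrated with a short Java sketch. It assumes the usual textbook definitions: Boolean records whether a feature word is present, TF counts how often it occurs, and TF-IDF multiplies TF by log(N / df), where N is the number of messages and df is the number of messages containing the word. Whether FeatureFinder uses exactly this TF-IDF variant is not stated here, and the class FeatureVectors, its stop-term list, and the sample messages are invented for the example. Stemming is left out to keep the sketch short.

```java
import java.util.*;

class FeatureVectors {
    // Stop terms taken from the examples in the list above.
    static final Set<String> STOP_TERMS = new HashSet<>(Arrays.asList("a", "as", "the"));

    // Lowercases the text, splits on non-letters, and drops stop terms.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_TERMS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // TF: how many times each feature word occurs in the message.
    static double[] tf(List<String> tokens, List<String> features) {
        double[] v = new double[features.size()];
        for (String t : tokens) {
            int i = features.indexOf(t);
            if (i >= 0) v[i] += 1.0;
        }
        return v;
    }

    // Boolean: 1 if the feature word occurs at all, otherwise 0.
    static double[] toBoolean(double[] tf) {
        double[] v = new double[tf.length];
        for (int i = 0; i < tf.length; i++) v[i] = tf[i] > 0 ? 1.0 : 0.0;
        return v;
    }

    // IDF: log(N / df) per feature (0 if the word appears in no message).
    static double[] idf(List<List<String>> docs, List<String> features) {
        double[] v = new double[features.size()];
        for (int i = 0; i < v.length; i++) {
            int df = 0;
            for (List<String> d : docs) if (d.contains(features.get(i))) df++;
            v[i] = df == 0 ? 0.0 : Math.log((double) docs.size() / df);
        }
        return v;
    }

    // TF-IDF: term frequency weighted by inverse document frequency.
    static double[] tfIdf(double[] tf, double[] idf) {
        double[] v = new double[tf.length];
        for (int i = 0; i < tf.length; i++) v[i] = tf[i] * idf[i];
        return v;
    }

    public static void main(String[] args) {
        List<String> features = Arrays.asList("free", "money", "meeting");
        List<List<String>> docs = new ArrayList<>();
        docs.add(tokenize("Free money! Free money as a gift"));
        docs.add(tokenize("The meeting is at noon"));
        docs.add(tokenize("Free lunch at the meeting"));

        double[] counts = tf(docs.get(0), features);
        System.out.println(Arrays.toString(toBoolean(counts))); // prints [1.0, 1.0, 0.0]
        System.out.println(Arrays.toString(counts));            // prints [2.0, 2.0, 0.0]
        System.out.println(Arrays.toString(tfIdf(counts, idf(docs, features))));
    }
}
```

In the first message "free" and "money" have the same TF, but "money" gets the larger TF-IDF weight because it appears in fewer messages of this toy corpus.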
For more information on how all these work, see Menz's report below.
Once the features have been selected, it's time to create a classifier. We will use classifiers available in the Weka toolkit (see link below).
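Weka classifiers read their training data from ARFF files, which is the format FeatureFinder produces. As a rough idea of what such a file contains, the sketch below writes feature vectors in ARFF syntax: a relation name, one attribute declaration per feature, a nominal class attribute, and one comma-separated data row per message. The class ArffSketch, the attribute names, and the class labels spam and legit are assumptions for illustration. The exact layout of FeatureFinder's output may differ.

```java
import java.util.Arrays;
import java.util.List;

class ArffSketch {
    // Renders feature vectors and their labels as the text of an ARFF file.
    static String toArff(String relation, List<String> features,
                         double[][] vectors, String[] labels) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String f : features) {
            sb.append("@attribute ").append(f).append(" numeric\n");
        }
        // The class to predict is a nominal attribute (labels are assumed names).
        sb.append("@attribute class {spam,legit}\n\n@data\n");
        for (int i = 0; i < vectors.length; i++) {
            for (double x : vectors[i]) {
                sb.append(x).append(',');
            }
            sb.append(labels[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[][] vectors = {{2.0, 2.0}, {0.0, 0.0}};
        String[] labels = {"spam", "legit"};
        System.out.print(toArff("spam", Arrays.asList("free", "money"), vectors, labels));
    }
}
```

Saving this text with a .arff extension gives a file in the form that Weka's classifiers accept as training input.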
FeatureFinder.java - Used to create .arff input files for Weka classifiers. It must be compiled before use.
SpamDiagnostic.exe - Used to analyze diagnostic files from FeatureFinder.java. Lets you see which features were used in the feature vector.
Weka version 3.2.3 - A powerful machine learning toolkit. Download and extract to a directory.
The Ling-Spam Dataset - Link to the input dataset.
Andy Menz's Report - Report on experiments using Weka for spam filtering. (MS Word .doc format)
FeatureFinder Documentation - Explains how to run FeatureFinder. (MS Word .doc format)
SpamDiagnostic Documentation - Explains how to run SpamDiagnostic. (MS Word .doc format)