CSEP 573: Applications of Artificial Intelligence

 

Homework Assignment #4

 

Due: Before midnight Tuesday, March 16, 2010

 

Turn-in procedure:  Use the dropbox: https://catalysttools.washington.edu/collectit/dropbox/afriesen/8677

to submit a folder containing a document with all your answers and any supporting files.  Name the folder HW4_<LastnameFirstname>.

 

 

Reading: Chapters 18 and 20 in AIMA (2nd/3rd ed.). Chapters on Weka in Witten & Frank (optional).

 

 

Machine Learning for Spam Detection

 

“Viagra for free, pay shipping only” “RE: SALE 70% OFF on Pfizer” “PLEASE VERY CONFIDENTIALl AND CALL ME:+226 78 50 64 94” “URGENT BUSINESS TRANSACTION” “WINNING NOTIFICATION !!!!!!!!” “HOT 79% OFF” “Your erotic nights are here to stay” “Get maximum form love” “Tax Refund Notification !” “Claim your degree.” “Michael Jackson nicht Tod”

 

These are examples of words (PG-13 only) found in spam email received at a UW CSE account over the past month. Spammers are constantly finding new ways to get around current spam filters.  Many early filters based on straightforward principles, such as a fixed set of keywords, are no longer effective because spam is constantly changing.  What we need are spam filters that can also change and adapt to new spamming strategies.  The job seems tailor-made for machine learning.

 

In this exercise, you will explore the application of several machine learning algorithms we discussed in class on a dataset containing examples of spam and legitimate email. The exercise is based on the work of Andy Menz at Iowa State.

 

To begin, read this report by Menz and browse this webpage, which contains further information and links for downloading various files.

 

Next, download the Weka machine learning Java toolkit using this link: Weka version 3-2-3 (note: download the version in this link and not the latest version from the Weka home page; the FeatureFinder.java program we will use only works with the older version). Extract all files to a directory. Double-click the executable jar file called “weka” and you should get a window for the Weka GUI with a picture of the Weka bird on it. Click on the “Explorer” button to start exploring the world of machine learning Weka-style. The interface is very intuitive, but if you need more information, you can read the files README and Tutorial.pdf in your Weka directory or consult the Witten & Frank Data Mining textbook.

 

Download the feature extractor program FeatureFinder.java, which extracts features from email text files. Read the documentation for FeatureFinder. Create a directory called Ling-Spam in the same directory as FeatureFinder. Download the Ling-Spam dataset and extract it into your Ling-Spam directory (you should end up with files in various subdirectories of the form Ling-Spam/lingspam_public/bare/part# where # is 1-10). FeatureFinder requires that you decompress (gunzip) all of the files in the folders in lingspam_public/bare/part# before running it.
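In Boolean mode, FeatureFinder represents each message by whether or not each of the top-ranked corpus words appears in it. The sketch below illustrates that idea only; the vocabulary and sample message are invented for illustration, and this is not FeatureFinder's actual code.

```java
import java.util.*;

// Sketch of boolean ("word presence") feature extraction, the kind of
// representation FeatureFinder produces in Boolean mode. The vocabulary
// and message below are made up; a real run uses the top words ranked
// over the whole Ling-Spam corpus.
public class BooleanFeatures {
    static int[] extract(String message, List<String> vocabulary) {
        // Lowercase and split on non-word characters to get the token set.
        Set<String> tokens = new HashSet<>(
            Arrays.asList(message.toLowerCase().split("\\W+")));
        int[] features = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            features[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> vocab =
            Arrays.asList("free", "viagra", "meeting", "linguistics");
        int[] f = extract("FREE offer: claim your FREE prize now", vocab);
        System.out.println(Arrays.toString(f));  // [1, 0, 0, 0]
    }
}
```

Note that a boolean vector records only presence, not counts; FeatureFinder's other feature-vector settings (described in Menz's report) keep more information.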

 

Compile FeatureFinder with weka.jar (in the weka-3-2-3 folder). Execute FeatureFinder using the following setting of parameters: Feature Vector = Boolean, No Stemming, No Stop Terms, Number of Features = 250.  The output will be an .arff file you can use with Weka. To get familiar with the .arff file format, you may open the file using a text editor such as Notepad. Other example data files can be found in the “data” directory in the weka-3-2-3 folder.
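For reference, an .arff file pairs a header declaring the relation and its attributes with comma-separated data rows, one per instance. A hand-made fragment in the general shape of FeatureFinder's output (the attribute names here are invented, not the ones your run will produce):

```
@relation spam_features

@attribute free {0,1}
@attribute money {0,1}
@attribute class {spam,nonspam}

@data
1,1,spam
0,0,nonspam
```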

 

Use the Weka Explorer GUI to run all your experiments below.

Turn in a write-up answering the questions for each experiment.

(Each problem below is worth 25 points)

 

  1. In the Weka Knowledge Explorer window, use “Open file…” to open your arff file. Investigate the performance of K-means and EM-based clustering algorithms on your dataset (use weka.clusterers.SimpleKMeans and weka.clusterers.EM).
    1. You would like the algorithms to find two clusters corresponding to “Spam” and “Non-Spam” respectively. Set the number of clusters to 2 and choose the “Classes to cluster evaluation” mode. Run each algorithm for 5 different seed values and write down the resulting accuracy (percent correctly clustered). In each case, you may visualize your clustering output by right-clicking on the last item in “Result list” and selecting “Visualize cluster assignments.” Note down the best accuracy you can get for each algorithm over the 5 seed values you tried. Copy and paste the output of Weka for each algorithm for the seed value that gave the best accuracy. Between K-means and EM, which algorithm yielded higher accuracy?
    2. It is likely the data set contains more than two clusters. Try to find this “natural” number of clusters by running the EM algorithm for different numbers of clusters: use the values 1, 2, 3, 4, 5, 10, 50, 100. Train on 60% of the data and compute the log likelihood values for the remainder. Plot these log likelihood values. Based on the plot, what is a good guess for the “natural” number of clusters in the data? Now set the number of clusters to -1. This causes Weka to select the number of clusters through cross-validation. How does this number compare with your estimate?
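In “Classes to clusters evaluation” mode, Weka maps each cluster to a class and reports the fraction of instances whose class matches their cluster's. With two clusters and two classes this amounts to trying both mappings and keeping the better one, as in this small sketch (the assignments and labels below are made up):

```java
// Sketch of how "percent correctly clustered" can be computed for the
// two-cluster, two-class case: try both cluster-to-class mappings and
// keep the better one. The arrays in main are hypothetical.
public class ClusterAccuracy {
    // clusters[i]: cluster assigned to instance i (0 or 1)
    // labels[i]:   true class of instance i (0 = nonspam, 1 = spam)
    static double accuracy(int[] clusters, int[] labels) {
        int same = 0;
        for (int i = 0; i < clusters.length; i++)
            if (clusters[i] == labels[i]) same++;
        // The other mapping gets the complementary count.
        int best = Math.max(same, clusters.length - same);
        return (double) best / clusters.length;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 1, 1, 1, 0};  // hypothetical K-means output
        int[] labels   = {0, 0, 1, 1, 0, 0};  // hypothetical true classes
        System.out.println(accuracy(clusters, labels));  // 5 of 6 match
    }
}
```

Because cluster numbering is arbitrary (and depends on the seed), accuracy must be computed over mappings this way; a raw comparison of cluster IDs to class IDs would be meaningless.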

 

  2. What is the accuracy rate (% correctly classified instances) based on 10-fold cross-validation for each of the following classifiers (use Weka’s default values for all classifier parameters not specified below)?
    1. Decision Tree (use weka.classifiers.j48.J48 – a decision tree learner). You may visualize the decision tree by right clicking on the highlighted item in “Result list”.
    2. Neural Network with no hidden layers and Backpropagation Learning (use the Weka classifier weka.classifiers.neural.NeuralNetwork). Click in the classifier box and then click on “More” to find out how to set the number of hidden layers and number of units in each layer. Note that the GUI allows you to visualize the network but also may slow down the training, so you may want to set GUI to False during training.
    3. Neural Network with one hidden layer and Backpropagation Learning. Set the number of hidden units to half the number of input units.
    4. Support Vector Machine (SVM)  (use the Weka classifier weka.classifiers.SMO)
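In 10-fold cross-validation, every instance lands in exactly one test fold, and the reported accuracy pools the correct predictions over all ten folds. The schematic below shows only the fold bookkeeping; plugging in an actual learner (e.g. Weka's SMO) is left abstract, and Weka additionally shuffles and stratifies the folds.

```java
import java.util.*;

// Schematic of k-fold cross-validation fold assignment: instance i goes
// into test fold i % k. Training/testing a real classifier on each split
// is indicated only by comments.
public class TenFold {
    static List<List<Integer>> testFolds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(i);
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = testFolds(25, 10);
        for (int f = 0; f < folds.size(); f++) {
            // Train on every instance NOT in folds.get(f), test on
            // folds.get(f), and accumulate the correct-prediction count;
            // final accuracy = total correct / n.
            System.out.println("fold " + f + ": "
                + folds.get(f).size() + " test instances");
        }
    }
}
```

The point of the scheme is that each classifier is always evaluated on data it was not trained on, while still using every instance for testing exactly once.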

 

  3. In the problems above, we used a specific setting of parameters for FeatureFinder. Read the Experiment and Experimental Results sections of Menz’s report. Using the same baseline and values for Features as in his Conditions 0 through 13, run the SVM learner (weka.classifiers.SMO) on the corresponding .arff files. Report the accuracy rates. Copy and paste the output of Weka for the last two conditions. Did any of the feature settings result in better performance than the best classifier in Problem 2 above? Did any of them beat the best performing classifier in Menz’s report?

 

  4. Look through your own email (and spam folder) and create a test dataset containing 10 random examples of spam email and 10 random examples of legitimate email you have received over the past month. Modify FeatureFinder.java (saving it as MyFeatureFinder.java) so that it processes the email text files in your test dataset rather than the files stored in the various Ling-Spam subdirectories. Run MyFeatureFinder (with the best performing setting of parameters from Problem 3) on your dataset to get an .arff file. Pick the trained SVM classifier that performed the best in Problem 3 and run this pre-trained classifier on the “Supplied test set” that you just created. Copy and paste the output of Weka to your write-up. What is the accuracy rate of the classifier on your dataset?
    1. Is there a significant difference in your best classifier’s accuracy rate between the Ling-Spam dataset and your test dataset? Discuss some possible reasons for this difference.
    2. What are some potential ways to improve the classifier so that it does a better job of filtering spam in your email?

 

 

Extra Credit Problem 1 (25 points): Doing it better with better data (maybe). The Ling-Spam dataset is based on email sent to a linguistics mailing list. A more general dataset is the SpamAssassin corpus.  Create a dataset of spam/ham email messages (ham = non-spam email) from the SpamAssassin corpus and run a version of MyFeatureFinder to obtain an .arff file. Repeat the experiments in Problems 3 and 4 (but using the SpamAssassin dataset instead of Ling-Spam). Is the accuracy rate of the SpamAssassin classifier on your test dataset better than what you got in Problem 4 using the Ling-Spam classifier?

 

Extra Credit Problem 2 (25 points): Tracking spam evolution. Spamming techniques typically evolve in response to progressively better spam filters. As a result, the criteria that a spam filter might rely on to catch spam can also be expected to evolve over time. The goal of this problem is to get a snapshot of spam evolution in action. Download a dataset from the SpamAssassin corpus and split the data into several different time periods (e.g., Jan-March, March-May, etc.). For each time period, run the J48 decision tree learner (weka.classifiers.j48.J48) in Weka. Copy and paste to your write-up the output of Weka for the first and last time periods. Recall that the decision tree algorithm chooses attributes according to information gain. Note down the top 6 attributes used by the decision tree for each time period (you can get the top 6 attributes via a breadth-first traversal of the decision tree; the tree can be visualized by right clicking on the highlighted item in “Result list”). Construct a table where each column lists the top 6 attributes for one time period. Based on the table, do you see any trends in the evolution of spam in this dataset? Are there attributes that drop in importance, remain relatively constant, or rise in importance over the time course of the dataset?
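Information gain, the criterion J48 uses to rank attributes, is the reduction in class entropy obtained by splitting on an attribute. A small worked sketch on a toy boolean attribute (the counts below are invented for illustration):

```java
// Worked sketch of information gain for a boolean attribute over a
// binary (spam/nonspam) class. The counts in main are invented.
public class InfoGain {
    // Binary entropy in bits of a class split with p positives, n negatives.
    static double entropy(int p, int n) {
        double total = p + n;
        double h = 0.0;
        for (int c : new int[]{p, n}) {
            if (c == 0) continue;  // 0 * log 0 is taken as 0
            double q = c / total;
            h -= q * (Math.log(q) / Math.log(2));
        }
        return h;
    }

    // Gain of splitting (p1+p0 spam, n1+n0 nonspam) into the subset where
    // the attribute is 1 (p1 spam, n1 nonspam) and where it is 0 (p0, n0).
    static double gain(int p1, int n1, int p0, int n0) {
        int p = p1 + p0, n = n1 + n0;
        double total = p + n;
        return entropy(p, n)
             - ((p1 + n1) / total) * entropy(p1, n1)
             - ((p0 + n0) / total) * entropy(p0, n0);
    }

    public static void main(String[] args) {
        // Toy counts: of 8 spam and 8 nonspam messages, "free" appears in
        // 6 spam and 2 nonspam. Entropy drops from 1 bit to about 0.81,
        // so the gain is roughly 0.19 bits.
        System.out.println(gain(6, 2, 2, 6));
    }
}
```

J48 grows the tree by repeatedly choosing the attribute with the highest gain (more precisely, a gain-ratio variant), which is why the attributes near the root are the ones worth tabulating in this problem.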

 

Extra Credit Problem 3 (25 points): Personalizing your spam filter. Create a reasonably large personalized training dataset from your own email containing examples of spam and non-spam email. Repeat the experiments in Problems 3 and 4 (but using your personalized email training dataset instead of Ling-Spam). Is the accuracy rate of your personalized spam classifier on your test dataset better than what you got in Problem 4 using the Ling-Spam classifier?