CSE 473: Artificial Intelligence

 

Assignment #5

Bayesian Networks and Machine Learning

 

Due: Wednesday, December 6, 2006 at the beginning of class

 

Reading: Chapters 14, 18, and 20.

 

 

  1. Probabilistic Reasoning and Bayesian Networks (20 points: 10 for (a), 10 for (b)). This problem is based on the Burglar Alarm Bayesian network shown in Figure 14.2, with its associated CPTs.
    a. Use inference by enumeration to compute the following (see Lecture Slides and pages 504-505 in the text for an example):

      i.  P(Burglary | JohnCalls = true, MaryCalls = false)
      ii. P(Alarm)

    b. Now, compute P(Burglary | JohnCalls = true, MaryCalls = false) using the variable elimination algorithm. Show all the steps and factors involved (see pages 507-508 in the text for an example; a reminder of the general form appears below).
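As a reminder (this restates the general form from the text, not the answers), the query in (i) is
answered by summing out the hidden variables Earthquake (E) and Alarm (A):

    P(B \mid j, \lnot m) \;=\; \alpha \, P(B) \sum_{e} P(e) \sum_{a} P(a \mid B, e)\, P(j \mid a)\, P(\lnot m \mid a)

Variable elimination in (b) evaluates the same sum right-to-left, storing intermediate results as
factors, e.g. f_{J}(A) = P(j \mid A) and f_{\lnot M}(A) = P(\lnot m \mid A), which are multiplied
pointwise and summed out one hidden variable at a time.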

 

  2. Learning Decision Trees (20 points: 10 for (a), 10 for (b)). An important component of the decision tree learning algorithm is the CHOOSE-ATTRIBUTE function. We discussed in class how an attribute can be chosen at a particular level in the tree based on information gain (or equivalently, reduction in entropy) – see Lecture Slides and pages 653-660 in the text. The following questions relate to the Restaurant domain in the text (Figure 18.3).
    a. For the root node, we computed information gain for Patrons and Type in class (also in the text). Compute the information gain at the root node for the attributes Price and Hungry (Hun), and verify that the information gain for these two attributes is less than that for Patrons.
    b. Suppose the algorithm has chosen Patrons to be the root node. Compute the information gain for Price and Hungry at the next level in the tree along the branch labeled “Full” (it might be helpful to look at Figure 18.4 (b), which depicts the two-level tree with Hungry at the second level). The relevant formulas are restated below.
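For reference in both parts, the entropy and information gain formulas from the text (pp. 659-660) are:

    I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) \;=\; -\tfrac{p}{p+n}\log_2\tfrac{p}{p+n} \;-\; \tfrac{n}{p+n}\log_2\tfrac{n}{p+n}

    Gain(A) \;=\; I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) \;-\; \sum_{i=1}^{v} \tfrac{p_i+n_i}{p+n}\, I\!\left(\tfrac{p_i}{p_i+n_i}, \tfrac{n_i}{p_i+n_i}\right)

where p and n count the positive and negative examples reaching the node, and p_i, n_i count those
sent down the i-th branch of attribute A. In part (b), p and n count only the examples reaching the
“Full” branch.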

 

  3. Machine Learning for Spam Detection (60 points).
“TotallyOutOfDebt Overnight” “creditcarddebt” “Viagra” “Greaat product” 
“Hottest Pick” “Stock Symbol” “Projjected Priice” “Home..Owner” 
“V t a g r a” “HOTG*WIRLS” “L*cingerie” 
These are examples of words (PG-13 only) found in spam email received at a UW CSE account 
over the past few weeks. Gone are the days when you could hit the delete button just by looking at 
the Subject line for words such as “XXX” or “Earn $$$ at home”. Spammers are constantly finding 
new ways to get around current state-of-the-art spam filters (e.g., change “Viagra” to “V t a g r a”).  
Many early filters based on straightforward principles such as a fixed set of keywords are no longer 
effective because spam is constantly changing.  What we need are spam filters that also change and 
adapt to new spamming strategies.  Enter…machine learning.
 
In this exercise, you will test the performance of several of the machine learning algorithms we 
discussed in class on a dataset containing examples of spam and legitimate email. The exercise is 
inspired by the work of Andy Menz at Iowa State. First, read the following report by Menz:
http://www.cs.washington.edu/education/courses/cse473/CurrentQtr/homework/Menz_Report.doc 
 
Then, browse the following website, which contains further information and links for downloading 
various files:
http://www.cs.washington.edu/education/courses/cse473/CurrentQtr/homework/hw5_prob3.html
 
Now you are ready to tackle this assignment. 
Download the Weka machine learning Java toolkit using this link: Weka version 3-2-3 (note: download 
the version in this link and not the latest version from the Weka home page; the FeatureFinder.java 
program we will use only works with the older version). Extract all files to a directory. Double click the 
executable jar file called “weka” and you should get a window for the Weka GUI Chooser with a 
picture of the Weka bird on it. Click on the “Explorer” button and you are ready to start exploring the 
world of classifiers. The interface is very intuitive, but if you need more information, you can read the 
files README and Tutorial.pdf in your Weka directory.
 
Next, download the feature extractor program FeatureFinder.java that extracts features from email text 
files. Read the documentation for FeatureFinder. Create a directory called Ling-Spam in the same 
directory as FeatureFinder. Download the Ling-Spam dataset and extract it into your Ling-Spam 
directory (you should end up with files in subdirectories of the form 
Ling-Spam/lingspam_public/bare/part#, where # is 1-10). Compile and execute FeatureFinder using 
the following setting of parameters: Feature Vector = Boolean, No Stemming, No Stop Terms, Number 
of Features = 250.  The output will be an .arff file on which you can run the classifiers in Weka.
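For orientation, an .arff file is plain text: a header declaring each attribute, followed by one
comma-separated line per email. A hypothetical fragment (the actual attribute names and class labels
are whatever FeatureFinder emits) might look like:

    @relation spam_features
    @attribute "money" {0,1}         % hypothetical Boolean word feature
    @attribute "conference" {0,1}    % hypothetical Boolean word feature
    @attribute class {spam,legit}    % hypothetical class labels
    @data
    1,0,spam
    0,1,legit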
 
Use the Weka Explorer GUI to run all your experiments below.
 
Turn in a write-up answering the following questions:
a. In his report, Menz evaluates the performance of two classifiers: Naïve Bayes and Boosting 
   using Naïve Bayes classifiers. What is the accuracy rate (% correctly classified instances) 
   based on 10-fold cross-validation obtained using the following classifiers 
   (use Weka’s default values for all classifier parameters not specified below):
      i.   Decision Trees (use weka.classifiers.j48.J48 – a C4.5 decision tree learner)
      ii.  Neural Networks with Backpropagation Learning (use hiddenLayers = 1 in the Weka 
           classifier weka.classifiers.neural.NeuralNetwork)
      iii. Bagging using Decision Trees of the type in (i) above (use weka.classifiers.Bagging 
           with classifier set to weka.classifiers.j48.J48)
      iv.  Boosting using Decision Trees of the type in (i) above (use weka.classifiers.AdaBoostM1 
           with classifier set to weka.classifiers.j48.J48)
b. Look through your own email and create a test dataset containing 10 random examples of spam 
   email and 10 random examples of legitimate email you have received over the past month. 
   Modify FeatureFinder.java (call the modified version MyFeatureFinder.java) so that it processes 
   the email text files in your test dataset rather than the files stored in the various Ling-Spam 
   subdirectories. Run MyFeatureFinder (with the same parameter settings as above) on your dataset 
   to get an .arff file. Pick the classifier that performed best on Ling-Spam (out of the four in 
   (a) above) and run this pre-trained classifier on the “Supplied test set” that you created. 
   Copy and paste the output of Weka into your write-up. What is the accuracy rate of the 
   classifier on your dataset?
c. Is there a significant difference between the accuracy rates for your best classifier in (a) and (b), 
   i.e., between the Ling-Spam dataset and your test dataset? Discuss some possible reasons for this 
   difference.
d. What are some potential ways to improve the classifier so that it does a better job of filtering 
   spam in your email?
 

 

Extra Credit Problems

(Due Saturday, December 9, 2006 by midnight via email to Raj; include “473” in the Subject line)

 

Note: The following extra credit problems are optional – you will not be penalized in your overall class 
grade if you decide not to do them. Your score on these problems, if attempted, will be factored into your 
grade after the final course grades have been determined based on all other homeworks and exams.

 

Each problem will add a maximum of 5% to your overall course grade (Homeworks 1-5 each contribute 10%).

 

Extra Credit Problem 1: Going all out on Feature Selection. In Problem 3 above, we used a specific 
setting of parameters for FeatureFinder. Read the Experiment and Experimental Results sections of 
Menz’s report. Using the same baseline and values for Features as in his Conditions 0 through 13, run 
Boosting with Decision Trees (as in Problem 3a (iv) above) on the corresponding .arff files. Report the 
accuracy rates. Copy and paste the output of Weka for some of the conditions. Did any of these feature 
settings result in better performance than the best classifier you obtained in Problem 3a? Did any of 
them beat the best performing classifier in Menz’s report?

 

Extra Credit Problem 2: Doing it better with better data (maybe). The Ling-Spam dataset is based on 
email sent to a linguistics mailing list. A more general dataset is the SpamAssassin corpus (note: 
ham = non-spam email). Create a dataset of spam/ham email messages from the SpamAssassin corpus and 
run a version of MyFeatureFinder to obtain an .arff file. Run the experiments in Problem 3 (a). Copy 
and paste into your write-up the output of Weka for each of the classifiers. Compare these results 
with those you obtained in Problem 3 (a). Which classifier performed best?

 

Extra Credit Problem 3: Evolving spam (based on an idea by Ravi Kiran and Indriyati Atmosukarto). 
Spamming techniques typically evolve in response to progressively better spam filters. As a result, the 
criteria that a spam filter might rely on to catch spam can also be expected to evolve over time. The goal of 
this problem is to get a snapshot of spam evolution in action. Download a dataset from the SpamAssassin 
corpus and split the data into several different time periods (e.g., Jan-March, March-May, etc.). For each 
time period, run the J48 decision tree learning algorithm in Weka. Copy and paste into your write-up the 
output of Weka for the first and last time periods. Recall that the decision tree algorithm chooses attributes 
according to information gain. Note down the top 6 attributes used by the decision tree for each time period 
(get the top 6 by doing a breadth-first traversal of the tree; a sketch of such a traversal appears below). 
Construct a table where each column lists the top 6 attributes for one time period. Based on the table, do 
you see any trends in the evolution of spam in this dataset? Are there attributes that drop in importance, 
remain relatively constant, or rise in importance over the time course of the dataset?
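Weka prints the learned J48 tree as text, so in practice you can read the top attributes directly off
the printout, level by level. As a minimal sketch of the traversal itself (the Node class below is
hypothetical, for illustration only, and is not part of Weka):

    import java.util.*;

    // Hypothetical decision-tree node: an attribute test with child subtrees.
    class Node {
        String attribute;                            // attribute tested here (null at a leaf)
        List<Node> children = new ArrayList<Node>();
    }

    public class TopAttributes {
        // Return up to k distinct attribute names in breadth-first (top-down) order.
        static List<String> topK(Node root, int k) {
            List<String> result = new ArrayList<String>();
            Queue<Node> queue = new LinkedList<Node>();
            queue.add(root);
            while (!queue.isEmpty() && result.size() < k) {
                Node n = queue.remove();
                if (n.attribute != null && !result.contains(n.attribute)) {
                    result.add(n.attribute);
                }
                queue.addAll(n.children);
            }
            return result;
        }
    }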