Please submit both code and writeup online by 6:30pm PST on Thursday, May 17, 2012. Please provide all code (sufficiently commented) so that we can run it ourselves. Submit your writeup as a PDF and limit it to four pages with reasonable fonts and margins.
1.0 The dataset we will be using is a subset of the 2005 TREC Public Spam Corpus. You can download it here. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first being the email ID, the second being whether it is spam or ham (non-spam), and the rest being words and their occurrence counts in that email. In preprocessing, I removed non-word characters and performed feature selection similar to what Mehran Sahami did in his original paper on using naive Bayes to classify spam, though with larger cut-offs since our corpus is larger.
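As a concrete illustration of this format, here is a minimal parsing sketch in Python. The exact label strings ("spam"/"ham") are an assumption to be checked against the actual files:

    def parse_line(line):
        # One email per line: "<id> <label> <word1> <count1> <word2> <count2> ..."
        fields = line.split()
        email_id, label = fields[0], fields[1]
        # Remaining fields alternate word, count.
        counts = {fields[i]: int(fields[i + 1]) for i in range(2, len(fields), 2)}
        return email_id, label, counts

    def load_dataset(path):
        with open(path) as f:
            return [parse_line(line) for line in f if line.strip()]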
1.1 Implement the multinomial naive Bayes algorithm to classify spam. Use your algorithm to learn from the training set and report its accuracy on the test set.
1.2 Try various smoothing parameters for the naive Bayes learner. Which values work best?
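The following is a minimal sketch of multinomial naive Bayes with add-alpha (Lidstone) smoothing, assuming the (email_id, label, counts) representation from the parser above; it is one reasonable structure, not the required implementation:

    import math
    from collections import Counter, defaultdict

    def train_nb(examples, alpha=1.0):
        # Multinomial naive Bayes with add-alpha smoothing.
        vocab = {w for _, _, counts in examples for w in counts}
        class_docs = Counter()              # number of emails per class
        word_totals = defaultdict(Counter)  # per-class word counts
        for _, label, counts in examples:
            class_docs[label] += 1
            word_totals[label].update(counts)
        n_docs = sum(class_docs.values())
        model = {}
        for label in class_docs:
            total = sum(word_totals[label].values())
            denom = total + alpha * len(vocab)
            log_prior = math.log(class_docs[label] / n_docs)
            log_cond = {w: math.log((c + alpha) / denom)
                        for w, c in word_totals[label].items()}
            # Words unseen in this class get the smoothed zero-count estimate.
            log_unseen = math.log(alpha / denom) if alpha > 0 else float("-inf")
            model[label] = (log_prior, log_cond, log_unseen)
        return model

    def classify_nb(model, counts):
        def log_score(label):
            log_prior, log_cond, log_unseen = model[label]
            return log_prior + sum(c * log_cond.get(w, log_unseen)
                                   for w, c in counts.items())
        return max(model, key=log_score)

    def accuracy(model, test_set):
        hits = sum(classify_nb(model, counts) == label
                   for _, label, counts in test_set)
        return hits / len(test_set)

To answer 1.2, retrain with a range of alpha values (for example, powers of ten from 0.0001 to 10) and compare the resulting test accuracies.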
Extra-credit Our feature selection makes learning much easier, but it also throws out useful information. For example, the exclamation mark (!) often occurs in spam. Even the format of the email sender matters: when a sender's address appears in the recipient's address book, a typical email client replaces it with the contact's name, which means the email is unlikely to be spam (unless, of course, you are a friend of the spammer!). Sahami's paper discusses a few such features he used in his classifier. For extra credit, you can work with the original files and come up with useful features that improve your classifier. Here are the lists of the files used in train/test.
For those who are interested in adversarial classification, an emerging theme in spam detection, here is a nice KDD-04 paper from UW. Also, NIPS (a major machine learning conference) held a workshop on adversarial learning in 2007 (archive.org: the links don't work, but you can see the titles and authors).
Problem writeup:
2.1 Mitchell 4.9
2.2 Mitchell 4.10
3.0 Implement the bagging algorithm on top of your decision tree learner from assignment one. Make your code as modular as possible: your main bagging module should treat the base learner as a black box and communicate with it via a generic interface that takes a set of instances as input and outputs a classifier, which can then classify any instance. Create a wrapper for ID3 that takes a training set of examples and uses sampling with replacement to generate a new training set of the same size; the wrapper then passes this set on to ID3 to obtain a classifier. Plug this wrapper and your ID3 implementation into the bagging algorithm.
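One way to realize this structure is sketched below. The names id3_learn, the (instance, label) example representation, and the classifier-as-function interface are placeholders for whatever your assignment-one code actually exposes:

    import random
    from collections import Counter

    def bootstrap(examples, rng=random):
        # Sample len(examples) instances with replacement.
        return [rng.choice(examples) for _ in range(len(examples))]

    def make_sampling_wrapper(base_learner, rng=random):
        # Wrap a base learner (examples -> classifier) so that it trains
        # on a bootstrap replicate instead of the original training set.
        def wrapped(examples):
            return base_learner(bootstrap(examples, rng))
        return wrapped

    def bag(base_learner, examples, n_samplings, rng=random):
        # Train n_samplings classifiers on bootstrap replicates and
        # return a majority-vote ensemble classifier.
        wrapped = make_sampling_wrapper(base_learner, rng)
        classifiers = [wrapped(examples) for _ in range(n_samplings)]
        def ensemble(instance):
            votes = Counter(c(instance) for c in classifiers)
            return votes.most_common(1)[0][0]
        return ensemble

Keeping the wrapper separate from the voting logic preserves the required modularity: bag never looks inside the base learner, so ID3 could be swapped for any other learner with the same interface.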
Run experiments using the UCI Molecular Biology (Promoter Gene Sequences) Data Set, which we have provided in ARFF format (106 total instances, 57 attributes: train and test). Use information gain as the evaluation criterion and disable pruning for all experiments. Your goal is to predict whether a given DNA sequence is a promoter or not (class attribute: + or -).
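If you need to read the ARFF files yourself rather than reusing an existing loader, a minimal sketch follows. It assumes a dense, comma-separated @data section with the class label as the last attribute, and it ignores ARFF quoting and sparse formats:

    def load_arff(path):
        # Minimal ARFF reader (sketch): skips the header and returns a list
        # of (attribute_tuple, class_label) pairs from the @data section.
        data, in_data = [], False
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if not line or line.startswith('%'):
                    continue            # blank line or comment
                if line.lower().startswith('@data'):
                    in_data = True      # instances follow this marker
                    continue
                if in_data:
                    *values, label = line.split(',')
                    data.append((tuple(values), label))
        return data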
Background: A promoter is a region of DNA that facilitates the transcription of a particular gene. Promoter compilations and analyses have led to computer programs that predict the location of promoter sequences based on homology either to the consensus sequence or to a reference list of promoters. Such programs are of practical significance when searching new sequences.
3.1 Plot the test set accuracy for 1, 3, 5, 10, and 20 samplings.
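A minimal experiment loop for this plot might look like the following, assuming the bagging sketch above, a hypothetical id3_learn(examples) base learner, and train/test lists of (instance, label) pairs loaded from the ARFF files:

    def test_accuracy(classifier, test_set):
        # Fraction of test instances the ensemble classifies correctly.
        return sum(classifier(x) == y for x, y in test_set) / len(test_set)

    for n in (1, 3, 5, 10, 20):
        ensemble = bag(id3_learn, train, n)
        print(n, test_accuracy(ensemble, test))

Because bootstrap sampling is randomized, consider averaging each point over several random seeds before plotting so the curve is not dominated by sampling noise.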