Please submit both code and writeup online by 12:01 am PST on Saturday, May 21, 2016. Please provide all code (sufficiently commented) so that we can run it ourselves. Submit your writeup as a PDF and limit it to four pages with reasonable fonts and margins.
1.0 The dataset we will be using is a subset of the 2005 TREC Public Spam Corpus. You can download it here. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first being the email ID, the second being whether it is spam or ham (non-spam), and the rest being words paired with their occurrence counts in that email. In preprocessing, I removed non-word characters and conducted feature selection similar to what Mehran Sahami did in his original paper on using naive Bayes to classify spam, though with larger cut-offs since our corpus is larger.
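To make the format concrete, here is a minimal sketch of a loader for these files (the function name and the assumption that every word is followed by a well-formed integer count are mine, not part of the assignment):

    def load_corpus(path):
        """Parse one email per line: <id> <label> <word> <count> <word> <count> ..."""
        emails = []
        with open(path) as f:
            for line in f:
                tokens = line.split()
                email_id, label = tokens[0], tokens[1]
                # The remaining tokens alternate between a word and its count.
                counts = {tokens[i]: int(tokens[i + 1])
                          for i in range(2, len(tokens), 2)}
                emails.append((email_id, label, counts))
        return emails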
1.1 Implement the multinomial naive Bayes algorithm to classify spam. Use your algorithm to learn from the training set and report its accuracy on the test set.
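For orientation, here is a minimal sketch of multinomial naive Bayes with add-alpha smoothing, operating on the (id, label, word-count dict) tuples produced by the hypothetical load_corpus above; treat it as a starting point under those assumptions, not a reference solution:

    import math
    from collections import defaultdict

    def train_nb(emails, alpha=1.0):
        """Multinomial naive Bayes with add-alpha (Laplace) smoothing."""
        class_docs = defaultdict(int)                         # documents per class
        word_counts = defaultdict(lambda: defaultdict(int))   # class -> word -> count
        total_words = defaultdict(int)                        # word tokens per class
        vocab = set()
        for _, label, counts in emails:
            class_docs[label] += 1
            for word, n in counts.items():
                word_counts[label][word] += n
                total_words[label] += n
                vocab.add(word)
        n_docs = sum(class_docs.values())
        return {
            "prior": {c: math.log(class_docs[c] / n_docs) for c in class_docs},
            "word_counts": word_counts,
            "total_words": total_words,
            "vocab_size": len(vocab),
            "alpha": alpha,
        }

    def predict_nb(model, counts):
        """Return the most probable class for one email's word-count dict."""
        alpha, V = model["alpha"], model["vocab_size"]
        best, best_lp = None, float("-inf")
        for c, prior in model["prior"].items():
            lp = prior
            for word, n in counts.items():
                # Smoothed likelihood; unseen words get alpha / (N_c + alpha*V).
                p = (model["word_counts"][c][word] + alpha) / \
                    (model["total_words"][c] + alpha * V)
                lp += n * math.log(p)
            if lp > best_lp:
                best, best_lp = c, lp
        return best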
1.2 Try various smoothing parameters for the Naive Bayes learner. Which parameters work best?
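One simple way to run this comparison, reusing the hypothetical load_corpus/train_nb/predict_nb helpers sketched above (the alpha grid and file names here are arbitrary placeholders):

    train = load_corpus("train")   # placeholder file names
    test = load_corpus("test")
    for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
        model = train_nb(train, alpha=alpha)
        correct = sum(predict_nb(model, counts) == label
                      for _, label, counts in test)
        print(f"alpha={alpha}: accuracy={correct / len(test):.4f}")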
Extra credit: Our feature selection makes learning much easier, but it also throws out useful information. For example, the exclamation mark (!) often occurs in spam. Even the format of the email sender matters: when a sender's address appears in the address book, a typical email client will replace it with the contact name, which means the email is unlikely to be spam (unless, of course, you are a friend of the spammer!). Sahami's paper discusses a few such features he used in his classifier. For extra credit, you can play with the original files and come up with useful features that improve your classifier. Here are the lists of the files used in the training and test sets.
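As an illustration only, features along these lines might be extracted from the raw files like so (the raw-file format, the function name, and the specific From: heuristic are all assumptions):

    import re

    def extra_features(raw_email_text):
        """Hypothetical hand-crafted features in the spirit of Sahami et al."""
        return {
            "num_exclamations": raw_email_text.count("!"),
            # A From: line containing only a bare address (no display name)
            # may hint that the sender is not in the recipient's address book.
            "bare_from_address": bool(re.search(r"^From:\s*\S+@\S+\s*$",
                                                raw_email_text, re.MULTILINE)),
        }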
For those who are interested in adversarial classification, an emerging theme in spam detection, here is a nice KDD-04 paper from UW. Also, in 2007, NIPS (a popular machine learning conference) held a workshop on adversarial learning (archive.org: the links don't work, but you can see the titles and authors).
Problem writeup:
2.1 Mitchell 4.5
2.2 Mitchell 4.8
3.0 Implement the bagging algorithm on top of your decision tree learner from assignment one. You should make your code as modular as possible. Namely, your main bagging module should treat the base learner as a black box and communicate with it via a generic interface that takes a set of instances as input and outputs a classifier, which can then classify any instance. Create a wrapper for ID3 that takes a training set of examples and uses sampling with replacement to generate a new training set of the same size; the wrapper then passes this set on to ID3 to obtain a classifier. Plug this wrapper and your ID3 implementation into the bagging algorithm. A sketch of this interface appears below.
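A minimal sketch of that black-box interface, assuming the base learner is any function from a list of examples to a callable classifier (the names here are illustrative, not required):

    import random

    def bootstrap_wrapper(learner, examples):
        """Wrapper: resample the training set with replacement (same size),
        then hand the sample to the base learner (e.g. your ID3)."""
        sample = [random.choice(examples) for _ in examples]
        return learner(sample)

    def bag(learner, examples, n_rounds):
        """Train n_rounds classifiers, each on its own bootstrap sample."""
        return [bootstrap_wrapper(learner, examples) for _ in range(n_rounds)]

    def bagged_predict(classifiers, instance):
        """Majority vote over the ensemble; each classifier is a callable."""
        votes = [clf(instance) for clf in classifiers]
        return max(set(votes), key=votes.count)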
Run experiments using the UCI Diabetes Dataset (train, test). Use information gain as the evaluation criterion and disable pruning for all experiments. Your goal is to predict whether a patient tests positive or negative for diabetes.
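As a reminder, information gain is the entropy of the class labels minus the expected entropy after splitting on an attribute. A minimal sketch (the signatures are mine, not part of the assignment code):

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        """H(S) = -sum_i p_i log2 p_i over the class proportions in labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, attribute_values):
        """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v), where
        attribute_values gives each example's value for attribute A."""
        n = len(labels)
        subsets = defaultdict(list)
        for v, y in zip(attribute_values, labels):
            subsets[v].append(y)
        remainder = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
        return entropy(labels) - remainder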
Background: A description of the Diabetes Dataset can be accessed here. This dataset was gathered by the National Institute of Diabetes and Digestive and Kidney Diseases from a population at high risk of diabetes. The dataset consists of 768 cases, 8 variables, and 2 classes. The variables are medical measurements on the patient plus age and pregnancy information. The classes are: tested positive for diabetes, or tested negative.
3.1 Plot test set accuracies for 1, 3, 5, 10, and 20 bootstrap samplings.
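A sketch of the experiment loop and plot, reusing the hypothetical bag/bagged_predict helpers above; id3, train_examples, and test_examples are placeholders for your assignment-one learner and the Diabetes train/test splits:

    import matplotlib.pyplot as plt

    def evaluate(classifiers, test_set):
        """Fraction of (instance, label) pairs the ensemble's vote gets right."""
        correct = sum(bagged_predict(classifiers, x) == y for x, y in test_set)
        return correct / len(test_set)

    rounds = [1, 3, 5, 10, 20]
    accuracies = [evaluate(bag(id3, train_examples, k), test_examples)
                  for k in rounds]
    plt.plot(rounds, accuracies, marker="o")
    plt.xlabel("Number of bootstrap samplings")
    plt.ylabel("Test set accuracy")
    plt.savefig("bagging_accuracy.png")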