Problem Set 1 - Naive-Bayes Document Classification

Due 4/14/09 before class starts.

To be completed by pairs of students working together

Your goal for this homework is to implement a naive-Bayes classifier for textual documents, train it on a labeled training set of Wikipedia documents selected from six classes, and test the performance of your learned system.

You will be using naive Bayes, following the pseudocode in Manning, Raghavan, and Schutze (see also section 12.1.3 for more on multinomial models). To avoid anomalies when encountering new words (which have zero counts), you should use Laplace ("add one") smoothing as shown in equation 13.7 (pdf version) or 119 (html version).
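
As a rough illustration (not the textbook's exact pseudocode), the training side of a multinomial naive Bayes with add-one smoothing might look like the following Python sketch; the function and variable names (docs_by_class, class_word_counts, etc.) are our own placeholders:

```python
from collections import Counter, defaultdict

def train_multinomial_nb(docs_by_class):
    """docs_by_class: dict mapping class name -> list of token lists.
    Returns class priors, per-class word counts, per-class totals, and the vocabulary."""
    vocab = set()
    class_word_counts = defaultdict(Counter)   # class -> Counter of token counts
    class_total_tokens = Counter()             # class -> total number of tokens
    priors = {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    for cls, docs in docs_by_class.items():
        priors[cls] = len(docs) / total_docs
        for tokens in docs:
            class_word_counts[cls].update(tokens)
        class_total_tokens[cls] = sum(class_word_counts[cls].values())
        vocab.update(class_word_counts[cls])
    return priors, class_word_counts, class_total_tokens, vocab

def cond_prob(word, cls, class_word_counts, class_total_tokens, vocab):
    """Laplace ('add one') smoothed estimate of P(word | class)."""
    return (class_word_counts[cls][word] + 1) / (class_total_tokens[cls] + len(vocab))
```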

Your classifier will use words as features, sum the log-probability scores for each token (rather than multiplying the probabilities directly, which risks numeric underflow), and choose the class with the highest a posteriori score. Please don't use the document (file) names as features.
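
The scoring step could then be sketched as below, reusing the hypothetical names from the training sketch above; again this is just one possible shape, not a required interface:

```python
import math

def classify(tokens, priors, class_word_counts, class_total_tokens, vocab):
    """Return the class with the highest a posteriori score for a list of tokens."""
    best_cls, best_score = None, float('-inf')
    for cls in priors:
        score = math.log(priors[cls])
        for word in tokens:
            # Sum log probabilities instead of multiplying raw probabilities,
            # so long documents don't underflow to zero.
            score += math.log(cond_prob(word, cls, class_word_counts,
                                        class_total_tokens, vocab))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```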

You may use any programming language you like, but we suggest sticking to widespread and portable languages like Java, Python, or Perl.

Datasets

You will find all the training data in the following zip file. WinRAR can decompress .tar.gz files and you can get it for free here.

There is a separate directory for each of the six classes: actor, book, company, stadium, university, and writer. Some of these classes are going to be harder to distinguish than others (e.g. actor, writer and book may be harder to distinguish than book and company), and hopefully that will make it interesting.

Each class directory has a subdirectory (called 'train') containing 200 training examples, each a file corresponding to a Wikipedia page in that class. For example, one file in the actor directory might have the name "Brad_Pitt". In fact, you'll notice that we took the first 200 pages in each class, so most will start with the letter A. Please don't use the filename as a feature. Furthermore, since some of the files were really long, we truncated all files after the first 600 words. Note that some of the filenames (e.g. 2 of the companies and 3 of the books) start with dots, which hides them on Unix-like systems. You can either match them explicitly (e.g. with a dir/.* glob), run your code in Windows, or simply ignore those files, since training on 197 documents will give you essentially the same results.

All of the files have been cleaned and tokenized for your viewing and classifying pleasure. The files have been cleaned of html markup and all punctuation symbols should be represented by their own tokens. For example, the sentence "Albert Einstein hates peas." will be represented in the file as "Albert Einstein hates peas .". Since the identification of the sentence boundaries was done automatically by a sentence classifier, the files may not be perfect. However, they should be good enough for your purposes.
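
Because the files are already tokenized, splitting on whitespace is enough to recover the tokens. A possible loading sketch, assuming the six class directories sit under a hypothetical data/ directory with the 'train' subdirectories described above:

```python
import os

CLASSES = ['actor', 'book', 'company', 'stadium', 'university', 'writer']

def load_training_data(root='data'):
    """Return {class: [(filename, [token, ...]), ...]}, files sorted alphabetically."""
    docs_by_class = {}
    for cls in CLASSES:
        train_dir = os.path.join(root, cls, 'train')
        docs = []
        for name in sorted(os.listdir(train_dir)):
            path = os.path.join(train_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding='utf-8', errors='replace') as f:
                # Files are pre-tokenized, so a whitespace split recovers the tokens.
                tokens = f.read().split()
            docs.append((name, tokens))
        docs_by_class[cls] = docs
    return docs_by_class
```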

Evaluation

To determine how well your classifier works, first run 10-fold cross-validation on the training data and report the resulting accuracy yourselves. Recall that in cross-validation you divide the data into 10 sections, and then train and test your classifier 10 times, each time choosing a different section as the test set and the other 9 as the training set. Since we are giving you data for six different classes, split the data from each class into 10 subsets. Because you are given 200 examples of each class, use successive blocks of 20 files (in alphabetic order) as the cross-validation sections. Thus, when learning a 6-way classifier, each fold has 6 × 180 = 1080 training examples and 6 × 20 = 120 test examples. Your final accuracy is the average over the 10 runs.
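
One way to set up the folds, continuing the sketch above (the docs lists are assumed to already be in alphabetic order, as produced by the loading sketch):

```python
def cross_validation_folds(docs_by_class, num_folds=10):
    """Yield (train, test) pairs; each is a dict {class: [(filename, tokens), ...]}."""
    for fold in range(num_folds):
        train, test = {}, {}
        for cls, docs in docs_by_class.items():
            block = len(docs) // num_folds            # 20 when there are 200 files per class
            start, end = fold * block, (fold + 1) * block
            test[cls] = docs[start:end]               # held-out block of 20 files
            train[cls] = docs[:start] + docs[end:]    # remaining 180 files
        yield train, test
```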

Report the average precision and recall for each class.
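
For a class c, precision is the fraction of documents predicted as c that truly belong to c, and recall is the fraction of documents truly in c that were predicted as c. A sketch of how these could be tallied from (true, predicted) pairs collected over a fold (the function name is a placeholder, not a requirement):

```python
from collections import Counter

def precision_recall(pairs, classes):
    """pairs: iterable of (true_class, predicted_class). Returns {class: (precision, recall)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_cls, pred_cls in pairs:
        if true_cls == pred_cls:
            tp[true_cls] += 1
        else:
            fp[pred_cls] += 1    # predicted as pred_cls, but wrong
            fn[true_cls] += 1    # a true_cls document that was missed
    results = {}
    for cls in classes:
        precision = tp[cls] / (tp[cls] + fp[cls]) if (tp[cls] + fp[cls]) else 0.0
        recall = tp[cls] / (tp[cls] + fn[cls]) if (tp[cls] + fn[cls]) else 0.0
        results[cls] = (precision, recall)
    return results
```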

On Thursday 4/9 we will release a separate test set consisting of a single directory, which contains a mixture of unlabeled files from the same six classes. Please don't access this test set until your algorithm is debugged and working the way you like. Then, as your final test, run on this set and output a comma-separated file with one line per file in the test set, listing the filename and a tag. The tag should be the predicted class name for that file: ACTOR, BOOK, COMPANY, STADIUM, UNIVERSITY, or WRITER.
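
Producing that file could look something like the following; the test directory name, output filename, and classify_fn argument are placeholders, and classify_fn is assumed to return a lowercase class name:

```python
import csv
import os

def write_predictions(test_dir, out_path, classify_fn):
    """Write one 'filename,PREDICTED_CLASS' line for every file in the test directory."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for name in sorted(os.listdir(test_dir)):
            path = os.path.join(test_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding='utf-8', errors='replace') as f:
                tokens = f.read().split()
            writer.writerow([name, classify_fn(tokens).upper()])

# e.g. write_predictions('test', 'test.csv', my_classifier)
```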

Extra Credit

For extra credit, you can try various augmentations to this basic algorithm and see if you can improve your accuracy. For example, you could throw away stop words (very frequent words like "the", "of", "it", etc.) - there are lists of these on the Web, for example here. Or discard very rare words (e.g. numbers or proper names, which are unlikely to generalize well).
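
One possible shape for such a filter is sketched below; the stop-word list, the document_frequency map (token -> number of training documents containing it), and the min_df threshold are all placeholders you would choose yourselves:

```python
STOP_WORDS = {'the', 'of', 'it', 'a', 'an', 'and', 'in', 'to', 'is'}  # placeholder list

def filter_tokens(tokens, document_frequency, min_df=2):
    """Drop stop words and tokens seen in fewer than min_df training documents."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and document_frequency.get(t, 0) >= min_df]
```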

Or try training on a much larger set of documents, perhaps using Hadoop to make it run tractably. (Such a file is available - just email Chloe if you are interested, but be careful: it's really big!) If you do this, you might wish to vary the size of the training set and record the accuracy for differing sizes, e.g. 180 training examples, 500, 1000, or larger.

Or try using a different type of classifier and seeing if it works better than Naive Bayes. An easy way to do this is to use the Weka machine learning package, which contains many learning algorithms in a simple-to-use framework.
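
If you go the Weka route, one common pattern is to convert the documents to ARFF and invoke Weka's command-line interface; the sketch below shells out from Python, and the jar path and ARFF filenames are assumptions, not something we provide:

```python
import subprocess

# Assumes weka.jar is available and that train.arff / test.arff have already been
# generated from your token counts. weka.classifiers.bayes.NaiveBayes is one of
# Weka's built-in learners; -t names the training file and -T the test file.
subprocess.run(['java', '-cp', 'weka.jar',
                'weka.classifiers.bayes.NaiveBayes',
                '-t', 'train.arff', '-T', 'test.arff'],
               check=True)
```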

If you want to do the extra credit, you must first tell us your precision and recall with the basic algorithm, and then your (improved) accuracy using whatever enhancement you tried.

Turn-in Details

This project component is due Tues, 4/14/09 before class.

Please turn in the following: the names of the two class members; your code; your build, config, and run files; and a README that includes a table of precision and recall for each fold and the final average values.

Your readme file should also describe any extra credit options you tried, if any.

Please write the naive-Bayes algorithm yourselves, but if you use any third-party code (perhaps in extra credit) your readme file should list what you used, from where, and what you used it for. Similarly, if you receive significant advice from someone outside your group (besides Chloe and Dan), please describe this interaction briefly.

The readme should also include a sentence or two describing the breakdown of work performed between the two students in the team.

Finally, include a separate file "test.csv" containing the test filenames and tags as specified above.

We will shortly specify the electronic turnin method. Stay tuned.