Problem Set 1 - Naive-Bayes Document Classification

Due 10/12/10 before class starts.

To be completed by pairs of students working together

Your goal for this homework is to implement a naive-Bayes classifier for textual documents, train it on a labeled training set of Wikipedia documents selected from six classes, and test the performance of your learned system.

You will be using naive Bayes, following the pseudocode in Manning, Raghavan, and Schutze (see also section 12.1.3 for more on multinomial models). To avoid anomalies when encountering new words (which have zero counts), you should use Laplace ("add one") smoothing as shown in equation 13.7 (pdf version) or 119 (html version).

Your classifier will use words as features, sum the log-probability scores for each token (rather than multiplying the probabilities directly, which underflows on long documents), and decide which class has the highest a posteriori score. Please don't use the document (file) names as features.
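For concreteness, here is a minimal sketch of multinomial naive Bayes with add-one smoothing and log-probability scoring, written in Python (one of the suggested languages). The function names and the docs_by_class layout are just our illustration, not a required structure. Giving never-before-seen words the smoothed zero-count probability, as the sketch does, is one simple choice; ignoring such words at test time is also common.

    import math
    from collections import Counter

    def train_nb(docs_by_class):
        # docs_by_class: {class_name: [token_list, token_list, ...]}
        vocab = {tok for docs in docs_by_class.values() for doc in docs for tok in doc}
        total_docs = sum(len(docs) for docs in docs_by_class.values())
        log_prior, log_cond = {}, {}
        for c, docs in docs_by_class.items():
            log_prior[c] = math.log(len(docs) / total_docs)
            counts = Counter(tok for doc in docs for tok in doc)
            denom = sum(counts.values()) + len(vocab)   # add-one smoothing: +|V| in denominator
            log_cond[c] = {t: math.log((counts[t] + 1) / denom) for t in vocab}
            log_cond[c][None] = math.log(1.0 / denom)   # score for a word never seen in training
        return log_prior, log_cond

    def classify(tokens, log_prior, log_cond):
        # Sum log probabilities (instead of multiplying, which underflows) and
        # return the class with the highest a posteriori score.
        best_c, best_score = None, float('-inf')
        for c in log_prior:
            score = log_prior[c] + sum(log_cond[c].get(t, log_cond[c][None]) for t in tokens)
            if score > best_score:
                best_c, best_score = c, score
        return best_c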

You may use any programming language you like, but we suggest sticking to widespread and portable languages like Java, Python, or Perl.

Datasets

You will find all the training data in the following archive (a .tar.gz file). WinRAR can decompress .tar.gz files, and you can get it for free here.

There is a separate directory for each of the six classes: actor, book, company, stadium, university, and writer. Some of these classes are going to be harder to distinguish than others (e.g. actor, writer and book may be harder to distinguish than book and company), and hopefully that will make it interesting.

Each class directory has a subdirectory (called 'train') containing 200 training examples, each a file corresponding to a Wikipedia page in that class. For example, one file in the actor directory might have the name "Brad_Pitt". In fact, you'll notice that we took the first 200 pages in each class, so most will start with the letter A. Please don't use the filename as a feature. Furthermore, since some of the files were really long, we truncated all files after the first 600 words. Note that some of the filenames (e.g. 2 of the companies and 3 of the books) start with dots, which makes them hidden by default on Unix-like systems. You can either list them explicitly (e.g. with a .* glob or ls -a), run your code in Windows, or ignore those files, since training on 197 documents will give you essentially the same results.

All of the files have been cleaned and tokenized for your viewing and classifying pleasure. The files have been cleaned of html markup and all punctuation symbols should be represented by their own tokens. For example, the sentence "Albert Einstein hates peas." will be represented in the file as "Albert Einstein hates peas .". Since the identification of the sentence boundaries was done automatically by a sentence classifier, the files may not be perfect. However, they should be good enough for your purposes.
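Because the files arrive pre-tokenized, reading a document reduces to splitting on whitespace. A tiny sketch (the function name is our own):

    def read_tokens(path):
        # Files are already tokenized, so whitespace splitting recovers the tokens,
        # e.g. ['Albert', 'Einstein', 'hates', 'peas', '.']
        with open(path) as f:
            return f.read().split()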

Evaluation

To determine how well your classifier works, you should first use 10-fold cross-validation and report the resulting accuracy yourselves. Recall that in cross-validation, you divide the data into 10 sections, and then you train and test your classifier 10 times, each time choosing a different section as the test set and the other 9 as the training set. Since we are giving you data for six different classes, you should split the data from each class into 10 subsets. Since you are given 200 examples of each class, use successive blocks of 20 files (in alphabetical order) as the cross-validation sections.

Thus, when learning a 6-way classifier, each fold uses 6 x 180 = 1080 training examples and 6 x 20 = 120 test examples.
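One way to build the folds, sketched in Python (the names are illustrative; it assumes the class_dir/train layout described above):

    import os

    def class_folds(class_dir, k=10, fold_size=20):
        # The 200 training files of one class, in alphabetical order,
        # split into 10 successive blocks of 20.
        files = sorted(os.listdir(os.path.join(class_dir, 'train')))
        return [files[i * fold_size:(i + 1) * fold_size] for i in range(k)]

    def cross_validation_splits(class_dirs, k=10):
        # class_dirs: e.g. {'actor': 'actor', 'book': 'book', ...}
        # At fold i, block i of every class is the test set; the other 9 blocks are training data.
        per_class = {c: class_folds(path, k) for c, path in class_dirs.items()}
        for i in range(k):
            test = {c: blocks[i] for c, blocks in per_class.items()}
            train = {c: [f for j, block in enumerate(blocks) if j != i for f in block]
                     for c, blocks in per_class.items()}
            yield train, test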

Your final accuracy is the average of the 10 runs. For problem set 1, all you need to do is report the average precision and recall for each class, and not for the entire dataset overall. (See note 1.)

On Thurs 10/7/10 we will release a separate test set consisting of a single directory, which contains a mixture of unlabeled files from the same six classes. Please don't access this test set until your algorithm is debugged and working the way you like. Then, as your final test, run on this set and output a file test.output containing one line for each file in the test set, listing the filename and a tag separated by '|'. The tag should be the predicted class name for that file: ACTOR, BOOK, COMPANY, STADIUM, UNIVERSITY, or WRITER. For example:

100_Books_by_August_Derleth|BOOK
Aage_Stentoft|ACTOR
Aamir_Khan|ACTOR

The test set for the document classifier is now available here. It is a single folder with a mixture of documents from all the classes (I believe there are 66 from each); the class assignments are not given. The Evaluation section of the homework description explains the format we want your results in.
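A sketch of producing test.output in the filename|TAG format shown above, reusing the read_tokens and classify sketches from earlier (all names are our own illustration):

    import os

    def write_test_output(test_dir, log_prior, log_cond, out_path='test.output'):
        # One line per test file: filename, '|', predicted class name in upper case.
        with open(out_path, 'w') as out:
            for name in sorted(os.listdir(test_dir)):
                tokens = read_tokens(os.path.join(test_dir, name))
                label = classify(tokens, log_prior, log_cond)
                out.write('%s|%s\n' % (name, label.upper()))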

Extra Credit

For extra credit, you can try various augmentations to this basic algorithm and see if you can improve your accuracy. For example, you could throw away stop words (very frequent words like "the", "of", "it", etc.); there are lists of these on the Web, for example here. Or discard very rare words (e.g. numbers, dates, or proper names, which are unlikely to generalize well).
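A possible sketch of that kind of filtering (the threshold and names are our own; the stop-word set would come from a published list or from training-set frequencies):

    from collections import Counter

    def rare_words(all_training_tokens, min_count=2):
        # Words seen fewer than min_count times across the training data.
        counts = Counter(all_training_tokens)
        return {w for w, c in counts.items() if c < min_count}

    def filter_tokens(tokens, stopwords, rare):
        # Drop very frequent (stop) words and very rare words before counting.
        return [t for t in tokens if t.lower() not in stopwords and t not in rare]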

Or try training on a much larger set of documents, perhaps using Hadoop to make it run tractably. (Such a file is available; just email the TA if you are interested, but be careful: it's really big!) If you do this, you might wish to vary the size of the training set and record accuracy for different sizes, e.g. 180 training examples, 500, 1000, or larger sizes.

Or try using a different type of classifier and see if it works better than naive Bayes. An easy way to do this is to use the Weka machine learning package, which contains many learning algorithms in a simple-to-use framework.

If you want to do the extra credit, you must first tell us your precision and recall with the basic algorithm, and then your (improved) accuracy using whatever enhancement you tried.

Turn-in Details

This project component is due on the date listed above, before class.

Please turn in the following: the names of the two team members; your code; and your build, config, and run files. We also need a README, which includes a table of precision and recall for each fold and the final average values.

Your readme file should also describe any extra credit options you tried, if any.

Please write the naive-Bayes algorithm yourselves, but if you use any third-party code (perhaps in extra credit) your readme file should list what you used, from where, and what you used it for. Similarly, if you receive significant advice from someone outside your group (besides the TA and Dan), please describe this interaction briefly.

The readme should also include a sentence or two describing the breakdown of work performed between the two students in the team.

Finally, include a separate file "test.output" containing the test filenames and tags as specified above.

Turn your files in via Catalyst at the assignment dropbox.

Please submit the following files:

  1. plain-text README.txt, containing:
    1. Team member names
    2. 1-2 sentences explaining breakdown of work in your team
    3. Description, as specified, of any outside help received or third-party code used.
    4. Any instructions about your code files or running your code
    5. Table of precision and recall for each fold and the final average values
    6. Extra credit: describe what you did, and the (improved) accuracy results table
  2. Your actual code, build, config and run files.
  3. test.output, the output file generated from the final test set we give you, as specified in the "Evaluation" section of the assignment.

Note 1: Question from a student, and Dan's response.

Student: I'm a little confused about how to compute the precision and recall. If you want the precision and recall for each class, I think I understand that:

Precision_class = (number of documents in the class correctly identified) / (number of files guessed to be of that class)
Recall_class = (number of documents in the class correctly identified) / (number of files actually in that class)

But then what would an overall precision and recall be? My guess is:

Precision = (number of correctly identified documents) / (total number of files)

Or is the general precision and recall the average of the class precisions and recalls?

Dan: Another great question; my apologies for not explaining this in class or the assignment. Your formula is correct for precision, but it doesn't really make sense to compute recall globally if there are more than two classes. I believe that if you average the class precisions (weighting by class size) you will end up with the same value for overall precision (accuracy) as in the formula above. Averaging the recall values would also yield an interesting value; again you'd need to weight by class size. For the problem set, however, all I want is precision and recall for each class.
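A small sketch of the per-class computation Dan describes, in Python (the function and variable names are ours):

    from collections import Counter

    def per_class_precision_recall(gold, predicted, classes):
        # gold and predicted are parallel lists of class labels for the test files.
        correct = Counter(g for g, p in zip(gold, predicted) if g == p)
        pred_counts = Counter(predicted)
        gold_counts = Counter(gold)
        scores = {}
        for c in classes:
            precision = correct[c] / pred_counts[c] if pred_counts[c] else 0.0
            recall = correct[c] / gold_counts[c] if gold_counts[c] else 0.0
            scores[c] = (precision, recall)
        return scores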