Problem Set 2 - Naive-Bayes Document Classification

Due Feb. 8, 11:59pm

Your goal for this homework is to implement a Naive Bayes classifier for textual documents, train it on a labeled training set of Wikipedia documents selected from six classes, and test the performance of your learned system.

The training data is provided in the classifier_data_small_train directory. There is a separate directory for each of the six classes: actor, book, company, stadium, university, and writer. Some of these classes will be harder to distinguish than others (e.g., actor, writer, and book may be harder to tell apart than book and company), and hopefully that will make the task interesting.

Each class directory has a subdirectory (called 'train') containing 200 training examples, each a file corresponding to a Wikipedia page in that class. For example, one file in the actor directory might have the name "Brad_Pitt". In fact, you'll notice that we took the first 200 pages in each class, so most will start with the letter A. Please don't use the filename as a feature. Furthermore, since some of the files were really long, we truncated all files after the first 600 words.

All of the files have been cleaned and tokenized for your viewing and classifying pleasure. The files have been cleaned of html markup and all punctuation symbols should be represented by their own tokens. For example, the sentence "Albert Einstein hates peas." will be represented in the file as "Albert Einstein hates peas .". Since the identification of the sentence boundaries was done automatically by a sentence classifier, the files may not be perfect. However, they should be good enough for your purposes.

The testing data will be used to evaluate your implementation. It is a single folder containing a mixture of documents from all six classes, with no class labels given. (We recommend that you not look at this test data until your algorithm is debugged and working the way you like.)

Your tasks

  • Implement the Naive Bayes classifier: follow the pseudocode in Manning, Raghavan, and Schütze (see also section 12.1.3 for more on multinomial models). To avoid anomalies when encountering new words (which have zero counts), use Laplace (add-one) smoothing, as shown in line 10 of TrainMultinomialNB. Your classifier will use words as features, add the log-probability scores for each token, and decide which class has the highest a posteriori score. (Don't multiply probabilities directly! See line 5 of ApplyMultinomialNB.) Again, please don't use the document (file) names as features.
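    The training and classification steps above can be sketched as follows. This is a minimal illustration, not the required implementation: it assumes each document is already a list of tokens, and the function names mirror (but are not copied from) the textbook pseudocode.

    ```python
    import math
    from collections import Counter

    def train_multinomial_nb(docs_by_class):
        """docs_by_class: {class_name: [token_list, ...]}.
        Returns log-priors, per-class conditional log-probs, and the vocabulary."""
        vocab = {tok for docs in docs_by_class.values() for doc in docs for tok in doc}
        n_docs = sum(len(docs) for docs in docs_by_class.values())
        log_prior, log_cond = {}, {}
        for c, docs in docs_by_class.items():
            log_prior[c] = math.log(len(docs) / n_docs)
            counts = Counter(tok for doc in docs for tok in doc)
            total = sum(counts.values())
            # Laplace (add-one) smoothing: every vocabulary word gets count + 1
            log_cond[c] = {t: math.log((counts[t] + 1) / (total + len(vocab)))
                           for t in vocab}
        return log_prior, log_cond, vocab

    def apply_multinomial_nb(log_prior, log_cond, vocab, doc):
        """Return the class with the highest sum of log-probabilities."""
        scores = {}
        for c in log_prior:
            s = log_prior[c]
            for tok in doc:
                if tok in vocab:  # tokens never seen in training are skipped
                    s += log_cond[c][tok]
            scores[c] = s
        return max(scores, key=scores.get)
    ```

    Note that the scores are summed in log space rather than multiplying raw probabilities, exactly to avoid the underflow the bullet above warns about.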
  • Implement 10-fold cross-validation on the training data: in cross-validation, you divide the data into 10 sections and then train and test your classifier 10 times, each time choosing a different section as the test set and the other 9 as the training set. Since we are giving you data for six different classes, split the data from each class into 10 subsets. Because you are given 200 examples of each class, successive blocks of 20 files (in alphabetical order) are the sections you should use for cross-validation. Thus, when learning a 6-way classifier, each fold would contain 6*180 = 1080 training examples and 6*20 = 120 test examples. Your final precision and recall are the averages over the 10 runs; report the average precision and recall for each class. You can follow the "Definition (classification context)" section of the Wikipedia precision-and-recall page to compute these scores.
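    One way to sketch the fold construction and the per-class precision/recall computation (helper names are our own, not part of the assignment):

    ```python
    def cv_folds(files, k=10):
        """Split an alphabetically sorted file list into k contiguous blocks.
        For 200 files per class this yields successive blocks of 20."""
        files = sorted(files)
        size = len(files) // k
        for i in range(k):
            test = files[i * size:(i + 1) * size]
            train = files[:i * size] + files[(i + 1) * size:]
            yield train, test

    def precision_recall(gold, predicted, cls):
        """Precision and recall for one class, per the standard
        classification-context definitions (TP, FP, FN counts)."""
        tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall
    ```

    Averaging each class's precision and recall over the 10 folds gives the numbers you report.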
  • Run your classifier on the test set: output a file test.output containing one line for each file in the test set, listing the filename and a tag. The tag should be the predicted class name for that file: ACTOR, BOOK, COMPANY, STADIUM, UNIVERSITY, or WRITER. For example:

    100_Books_by_August_Derleth|BOOK
    Aage_Stentoft|ACTOR
    Aamir_Khan|ACTOR
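    Producing this filename|TAG format can be sketched as below; `classify` stands in for whatever prediction function your implementation provides, and whitespace tokenization is an assumption based on the pre-tokenized files.

    ```python
    import os

    def write_predictions(test_dir, classify, out_path="test.output"):
        """classify(tokens) -> class name; writes one 'filename|CLASS'
        line per file in test_dir."""
        with open(out_path, "w") as out:
            for name in sorted(os.listdir(test_dir)):
                with open(os.path.join(test_dir, name)) as f:
                    tokens = f.read().split()  # files are already tokenized
                out.write("%s|%s\n" % (name, classify(tokens).upper()))
    ```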
    
  • Improve your Naive Bayes classifier (extra credit):
    1. stop words: you could try throwing away stop words (very frequent words like "the", "of", "it", etc.). You can use the provided list.
    2. feature selection: eliminate the most infrequent words, keeping only the K most frequent words as features. Try a few values of K using cross-validation and, once you have found the value that seems best, apply it on the test set.
    3. smoothing: instead of Laplace (add-one) smoothing, try a few other prior values, use cross-validation to find what seems best, and apply that on the test set.
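    The three extra-credit ideas can each be sketched in a few lines; these helpers are illustrations under the assumption that documents are token lists, not a prescribed design.

    ```python
    import math
    from collections import Counter

    def remove_stop_words(doc, stop_words):
        """Idea 1: drop very frequent function words before counting."""
        return [tok for tok in doc if tok not in stop_words]

    def top_k_vocab(docs, k):
        """Idea 2: keep only the k most frequent tokens as features."""
        counts = Counter(tok for doc in docs for tok in doc)
        return {tok for tok, _ in counts.most_common(k)}

    def smoothed_log_prob(count, total, vocab_size, alpha=1.0):
        """Idea 3: generalized smoothing. alpha=1 is Laplace add-one;
        try smaller (or larger) alpha values via cross-validation."""
        return math.log((count + alpha) / (total + alpha * vocab_size))
    ```

    With `alpha` as a parameter, a cross-validation sweep over a handful of values (e.g. 0.01, 0.1, 0.5, 1) is enough to pick the prior you then apply to the test set.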
  • Turn in the following:
    1. Your code, with appropriate comments to facilitate understanding. Include a script (named "run.sh" for Linux/Mac users and "run.bat" for Windows users) to run your program. The script takes three arguments, in this order: the directory containing the training data, the directory containing the test data, and the name of the output file. We should be able to execute
      ./run.sh classifier_data_small_train classifier_data_test test.output
      Note: please keep the same directory structure in your working directory as in the directory we give you. For example, executing
      cat classifier_data_small_train/actor/train/Aaron_Yoo
      in your working directory should print the contents of that file.
    2. The test.output file predicting the classes of the test data we give you. If you attempt the extra credit, do not apply your optimizations to this file; instead, turn in a separate test.extra.output showing your improvements, and explain in your run.sh how to turn these extra-credit features on and off.
    3. A report of at most 4 pages (letter size, 1 inch margins, 12pt font) that describes:
      • A high-level description of how your code works.
      • A table of precision and recall for 10-fold cross-validation: each fold and the final average values.
      • Extra credit: describe what you did, your best K, your best smoothing prior, and the new precision/recall table for 10-fold cross-validation. If your precision and recall are low, tell us what you tried in order to improve performance and what you suspect is failing.