Problem Set 2 - Naive-Bayes Document Classification
Due Feb. 8, 11:59pm
Your goal for this homework is to implement a Naive Bayes
classifier for textual documents, train it on a labeled training
set of Wikipedia documents selected from six classes, and test the
performance of your learned system.
You will find the training data provided with this assignment.
There is a separate directory for each of the six classes: actor,
book, company, stadium, university, and writer. Some of these
classes are going to be harder to distinguish than others (e.g.
actor, writer and book may be harder to distinguish than book and
company), and hopefully that will make it interesting.
Each class directory has a subdirectory (called 'train') containing
200 training examples, each a file corresponding to a Wikipedia page
in that class. For example, one file in the actor directory might
have the name "Brad_Pitt". In fact, you'll notice that we took the
first 200 pages in each class, so most will start with the letter A.
Please don't use the filename as a feature. Furthermore, since some
of the files were really long, we truncated all files after the
first 600 words.
All of the files have been cleaned and tokenized for your viewing
and classifying pleasure. The files have been cleaned of html markup
and all punctuation symbols should be represented by their own
tokens. For example, the sentence "Albert Einstein hates peas."
will be represented in the file as "Albert Einstein hates peas .".
Since the identification of the sentence boundaries was done
automatically by a sentence classifier, the files may not be
perfect. However, they should be good enough for your purposes.
The testing data
will be used to evaluate your implementation. It is a single folder
containing a mixture of documents from all six classes, with no class
assignments given. (We recommend that you not access the test data
until your algorithm is debugged and working the way you like.)
Your tasks
Implement the Naive Bayes classifier: follow the pseudocode
in Manning, Raghavan, and Schutze (see also section 12.1.3 for more
on multinomial models). To avoid anomalies when encountering new
words (which have zero counts), use Laplace
(add-one) smoothing, as shown in line 10 of
TrainMultinomialNB. Your classifier will use words as features, add
the log-probability scores for each token, and choose the class with
the highest a-posteriori score. (Don't multiply probabilities
directly! See line 5 of ApplyMultinomialNB.) Please don't use the
document (file) names as features.
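As a concrete reference point, a minimal sketch of the train/apply pair is shown below. It follows the textbook pseudocode (add-one smoothing at training time, summed log-probabilities at apply time); the function and variable names are illustrative, not required by the assignment.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial NB model. docs: list of (label, tokens) pairs.
    Returns (priors, cond, vocab), where cond[c][w] is P(w | c) with
    add-one (Laplace) smoothing, mirroring line 10 of TrainMultinomialNB."""
    n_docs = defaultdict(int)
    counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        n_docs[label] += 1
        counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n_docs[c] / len(docs) for c in n_docs}
    cond = {}
    for c in counts:
        denom = sum(counts[c].values()) + len(vocab)  # add-one smoothing
        cond[c] = {w: (counts[c][w] + 1) / denom for w in vocab}
    return priors, cond, vocab

def apply_multinomial_nb(priors, cond, vocab, tokens):
    """Pick the class with the highest log a-posteriori score,
    summing log-probs as in line 5 of ApplyMultinomialNB."""
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            if w in vocab:                  # words never seen in training are skipped
                score += math.log(cond[c][w])
        if score > best_score:
            best, best_score = c, score
    return best
```

Note that summing logs instead of multiplying probabilities avoids floating-point underflow on 600-word documents.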
Implement 10-fold cross-validation on training data: in cross-validation, you divide up the data into 10 sections, and then you train and test your classifier 10 times, each time choosing a different section as the test set and the other 9 as the training set.
Since we are giving you data for six different classes, you should split the data from each class into 10 subsets. Since you are given 200 examples of each class, successive blocks of 20 files (in alphabetical order) are the sections you should use for cross-validation. Thus, when learning a 6-way classifier, each fold would have 6 * 180 training examples and 6 * 20 test examples. Your final precision and recall are the averages over the 10 runs. Report the average precision and recall for each class. You can follow the "Definition (classification context)" section of the Wikipedia page on precision and recall to compute the scores.
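The fold construction and per-class scoring described above might be sketched as follows; `make_folds` and `precision_recall` are hypothetical helper names, and the classifier itself is assumed to exist elsewhere.

```python
def make_folds(files_by_class, fold, block=20):
    """Hold out block `fold` (0-9) of each class's alphabetically sorted
    file list as the test section; the other 9 blocks form the training set."""
    train, test = [], []
    for cls, files in files_by_class.items():
        files = sorted(files)
        lo, hi = fold * block, (fold + 1) * block
        test += [(cls, f) for f in files[lo:hi]]
        train += [(cls, f) for f in files[:lo] + files[hi:]]
    return train, test

def precision_recall(pairs, cls):
    """pairs: list of (true_class, predicted_class).
    precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec
```

You would run this once per fold (fold = 0..9) and average the per-class precision and recall across the 10 runs.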
Run your classifier on test set: you should output a file test.output containing one line for each file in the test set which lists the filename and a tag. This tag should be the predicted classname for that file: ACTOR, BOOK, COMPANY, STADIUM, UNIVERSITY,or WRITER. For example:
100_Books_by_August_Derleth|BOOK
Aage_Stentoft|ACTOR
Aamir_Khan|ACTOR
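Writing the output in this "filename|TAG" format is a one-liner per file; here `predictions` is a hypothetical list of (filename, class) pairs produced by your classifier.

```python
# Minimal sketch of emitting test.output; `predictions` stands in for
# whatever structure your classifier actually produces.
predictions = [("100_Books_by_August_Derleth", "BOOK"), ("Aage_Stentoft", "ACTOR")]
with open("test.output", "w") as out:
    for filename, cls in predictions:
        out.write(f"{filename}|{cls.upper()}\n")
```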
Improve your Naive Bayes classifier (extra credit):
- Stop words: you could try throwing away stop words (very frequent words like "the", "of", "it", etc.). You can use any standard stop-word list.
- Feature selection: eliminate the most infrequent words by keeping only K words as features. Try a few values of K using cross-validation, and once you have found what seems best, apply it on the test set.
- Smoothing: instead of Laplace (add-one) smoothing, you could try a few other prior values, use cross-validation to find what seems best, and apply it on the test set.
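The feature-selection and smoothing variants both reduce to small changes in the counting step. A hedged sketch, with illustrative helper names:

```python
from collections import Counter

def top_k_features(corpus_counts, k):
    """Keep only the k most frequent words (over the whole training
    corpus) as features."""
    return {w for w, _ in corpus_counts.most_common(k)}

def smoothed_probs(class_counts, vocab, alpha=1.0):
    """Add-alpha smoothing: alpha=1.0 reproduces Laplace; smaller values
    (e.g. 0.1 or 0.5) are worth trying via cross-validation."""
    denom = sum(class_counts.values()) + alpha * len(vocab)
    return {w: (class_counts.get(w, 0) + alpha) / denom for w in vocab}
```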
Turn in the following:
- Your code. Your code should contain appropriate comments
to facilitate understanding. Include a script (named "run.sh"
for Linux/Mac users and "run.bat" for Windows users) to run your program.
The script takes three arguments, in this order: the name
of the directory containing the training data; the name of the directory
containing the test data; the name of the output file. We should be able to
execute
./run.sh classifier_data_small_train classifier_data_test test.output
Note: please keep the directory structure in your working
directory the same as the structure of the directory we give you. For
example, executing cat classifier_data_small_train/actor/train/Aaron_Yoo
in your working directory should print the contents of that file.
- test.output: the output file predicting the classes of the test
data we give you. If you do the extra credit, don't include your
optimizations in this file. Instead, turn in a separate
test.extra.output showing your improvement, and explain in your run.sh how to turn the extra-credit features on and off.
- A report of at most 4 pages (letter size, 1-inch margins,
12pt font) that describes:
- A high-level description of how your code works.
- A table of precision and recall for 10-fold cross-validation: each fold and the final average values.
- Extra credit: describe what you did, your best K, your best smoothing prior,
and the new precision/recall table for 10-fold cross-validation.
If your precision and recall are low, tell us what you
tried to improve the performance and what you suspect is failing.