Due Date: Tue, Nov 30, 2004 at 6:30 PM
The project is to be done individually.
The project is to implement text classification algorithms and
apply them to a standard data set, the Reuters collection of news
stories. The goal is to predict the topic of the each news story in the
test set, given the topics of the stories in the training set. You will
implement the following algorithms.
- Naive Bayes
- K-nearest neighbor
Naive Bayes should be implemented as described in class and in Section
6.10 of Mitchell. Be sure to implement Naive Bayes as a sum of logarithms
of probabilities, rather than a product of probabilities, to avoid numeric
underflow problems.
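As a concrete illustration of the log-sum trick, here is a minimal sketch of multinomial Naive Bayes with Laplace smoothing (the smoothing choice and the representation of documents as token lists are assumptions, not requirements of the assignment):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (topic, tokens) pairs. Returns log priors,
    per-class log likelihoods, and the vocabulary."""
    class_counts = Counter(topic for topic, _ in docs)
    word_counts = defaultdict(Counter)  # topic -> word -> count
    vocab = set()
    for topic, tokens in docs:
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    log_prior = {c: math.log(class_counts[c] / n_docs) for c in class_counts}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Laplace (add-one) smoothing -- an assumed but common choice
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify_nb(tokens, log_prior, log_lik, vocab):
    """Score each class as a SUM of log probabilities rather than a
    product of probabilities, so long documents do not underflow."""
    best, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
        if score > best_score:
            best, best_score = c, score
    return best
```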
For k-nearest neighbor, you should try the following
representations/distance measures.
- Hamming distance: between binary vectors, where each bit of a
document's vector represents whether the corresponding word/token
appears in the document.
- Euclidean distance: between numeric vectors, where each coordinate
is the number of times the word/token occurs in the document.
- Cosine distance (often used in information retrieval): the
dot product of the word-count vectors described above, divided by the
product of their norms.
You should try k = 1, k = 3 and k = 5 with
each of the distance measures above.
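The three measures above can be sketched as follows, representing a document either as a set of its words (for Hamming) or as a dict mapping word to count (for Euclidean and cosine); these representations are one possible choice, not mandated:

```python
import math

def hamming(a, b):
    """Distance between binary presence vectors, stored as word sets:
    the number of positions where the bits differ."""
    return len(a ^ b)  # size of the symmetric difference

def euclidean(a, b):
    """Distance between count vectors stored as dicts word -> count."""
    words = set(a) | set(b)
    return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

def cosine_similarity(a, b):
    """Dot product of the count vectors divided by the product of their
    norms. Note this is a SIMILARITY: for k-NN, the nearest neighbors
    are the documents with the HIGHEST value."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```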
Feel free to experiment with additional
variations of the algorithms, and to investigate any other relevant
design choices. Please describe these (along with their justification
and evaluation) in your report.
Turn in the following:
- Your code, and reasonable documentation for it (i.e., enough for us
to understand how it works and how to use it). Please place this
documentation in a file named README.
- A report of at most 3 pages. Include a description
of your architecture, approach, extra features, bugs and any problems
you had. Report the test-set accuracy of each algorithm that you
tried, and its standard deviation. Comment on the results. Which
algorithm(s) did best? Why? When does each algorithm tend to do well
versus poorly? How would you further improve your best text classifier?
Word, PS, PDF, HTML, or plain text formats are all fine for the report.
Turn-in procedure:
Follow the link for turning in your project. Make a compressed archive
of all your code and the report, and submit this single archived file.
The deadline for submission is before class on Tue, Nov 30.
Files:
Format:
The format of the test and training files is the same. Each file contains
a set of examples delimited by two blank lines. Each example is composed
of the following parts:
- topic classification.
- blank line
- story title
- blank line
- story location, date
- blank line
- story text
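A minimal parser for this format might split on the double blank line between examples and on the single blank lines between parts; the field names below are assumptions based on the description above:

```python
def parse_examples(raw):
    """Parse the training/test file format: examples are separated by
    two blank lines, and within an example, single blank lines separate
    the topic, title, location/date, and story text."""
    examples = []
    for chunk in raw.split("\n\n\n"):  # two blank lines between examples
        if not chunk.strip():
            continue
        parts = [p.strip() for p in chunk.split("\n\n")]  # one blank line between parts
        if len(parts) >= 4:
            examples.append({
                "topic": parts[0],
                "title": parts[1],
                "location_date": parts[2],
                # story text may itself contain blank lines; rejoin the rest
                "text": "\n\n".join(parts[3:]),
            })
    return examples
```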
URLs:
- Training Data
- Test Data
- Topics (the list of topics used for classifying the texts)
- The original source for this data is the UCI Knowledge Discovery
in Databases Archive.
- The Reuters-21578 Text Categorization Collection was the collection used
to generate the test and training data above. The SGML was converted to
XML, cleaned up, and queried using XPath to generate a simple text file
format. To access the collection, click here.
We may ask you to do a demo of your system and/or an oral discussion.
Good luck, and have fun!