CSEP 546 - Data Mining - Spring 2003 - Project 2:

Text Classification

Due Date: Wed, May 28, 2003 at 6:30 PM

Form a group of two, and email Parag (parag@cs.washington.edu) by Monday the 19th, saying who your group is. Please use "Project2: Group" as the subject of your email.

The project is to implement text classification algorithms and apply them to a standard data set, the Reuters collection of news stories. The goal is to predict the topic of the each news story in the test set, given the topics of the stories in the training set. You will implement the following algorithms.

Naive Bayes
K-nearest neighbor

Naive Bayes should be implemented as described in class and in Section 6.10 of Mitchell. Be sure to implement Naive Bayes as a sum of logarithms of probabilities, rather than a product of probabilities, to avoid numeric underflow problems.

For k-nearest neighbor, you should try the following representations/distance measures.

Hamming distance: between binary vectors, where each bit of a document's vector represents whether the corresponding word/token appears in the document.
Euclidean distance: between numeric vectors, where each coordinate is the number of times the word/token occurs in the document.
Cosine distance (often used in information retrieval): dot product of the vectors in (b) above, divided by the product of their norms.

You should try k = 1, k = 3 and k = 5 with each of the distance measures above.

Feel free to experiment with additional variations of the algorithms, and to investigate any other relevant design choices. Please describe these (along with their justification and evaluation) in your report.

Turn in the following:

Your code, and reasonable documentation for it (i.e., enough for us to understand how it works and how to use it). Please place this documentation in a file named README.
A report of at most 4 pages. Include a description of your architecture, approach, extra features, bugs and any problems you had. Report the test-set accuracy of each algorithm that you tried, and its standard deviation. Comment on the results. Which algorithm(s) did best? Why? When does each algorithm tend to do well versus poorly? How would you further improve your best text classifier?
A breakdown of what each group member did in the project.

Turn-in procedure: Follow the link for turning in your project. The deadline for submission is before class on Wed, May 7. Again, Word/PS/PDF/HTML/plain text files are ok for the report.

Files:

Format:

The format of the test and training files is the same. Each file contains a set of examples delimited by two blank lines. Each example is composed of the following parts:

topic classification.
blank line
story title
blank line
story location, date
blank line
story text

Updated: 5/22, Thursday

Training/Test data have been cleaned of some of the format problems that they had. In particular, the articles with no title and text have been removed from training/test data. Note that if you have already read in the old version of the data and want to work with that, you are free to do that. The links to both the versions are available below.

URLs:

Training Data

Updated Version

Old Version
Test Data

Updated Version

Old Version
These are the topics used for classifying the texts:
The original source for this data is the UCI Knowledge Discovery in Databases Archive
The Reuters-21578 Text Categorization Collection was the collection used to generate the test and training data above. The SGML was converted to XML, cleaned up, and queried using XPath to generate a simple text file format. To access the collection, Click here.

We may ask you to do a demo of your system and/or an oral discussion.

Good luck, and have fun!