Due Date: Wed, May 28, 2003 at 6:30 PM
Form a group of two, and email Parag (parag@cs.washington.edu) by
Monday the 19th, saying who your group is. Please use "Project2: Group" as the subject of your email.
The project is to implement text classification algorithms and
apply them to a standard data set, the Reuters collection of news
stories. The goal is to predict the topic of each news story in the
test set, given the topics of the stories in the training set. You will
implement the following algorithms.
- Naive Bayes
- K-nearest neighbor
Naive Bayes should be implemented as described in class and in Section
6.10 of Mitchell. Be sure to implement Naive Bayes as a sum of logarithms
of probabilities, rather than a product of probabilities, to avoid numeric
underflow problems.
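For concreteness, here is a minimal sketch of the log-space classification
step in Python. The data structures (log_prior, log_cond, vocab) are
illustrative assumptions, not a required design:

    def classify(doc_tokens, log_prior, log_cond, vocab):
        """Return the topic maximizing log P(topic) + sum of log P(word | topic).

        log_prior[t]   -- log prior probability of topic t
        log_cond[t][w] -- log P(w | t), Laplace-smoothed over the vocabulary
        """
        best_topic, best_score = None, float('-inf')
        for topic in log_prior:
            # Summing logs avoids the numeric underflow that multiplying
            # thousands of small probabilities would cause.
            score = log_prior[topic]
            for w in doc_tokens:
                if w in vocab:          # skip tokens never seen in training
                    score += log_cond[topic][w]
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic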
For k-nearest neighbor, you should try the following
representations/distance measures (a code sketch follows the list).
- (a) Hamming distance: between binary vectors, where each bit of a
document's vector represents whether the corresponding word/token
appears in the document.
- (b) Euclidean distance: between numeric vectors, where each coordinate
is the number of times the word/token occurs in the document.
- (c) Cosine distance (often used in information retrieval): the
dot product of the vectors in (b) above, divided by the
product of their norms. (Note that this quantity is a similarity,
so the nearest neighbors are those with the largest values.)
You should try k = 1, k = 3 and k = 5 with
each of the distance measures above.
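As a rough sketch (in Python; the count-dictionary representation and
function names are our assumptions, not a required design), the three
measures and the k-NN majority vote might look like this:

    import math
    from collections import Counter

    def hamming(a, b, vocab):
        # a, b: sets of words present in each document (binary vectors)
        return sum((w in a) != (w in b) for w in vocab)

    def euclidean(a, b):
        # a, b: dicts mapping word -> number of occurrences in the document
        words = set(a) | set(b)
        return math.sqrt(sum((a.get(w, 0) - b.get(w, 0)) ** 2 for w in words))

    def cosine_distance(a, b):
        # One minus the cosine of the angle between the count vectors in (b);
        # the dot product only needs the words the two documents share.
        dot = sum(a[w] * b.get(w, 0) for w in a)
        norms = (math.sqrt(sum(v * v for v in a.values())) *
                 math.sqrt(sum(v * v for v in b.values())))
        return (1.0 - dot / norms) if norms else 1.0

    def knn_predict(test_doc, train_docs, train_labels, k, dist):
        # Majority vote over the k training documents nearest to test_doc.
        nearest = sorted(range(len(train_docs)),
                         key=lambda i: dist(test_doc, train_docs[i]))[:k]
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

For Hamming distance the documents would be represented as word sets and the
vocabulary bound in, e.g. dist=lambda a, b: hamming(a, b, vocab).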
Feel free to experiment with additional
variations of the algorithms, and to investigate any other relevant
design choices. Please describe these (along with their justification
and evaluation) in your report.
Turn in the following:
- Your code, and reasonable documentation for it (i.e., enough for us
to understand how it works and how to use it). Please place this
documentation in a file named README.
- A report of at most 4 pages. Include a description
of your architecture, approach, extra features, bugs and any problems
you had. Report the test-set accuracy of each algorithm that you
tried, and its standard deviation (one way to compute these is
sketched after this list). Comment on the results. Which
algorithm(s) did best? Why? When does each algorithm tend to do well
versus poorly? How would you further improve your best text classifier?
- A breakdown of what each group member did in the project.
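For the accuracy and standard deviation requested above, one reasonable
choice (an assumption on our part; other estimators are fine if you justify
them) is the binomial standard deviation of the test-set accuracy:

    import math

    def accuracy_and_std(predictions, truth):
        # Accuracy p is the fraction of correct predictions; under a binomial
        # model its standard deviation is sqrt(p * (1 - p) / n).
        n = len(truth)
        p = sum(y == t for y, t in zip(predictions, truth)) / n
        return p, math.sqrt(p * (1 - p) / n)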
Turn-in procedure:
Follow the link for turning in your project. The deadline for
submission is Wed, May 28, at 6:30 PM. Word/PS/PDF/HTML/plain
text files are ok for the report.
Files:
Format:
The format of the test and training files is the same. Each file contains
a set of examples delimited by two blank lines. Each example is composed
of the following parts (a parsing sketch follows the list):
- topic classification.
- blank line
- story title
- blank line
- story location, date
- blank line
- story text
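A minimal reader for this format might look like the following sketch
(the function name and the exact field handling are our assumptions):

    def read_examples(path):
        """Yield (topic, title, dateline, text) tuples from one data file.

        Examples are delimited by two blank lines; within an example, the
        parts are separated by single blank lines.
        """
        with open(path) as f:
            raw = f.read()
        for example in raw.split('\n\n\n'):      # two blank lines
            parts = [p.strip() for p in example.split('\n\n') if p.strip()]
            if len(parts) >= 4:
                topic, title, dateline = parts[0], parts[1], parts[2]
                # Re-join in case the story text itself contains blank lines.
                yield topic, title, dateline, '\n\n'.join(parts[3:])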
Updated: Thursday, 5/22. The training/test data have been cleaned of some
of their format problems. In particular, articles with no title and no text
have been removed from the training/test data. If you have already read in
the old version of the data, you are free to keep working with it. Links to
both versions are available below.
URLs:
- Training Data (Updated Version, Old Version)
- Test Data (Updated Version, Old Version)
- These are the topics used for classifying the texts.
- The original source for this data is the UCI Knowledge Discovery in
Databases Archive.
- The Reuters-21578 Text Categorization Collection was used to generate
the test and training data above. The SGML was converted to XML, cleaned
up, and queried using XPath to generate a simple text file format.
We may ask you to give a demo of your system and/or take part in an oral
discussion.
Good luck, and have fun!