Due Date: Wed, May 7, 2003 at 6:30 PM (The sixth week of class)
- Form a group of two, and email Parag (parag@cs.washington.edu) by
Monday, April 21 saying who is in your group.
- Implement the ID3 decision tree learner with reduced-error pruning, as
described in Chapter 3 of Mitchell (a sketch of the split criterion is
given below). You may program in C/C++, Java, LISP, or Scheme; if you'd
like to use another language, ask Parag first.
A bunch of C support code for reading input and running experiments is
available in the UWML package.
It's entirely up to you whether to use UWML or not. Please note
that UWML is not industrial-strength code; it is still under
development and may have bugs, rough parts, etc.
Please send comments, questions, and bug reports (concerning UWML only)
to Geoff Hulten (ghulten@cs.washington.edu), the author of the software.
Your learner should read in a training set and a test set in C4.5 format.
Note that UWML provides code for reading this format: UWML C4.5 interface.
Your program should output the decision tree in a readable format of your
choice, as well as its accuracy on the test data. You don't need to spend
a lot of time on a fancy interface; output like C4.5's is fine
(see example).
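To make the split criterion concrete, here is a minimal sketch in C of the
entropy and information-gain computations ID3 uses to pick the attribute
to test at each node. The data layout (integer-coded attribute values, a
binary class, the MAX_VALUES cap) is an assumption for illustration, not
part of UWML:

    #include <math.h>   /* log2; link with -lm */

    #define MAX_VALUES 32   /* assumed cap on distinct values per attribute */

    /* Entropy of a binary class distribution:
     * -p log2(p) - (1-p) log2(1-p). */
    double entropy(int pos, int neg)
    {
        double p, q, total = pos + neg;
        if (pos == 0 || neg == 0)
            return 0.0;
        p = pos / total;
        q = neg / total;
        return -p * log2(p) - q * log2(q);
    }

    /* Information gain of splitting n examples on one attribute.
     * value[i] is example i's value for that attribute (0..nvalues-1),
     * label[i] is its class (0 or 1). */
    double info_gain(const int *value, const int *label, int n, int nvalues)
    {
        int vpos[MAX_VALUES] = {0}, vneg[MAX_VALUES] = {0};
        int pos = 0, neg = 0, i, v;
        double rem = 0.0;   /* expected entropy after the split */

        for (i = 0; i < n; i++) {
            if (label[i]) { pos++; vpos[value[i]]++; }
            else          { neg++; vneg[value[i]]++; }
        }
        for (v = 0; v < nvalues; v++) {
            int nv = vpos[v] + vneg[v];
            if (nv > 0)
                rem += (double)nv / n * entropy(vpos[v], vneg[v]);
        }
        return entropy(pos, neg) - rem;
    }

ID3 grows the tree by choosing the attribute with the highest gain and
recursing on each value's subset. For reduced-error pruning, hold out part
of the training data as a validation set; after growing the tree,
repeatedly consider replacing each internal node with its majority class,
keep the replacement that most improves validation accuracy, and stop when
no replacement helps (see Mitchell, Chapter 3).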
You can handle numeric attributes by pre-discretizing them into equal-sized
bins, and missing values by filling in the most frequent value given the
class (or the average, for numeric attributes); a sketch of both steps
follows.
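Here is one way those two preprocessing steps could look in C. Equal-width
binning is assumed as the reading of "equal-sized bins" (equal-frequency
bins are another valid reading), and the array sizes and layout are
illustrative:

    #define MAX_VALUES 32   /* same illustrative cap as in the sketch above */

    /* Map a numeric value into one of nbins equal-width bins over
     * [min, max], clamping out-of-range values into the end bins. */
    int discretize(double x, double min, double max, int nbins)
    {
        int bin;
        if (max <= min)
            return 0;                      /* degenerate attribute */
        bin = (int)((x - min) / (max - min) * nbins);
        if (bin < 0) bin = 0;
        if (bin >= nbins) bin = nbins - 1; /* x == max lands here */
        return bin;
    }

    /* Fill a missing discrete value with the most frequent value among
     * training examples of the same class. count[c][v] holds how often
     * value v occurs with class c (built in one pass over the data). */
    int most_frequent_value(int cls, int nvalues, int count[][MAX_VALUES])
    {
        int v, best = 0;
        for (v = 1; v < nvalues; v++)
            if (count[cls][v] > count[cls][best])
                best = v;
        return best;
    }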
You will probably want to test your decision tree learner on a small,
easy-to-understand data set before trying the KDD Cup dataset. A large
number of data sets in C4.5 format are available in one package.
You can also find the original datasets at the
UCI Machine Learning Repository.
You may even want to construct a very simple data set based on a boolean
formula, like the example below.
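For instance, a toy data set for the formula "class is yes iff a AND b"
might look like this in C4.5 format (two files; the names and values here
are made up, so check them against the C4.5 interface documentation linked
above):

    toy.names:
        | class values first, then one line per attribute
        yes, no.
        a: t, f.
        b: t, f.

    toy.data:
        | one example per line: attribute values in order, class last
        t,t,yes
        t,f,no
        f,t,no
        f,f,no

On a set like this the learned tree is easy to verify by hand before you
move to the real data.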
- Read the KDD Cup 2000 competition report, and browse the online
documentation.
- Download the cleaned-up training and test sets we have produced (data
sets now available here), and apply your decision tree learner to them.
The prediction task is KDD Cup's Question 1: given a set of page views,
will the visitor view another page on the site, or will the visitor leave?
To start, use this processed data set.
- Try to improve the decision tree's predictive accuracy by modifying the
data. For example, you can try constructing new attributes from the
existing ones and augmenting the examples with them (a sketch of this
follows below). You can also try going back to the original clickstream
data (available at the URL above) and creating new attributes directly
from it. (Warning: the original data set is very large.)
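As an illustration of constructed attributes, here is a tiny C sketch that
appends a derived attribute to each example. The attribute indices and the
derivation rule are hypothetical; which derived attributes actually help
with Question 1 depends on the real schema:

    #define N_ATTRS 10   /* assumed original attribute count */

    /* Append one derived attribute to an integer-coded example.
     * Here the new attribute is the conjunction of two existing ones,
     * which an ID3 tree would otherwise need two levels to express. */
    void add_derived_attr(int example[N_ATTRS + 1])
    {
        example[N_ATTRS] = (example[3] != 0 && example[7] != 0);
    }

Since ID3 tests a single attribute per node, conjunctions, counts, and
ratios built ahead of time can make the target concept expressible with a
much shallower tree.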
Since the full data set is so large, we will be placing a copy of it on a
file server, which you can access with your CS account. This way your
program can read the file directly, without requiring you to store a copy
(note that this could be significantly slower than working from a local
copy); a streaming-read sketch follows.
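Reading the shared copy a line at a time keeps memory use flat no matter
how large the file is. A minimal C sketch, where the path is a placeholder
for wherever the course copy ends up:

    #include <stdio.h>

    /* Stream examples from the shared file instead of loading it whole. */
    void scan_examples(const char *path)
    {
        char line[4096];
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return; }
        while (fgets(line, sizeof line, f)) {
            /* parse one example here and update whatever you're counting */
        }
        fclose(f);
    }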
A bunch of potentially useful pointers and some free software can be found at
KD nuggets.
- Turn in the following:
- Your code, and reasonable documentation for it (i.e., enough for us
to understand how it works and how to use it). Please place this
documentation in a file named README.
- A report of at most 5 pages (letter size, 1in margins, 12pt font)
describing the improvements you tried and why, the accuracies you
obtained with the various versions, and what you found (i.e., what
you now know about answering Question 1 that you didn't before).
- A breakdown of what each group member did in the project.
Turn-in procedure: follow the link for turning in your project. The
deadline for submission is before class on Wed, May 7. Word, PS, PDF,
HTML, or plain text files are fine for the report.
- We may ask you for a demo of your system and/or an oral discussion.
Good luck, and have fun!