Due Date: Tue, Nov 9, 2004 at 6:30 PM (The sixth week of class)
- The project is to be done individually.
Read the KDD Cup 2000 competition report,
and browse the online documentation.
Download the cleaned-up training and test sets we have produced.
Data sets now available here.
Browse through the VFML
(Very Fast Machine Learning Toolkit) page. Go to the "Modules" section and,
within it, to the "learning programs" sub-section. Your job is to apply
vfdt (very fast decision tree learner) to this data. vfdt is described in this
paper. vfdt reads
in a training and test set in
C4.5 format. The prediction task is KDD Cup's Question 1: Given a set of page
views, will the visitor view another page on the site or will the visitor leave? For a start,
use this processed data set.
C support code for running experiments is also available
in VFML.
Please note that VFML is not industrial-strength code; it is still being
developed, and may have bugs, rough parts, etc.
Please send comments, questions, and bug reports (only concerning VFML) to Geoff Hulten,
the author of the software.
You will probably want to test the tree learner on a small, easy-to-understand data set
before trying the KDD Cup dataset. A large number of data sets in the C4.5 format are
in one package.
You can also find the original datasets at the
UCI Machine Learning Repository.
You may even want to construct a very simple data set based on a boolean formula.
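
As a rough illustration of such a toy data set, the sketch below enumerates every truth assignment of a three-variable formula and writes it out in the usual C4.5 layout (a .names file listing the class values and attributes, plus .data and .test files with one comma-separated example per line and the class label last). The formula, the attribute names a, b, c, the class labels, and the file stem are all made up for this example; check the VFML documentation for the exact C4.5 dialect that vfdt accepts.

```python
import itertools
from pathlib import Path


def target(a: bool, b: bool, c: bool) -> bool:
    """Hypothetical target concept: (a AND b) OR (NOT c)."""
    return (a and b) or (not c)


def c45_names() -> str:
    """Contents of the .names file: class values first, then one line per attribute."""
    lines = ["yes, no."]  # the two class values
    for attr in ("a", "b", "c"):
        lines.append(f"{attr}: t, f.")  # each attribute is discrete with values t and f
    return "\n".join(lines) + "\n"


def c45_rows() -> list[str]:
    """One comma-separated row per truth assignment, with the class label last."""
    rows = []
    for a, b, c in itertools.product([True, False], repeat=3):
        values = ["t" if v else "f" for v in (a, b, c)]
        label = "yes" if target(a, b, c) else "no"
        rows.append(",".join(values + [label]))
    return rows


def write_dataset(stem: str, directory: str = ".") -> None:
    """Write stem.names, stem.data and stem.test into the given directory."""
    out = Path(directory)
    body = "\n".join(c45_rows()) + "\n"
    (out / f"{stem}.names").write_text(c45_names())
    (out / f"{stem}.data").write_text(body)
    # The toy test set simply repeats the training examples.
    (out / f"{stem}.test").write_text(body)
```

Calling write_dataset("boolean") produces boolean.names, boolean.data, and boolean.test in the current directory. Because the concept is noise-free and every assignment is present, a tree that does not reach 100% accuracy on this data points to a setup problem rather than a modeling problem.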
Try to improve the decision tree's predictive accuracy by modifying the
data. For example, you can try constructing new attributes from the
existing ones and augmenting the examples with them. You can also try going
back to the original clickstream data (available at the URL above) and
creating new attributes directly from it. (Warning: the original data set
is very large.)
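
One possible way to augment examples with a constructed attribute is sketched below. It assumes each example is a comma-separated C4.5 row with the class label last and no trailing period, and it derives a ratio of two numeric columns (for instance, a hypothetical "seconds in session" divided by a hypothetical "pages viewed"). The column positions and the choice of a ratio are illustrative assumptions, not attributes taken from the actual data set. Remember that the .names file needs a matching new attribute line, in the same position, for every column you add.

```python
def add_ratio(row: list[str], num_idx: int, den_idx: int) -> list[str]:
    """Return a copy of row with a num/den ratio inserted just before the class label.

    A missing value ("?") or a zero denominator yields "?", the C4.5 marker
    for an unknown attribute value.
    """
    *attrs, label = row
    num, den = row[num_idx], row[den_idx]
    if "?" in (num, den) or float(den) == 0.0:
        ratio = "?"
    else:
        ratio = f"{float(num) / float(den):.4f}"
    return attrs + [ratio, label]


def augment_line(line: str, num_idx: int, den_idx: int) -> str:
    """Apply add_ratio to one comma-separated line of a .data or .test file."""
    fields = [field.strip() for field in line.strip().split(",")]
    return ",".join(add_ratio(fields, num_idx, den_idx))
```

Apply the same transformation to both the training and the test file, so that the two stay consistent with the .names file.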
The full data set is so large that we will place a copy of it on a file
server, which you can access with your CS account. This way your program
can read the file directly, without requiring you to store a copy. (Note
that this could be significantly slower than working from a local copy.)
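
If you do read the full file directly, it helps to process it one line at a time instead of loading it into memory. The following is a minimal sketch of that pattern. It assumes a comma-separated file whose first field is a session identifier; that layout is an assumption made for illustration, so consult the KDD Cup documentation for the real format.

```python
import csv
from collections import Counter
from typing import Iterable


def requests_per_session(lines: Iterable[str], session_col: int = 0) -> Counter:
    """Count records per session while holding only one line in memory at a time.

    `lines` can be an open file object; csv.reader pulls rows from it lazily.
    """
    counts: Counter = Counter()
    for record in csv.reader(lines):
        if record:  # csv.reader yields an empty list for a blank line
            counts[record[session_col]] += 1
    return counts
```

Typical use would be `with open(path, newline="") as f: counts = requests_per_session(f)`, where path is wherever the shared copy is mounted. The counter still keeps one entry per session, which is much smaller than the raw clickstream but not constant-size.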
A bunch of potentially useful pointers and some free software can be found at
Turn in a report of at most 3 pages (letter size, 1in margins, 12pt font)
describing what you did, the improvements you tried and why, the accuracies
you obtained with the various versions, and what you found (i.e., what
you know about answering Question 1 that you didn't before).
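
When reporting accuracies, a simple reference point makes the numbers easier to interpret. The sketch below is a suggestion, not a required part of the assignment: it computes plain accuracy and the accuracy of always predicting the most common training class. The label strings in the usage note are placeholders.

```python
from collections import Counter
from typing import Sequence


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label sequences of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def majority_baseline(train_labels: Sequence[str], test_labels: Sequence[str]) -> float:
    """Test accuracy of always predicting the most frequent training label."""
    if not train_labels:
        raise ValueError("need at least one training label")
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy([majority] * len(test_labels), test_labels)
```

For example, majority_baseline(["leave", "leave", "stay"], ["leave", "stay", "stay", "leave"]) gives 0.5. A tree whose accuracy is close to this baseline has learned little beyond the class distribution.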
Turn-in procedure: Email your
report to firstname.lastname@example.org
before class on November 9. Word, PostScript, PDF, HTML, and plain text
formats are all acceptable.
Please use the subject "CSEP546: PROJ1 Submission", and in the text
part of the message include your name and student ID.
You can also submit a hardcopy of your work at the beginning of
the class on November 9.
We may ask you for an oral discussion.
Good luck, and have fun!