CSEP 546 - Data Mining - Autumn 2004 - Project 1:

Clickstream Mining

Due Date: Tue, Nov 9, 2004 at 6:30 PM (The sixth week of class)

  1. The project is to be done individually.

  2. Read the KDD Cup 2000 competition report, and browse the online documentation.

  3. Download the cleaned-up training and test sets we have produced. The data sets are now available here. Browse through the VFML (Very Fast Machine Learning Toolkit) page. Go to the "Modules" section and, within it, to the "learning programs" sub-section. Your job is to apply vfdt (very fast decision tree learner) to this data. vfdt is described in this paper. vfdt reads in a training and test set in C4.5 format. The prediction task is KDD Cup's Question 1: Given a set of page views, will the visitor view another page on the site or will the visitor leave? For a start, use this processed data set.

    C support code for running experiments is also available in VFML. Please note that VFML is not industrial-strength code; it is still being developed and may have bugs and rough edges. Please send comments, questions and bug reports (only concerning VFML) to Geoff Hulten (ghulten@microsoft.com), the author of the software.

    You will probably want to test the tree learner on a small, easy-to-understand data set before trying the KDD Cup data set. A large number of data sets in the C4.5 format are available in one package. You can also find the original data sets at the UCI Machine Learning Repository. You may even want to construct a very simple data set based on a boolean formula.
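
    A boolean-formula data set of the kind suggested above could be generated along these lines. This is a minimal sketch, not part of VFML: the formula, the attribute names a, b, c, and the file stem "toy" are all made up for illustration, and it assumes the usual C4.5 layout (a .names file listing the classes and then the attributes, and .data/.test files with one comma-separated example per line, class label last).

```python
import random

def target(a, b, c):
    # Hypothetical concept to learn: (a AND b) OR (NOT c)
    return (a and b) or (not c)

def fmt(value):
    return "true" if value else "false"

def names_text():
    # First entry: the class values. Then one line per attribute.
    lines = ["true, false.", ""]
    for attr in ("a", "b", "c"):  # illustrative attribute names
        lines.append(f"{attr}: true, false.")
    return "\n".join(lines) + "\n"

def make_rows(n, rng):
    # Each row is "a,b,c,class", with the class computed from the formula.
    rows = []
    for _ in range(n):
        a, b, c = (rng.random() < 0.5 for _ in range(3))
        rows.append(",".join(fmt(v) for v in (a, b, c, target(a, b, c))))
    return rows

def write_dataset(stem, n_train=200, n_test=100, seed=0):
    # Writes stem.names, stem.data (training) and stem.test.
    rng = random.Random(seed)
    with open(stem + ".names", "w") as f:
        f.write(names_text())
    for suffix, n in ((".data", n_train), (".test", n_test)):
        with open(stem + suffix, "w") as f:
            f.write("\n".join(make_rows(n, rng)) + "\n")

if __name__ == "__main__":
    write_dataset("toy")  # "toy" is an arbitrary file stem
```

    Because the target concept is known exactly, a learned tree that fails to reach near-perfect test accuracy on such data points to a setup or formatting problem rather than a hard learning task.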

  4. Try to improve the decision tree's predictive accuracy by modifying the data. For example, you can try constructing new attributes from the existing ones and augmenting the examples with them. You can also try going back to the original clickstream data (available at the URL above) and creating new attributes directly from it. (Warning: the original data set is very large.)

    Since the full data set is so large, we will be placing a copy of it on a file server, which you can access with your CS account. This way your program can read the file directly, without requiring you to store a copy. (Note that this could be significantly slower than working from a local copy.)

    Potentially useful pointers and some free software can be found at KDnuggets.
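
    As one way to approach the attribute construction described in step 4, the sketch below streams through a C4.5-style data file one line at a time (so a large file never has to be held in memory) and inserts a derived attribute just before the class label. Everything specific here is an assumption for illustration: the derived attribute "views per minute", the column positions 0 and 1, and the file names are hypothetical and do not describe the actual KDD Cup columns. It also assumes plain comma-separated values with "?" for unknowns and no trailing period.

```python
def views_per_minute(attrs, views_idx=0, seconds_idx=1):
    # Hypothetical derived attribute: page views per minute of session time.
    # The column indices are placeholders, not the real data layout.
    try:
        views = float(attrs[views_idx])
        seconds = float(attrs[seconds_idx])
    except ValueError:
        return "?"  # an input is unknown or non-numeric
    if seconds <= 0:
        return "?"
    return f"{60.0 * views / seconds:.3f}"

def augment_line(line, derive):
    # Split one example, compute the new value from the existing
    # attributes, and place it just before the class label.
    fields = [f.strip() for f in line.strip().split(",")]
    *attrs, label = fields
    return ",".join(attrs + [derive(attrs), label])

def augment_file(src, dst, derive):
    # Stream line by line; blank lines are skipped.
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.strip():
                fout.write(augment_line(line, derive) + "\n")

if __name__ == "__main__":
    # Hypothetical file names; apply the same transformation to the test set.
    augment_file("train.data", "train_aug.data", views_per_minute)
```

    The .names file needs a matching declaration appended after the existing attributes (for a numeric attribute such as this one, a line like "views_per_minute: continuous."), and the training and test sets must be augmented identically.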

  5. Turn in a report of at most 3 pages (letter size, 1in margins, 12pt font) describing what you did, the improvements you tried and why, the accuracies you obtained with the various versions, and what you found (i.e., what you know about answering Question 1 that you didn't before).

    Turn-in procedure: Email your report to parag@cs.washington.edu before class on November 9. Word, PostScript, PDF, HTML, and plain text formats are all acceptable.

    Please use the subject "CSEP546: PROJ1 Submission", and in the text part of the message include your name and student id.

    You can also submit a hard copy of your work at the beginning of class on November 9.

  6. We may ask you for an oral discussion.

Good luck, and have fun!