CSEP 546 - Data Mining - Spring 2003 - Project 1:

Clickstream Mining

Due Date: Wed, May 7, 2003 at 6:30 PM (the sixth week of class)

  1. Form a group of two, and email Parag (parag@cs.washington.edu) by Monday the 21st to say who is in your group.

  2. Implement the ID3 decision tree learner with reduced-error pruning, as described in Chapter 3 of Mitchell. You may program in C/C++, Java, LISP, or Scheme. If you'd like to use another language, ask Parag first. A bunch of C support code for doing input and running experiments is available in the UWML package. It's entirely up to you whether to use UWML or not. Please note that UWML is not industrial-strength code; it is still being developed, and may have bugs, rough parts, etc.

    Please send comments, questions, and bug reports (only concerning UWML) to Geoff Hulten (ghulten@cs.washington.edu), the author of the software. Your learner should read in a training and test set in C4.5 format. Note that UWML provides code for reading this format: UWML C4.5 interface. Your program should output the decision tree in a readable format of your choice, as well as its accuracy on the test data. You don't need to spend a lot of time making a fancy interface; output like C4.5's is fine (see example). You can handle numeric attributes by pre-discretizing them into equal-sized bins, and missing values by filling them in with the most frequent value given the class (or the average, for numeric values).
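
    Those two preprocessing steps can stay simple. Below is a sketch in Java that assumes "equal-sized bins" means equal-width intervals, and that missing values are marked "?" as in C4.5 data files; both are assumptions, so adapt them as needed.

      import java.util.*;

      class Preprocess {

          // Map a numeric value into one of `bins` equal-width intervals on [min, max].
          static int bin(double x, double min, double max, int bins) {
              if (max == min) return 0;                    // constant attribute
              int b = (int) ((x - min) / (max - min) * bins);
              return Math.min(Math.max(b, 0), bins - 1);   // clamp x == max and outliers
          }

          // Replace "?" in column `attr` with the most frequent value among training
          // examples of the same class. rows holds attribute values only; labels is
          // the parallel list of class labels.
          static void fillMissing(List<String[]> rows, int attr, List<String> labels) {
              Map<String, Map<String, Integer>> counts = new HashMap<>();
              for (int i = 0; i < rows.size(); i++) {
                  String v = rows.get(i)[attr];
                  if (!v.equals("?"))
                      counts.computeIfAbsent(labels.get(i), c -> new HashMap<>())
                            .merge(v, 1, Integer::sum);
              }
              for (int i = 0; i < rows.size(); i++) {
                  if (!rows.get(i)[attr].equals("?")) continue;
                  Map<String, Integer> c = counts.get(labels.get(i));
                  if (c == null) continue;                 // class never observed: leave "?"
                  rows.get(i)[attr] = Collections.max(
                          c.entrySet(), Map.Entry.comparingByValue()).getKey();
              }
          }
      }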
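
    As for the learner itself, if it helps to see the shape of the algorithm before you start, here is a minimal sketch of ID3 induction in (modern) Java. Every name in it (ID3Sketch, Example, Node, and so on) is invented for illustration and has nothing to do with the UWML API; treat it as a starting point, not a reference implementation.

      import java.util.*;

      class ID3Sketch {

          // An example: its discrete attribute values plus a class label.
          record Example(String[] values, String label) {}

          // A node either tests one attribute or, once prediction is set, is a leaf.
          static class Node {
              int attr = -1;                     // index of the attribute tested here
              String prediction;                 // non-null iff this node is a leaf
              String majority;                   // majority class of the training
                                                 // examples that reached this node
              Map<String, Node> children = new HashMap<>();
              boolean isLeaf() { return prediction != null; }
          }

          // Entropy of a set of examples: minus the sum over classes of p * log2(p).
          static double entropy(List<Example> ex) {
              Map<String, Integer> counts = new HashMap<>();
              for (Example e : ex) counts.merge(e.label(), 1, Integer::sum);
              double h = 0;
              for (int k : counts.values()) {
                  double p = (double) k / ex.size();
                  h -= p * Math.log(p) / Math.log(2);
              }
              return h;
          }

          // Information gain of splitting ex on attribute attr.
          static double gain(List<Example> ex, int attr) {
              double rem = 0;
              for (List<Example> part : split(ex, attr).values())
                  rem += (double) part.size() / ex.size() * entropy(part);
              return entropy(ex) - rem;
          }

          // Partition the examples by their value of one attribute.
          static Map<String, List<Example>> split(List<Example> ex, int attr) {
              Map<String, List<Example>> parts = new HashMap<>();
              for (Example e : ex)
                  parts.computeIfAbsent(e.values()[attr], v -> new ArrayList<>()).add(e);
              return parts;
          }

          // Plain ID3: grow the tree greedily, choosing the highest-gain attribute.
          static Node grow(List<Example> ex, Set<Integer> attrs) {
              Node n = new Node();
              n.majority = majorityLabel(ex);
              if (entropy(ex) == 0 || attrs.isEmpty()) {
                  n.prediction = n.majority;     // pure node, or no tests left
                  return n;
              }
              n.attr = Collections.max(attrs,
                      Comparator.comparingDouble(a -> gain(ex, a)));
              Set<Integer> rest = new HashSet<>(attrs);
              rest.remove(n.attr);
              for (Map.Entry<String, List<Example>> p : split(ex, n.attr).entrySet())
                  n.children.put(p.getKey(), grow(p.getValue(), rest));
              return n;
          }

          static String majorityLabel(List<Example> ex) {
              Map<String, Integer> c = new HashMap<>();
              for (Example e : ex) c.merge(e.label(), 1, Integer::sum);
              return Collections.max(c.entrySet(), Map.Entry.comparingByValue()).getKey();
          }

          // Walk the tree; fall back to the majority class for unseen values.
          static String classify(Node n, Example e) {
              while (!n.isLeaf()) {
                  Node child = n.children.get(e.values()[n.attr]);
                  if (child == null) return n.majority;
                  n = child;
              }
              return n.prediction;
          }
      }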
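
    Reduced-error pruning, as Mitchell describes it, then needs a held-out validation set and a way to score the tree on it: repeatedly replace the internal node whose conversion to a majority-class leaf does not hurt validation accuracy, and stop once every remaining candidate does hurt it. Continuing the sketch above (again, all names are invented):

      // These methods belong inside the ID3Sketch class sketched above.

      // Fraction of the given examples the current tree classifies correctly.
      static double accuracy(Node root, List<Example> ex) {
          int right = 0;
          for (Example e : ex)
              if (classify(root, e).equals(e.label())) right++;
          return (double) right / ex.size();
      }

      // Reduced-error pruning: greedily turn internal nodes into majority-class
      // leaves for as long as validation accuracy does not decrease.
      static void prune(Node root, List<Example> validation) {
          while (true) {
              double best = accuracy(root, validation);
              Node bestNode = null;
              for (Node n : internalNodes(root)) {
                  n.prediction = n.majority;     // tentatively prune this node
                  double acc = accuracy(root, validation);
                  n.prediction = null;           // undo the tentative prune
                  if (acc >= best) { best = acc; bestNode = n; }
              }
              if (bestNode == null) return;      // every candidate prune hurts: stop
              bestNode.prediction = bestNode.majority;
          }
      }

      // All non-leaf nodes of the (sub)tree rooted at n.
      static List<Node> internalNodes(Node n) {
          List<Node> out = new ArrayList<>();
          if (!n.isLeaf()) {
              out.add(n);
              for (Node c : n.children.values()) out.addAll(internalNodes(c));
          }
          return out;
      }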

    You will probably want to test your decision tree learner on a small, easy-to-understand data set before trying the KDD Cup data set. A large number of data sets in the C4.5 format are available in one package. You can also find the original data sets at the UCI Machine Learning Repository. You may even want to construct a very simple data set based on a boolean formula, as in the example below.
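
    For instance, the formula class = (a AND b) fits in two tiny files. In the C4.5 format, the .names file lists the classes on its first line and then one line per attribute, and the .data file holds one comma-separated example per line with the class label last (the file names here are arbitrary):

      toy.names:

        true, false.          | the two classes come first
        a: t, f.              | one line per attribute, with its legal values
        b: t, f.

      toy.data:

        t,t,true
        t,f,false
        f,t,false
        f,f,false

    A correct learner should induce a small tree that tests a and then b, and score 100% when this data set doubles as its own test set.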

  3. Read the KDD Cup 2000 competition report, and browse the online documentation.

  4. Download the cleaned-up training and test sets we have produced (the data sets are available here). Apply your decision tree learner to this data. The prediction task is KDD Cup's Question 1: given a set of page views, will the visitor view another page on the site, or will the visitor leave? To start, use this processed data set.

  5. Try to improve the decision tree's predictive accuracy by modifying the data. For example, you can try constructing new attributes from the existing ones and augmenting the examples with them (see the sketch at the end of this item). You can also try going back to the original clickstream data (available at the URL above) and creating new attributes directly from it. (Warning: the original data set is very large.)

    Since the full data set is so large, we will place a copy of it on a file server, which you can access with your CS account. This way your program can read the file directly, without requiring you to store a copy. (Note that this could be significantly slower than working from a local copy.)

    A bunch of potentially useful pointers and some free software can be found at KDnuggets.
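
    As one purely illustrative pattern for the attribute construction suggested in this item: compute a new column from two existing ones and append it before handing the data to the learner. The columns assumed below (a page-view count and a total duration in seconds) are invented; map the indices to whatever the processed data set actually contains.

      import java.util.*;

      class DeriveAttributes {

          // Append a derived attribute (average seconds per page view) to every row.
          // `views` and `secs` are the indices of two hypothetical existing columns;
          // rows hold attribute values only, with class labels kept separately.
          static List<String[]> addAvgTimePerPage(List<String[]> rows, int views, int secs) {
              List<String[]> out = new ArrayList<>();
              for (String[] r : rows) {
                  String[] ext = Arrays.copyOf(r, r.length + 1);
                  double v = Double.parseDouble(r[views]);
                  ext[r.length] = (v == 0) ? "?"   // no page views: mark as missing
                          : String.valueOf(Double.parseDouble(r[secs]) / v);
                  out.add(ext);
              }
              return out;
          }
      }

    Since the derived value is numeric, it would then go through the same binning step as any other continuous attribute.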

  6. Turn in the following:
    • Your code, and reasonable documentation for it (i.e., enough for us to understand how it works and how to use it). Please place this documentation in a file named README.

    • A report of at most 5 pages (letter size, 1in margins, 12pt font) describing the improvements you tried and why, the accuracies you obtained with the various versions, and what you found (i.e., what you now know about answering Question 1 that you didn't before).

    • A breakdown of what each group member did in the project.

    Turn-in procedure: follow the link for turning in your project. The deadline for submission is before class on Wed, May 7. Word, PS, PDF, HTML, and plain-text files are all OK for the report.

  7. We may ask you to give a demo of your system and/or have an oral discussion with us.

Good luck, and have fun!