CSE P546 Data Mining - Spring 2007 - Project 1

Clickstream Mining

Due Date: Wed, May 2, 2007. We would prefer that you turn in a hard copy of your report at the start of class. Otherwise, you can email it to Bhushan. Your report must have your name at the top, and can be in PDF, Word, or plain-text format.

  1. The project is based on a task posed in KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com, which sells legwear products. Please browse the KDD Cup website to understand the domain, and read the organizers' report. Your task is Question 1 from the KDD Cup: Given a set of page views, will the visitor view another page on the site or will he leave?
  2. Download the cleaned-up training and test datasets from here.
  3. You will use the Weka data mining package for this project. Browse the Weka website, and look at their documentation. In particular, the user guide for the Weka Explorer will help you get started quickly. You can download a small dataset from here to familiarize yourself with Weka before you crunch the dataset for this project.
  4. Apply the following classification algorithms to this problem. You have been given separate training and test datasets. Train each classifier on the former, and report the accuracy you obtain on the latter. Understand the parameters of the various classifiers, and experiment with them to see how performance is affected and what works best.
    1. Decision Trees: The J48 classifier available in weka.classifiers.trees is a variant of the popular C4.5 decision tree algorithm.
    2. Naïve Bayes: Available in weka.classifiers.bayes.
    3. Rule Learners: A variety of rule learners are provided in weka.classifiers.rules. Choose one of these.
  5. Try to improve the predictive accuracy you obtained above. There are many approaches you could try. You could construct new attributes from the existing ones and augment the examples with them. You could try alternative classification techniques, or modify one that you used above (the source code for Weka can be downloaded from their website). You could also go to the raw clickstream data provided on the KDD Cup website (this is separate from the data aggregated by session) and create new attributes directly from it. Note that this dataset is very large. A number of potentially useful pointers and free software can be found at KDnuggets.
  6. Turn in a report of at most 4 pages (letter size, 1 inch margins, 12pt font) describing what you did, the improvements you tried and why, and the accuracies you obtained with the various versions. Also describe what insights you obtained about the given prediction task.
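
When comparing the accuracies of your various versions, it can help to compute them the same way each time and to set them against a simple reference point. The sketch below is a minimal, standalone Java illustration and is not part of Weka: the class name AccuracyReport, its method names, and the toy labels "leave" and "continue" are all hypothetical. It computes test-set accuracy as the fraction of correct predictions, and, as one possible reference point, the accuracy of always predicting the most frequent training label.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AccuracyReport {

    /** Fraction of test examples whose predicted label equals the actual label. */
    public static double accuracy(List<String> actual, List<String> predicted) {
        if (actual.isEmpty() || actual.size() != predicted.size()) {
            throw new IllegalArgumentException("label lists must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.size(); i++) {
            if (actual.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / actual.size();
    }

    /** Test-set accuracy of always predicting the most frequent training label. */
    public static double majorityBaseline(List<String> trainLabels, List<String> testLabels) {
        if (trainLabels.isEmpty() || testLabels.isEmpty()) {
            throw new IllegalArgumentException("label lists must be non-empty");
        }
        // Count how often each label occurs in the training data.
        Map<String, Integer> counts = new HashMap<>();
        for (String label : trainLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        // Pick the label with the highest count (ties are broken arbitrarily).
        String majority = null;
        int best = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                majority = entry.getKey();
            }
        }
        // Score that constant prediction against the test labels.
        int hits = 0;
        for (String label : testLabels) {
            if (label.equals(majority)) {
                hits++;
            }
        }
        return (double) hits / testLabels.size();
    }

    public static void main(String[] args) {
        // Toy labels for illustration only; these are not taken from the project data.
        List<String> train = Arrays.asList("leave", "leave", "continue", "leave");
        List<String> test = Arrays.asList("leave", "continue", "continue", "leave");
        List<String> predicted = Arrays.asList("leave", "continue", "leave", "leave");

        System.out.println("classifier accuracy: " + accuracy(test, predicted));
        System.out.println("majority baseline:   " + majorityBaseline(train, test));
    }
}
```

A version whose accuracy barely exceeds the baseline has learned little beyond the class distribution, which is worth noting when you discuss your results.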

Good luck, and have fun!