Due Date: Wed, May 2, 2007. We
would prefer that you turn in a hard copy of your report at the start
of class. Otherwise, you can email it to Bhushan. Your report must
contain your name at the top, and can be in any of pdf, Word or
- The project is based on a task
posed in KDD Cup 2000. It involves mining clickstream data collected
from Gazelle.com, which sells legwear products. Please browse the KDD
Cup website to
understand the domain, and read the organizer's report. Your task is
Question 1 from the KDD Cup: Given a set of page views, will the
visitor view another page on the site or will he leave?
- Download the cleaned-up training
and test datasets from here.
- You will use the Weka data
mining package for this project. Browse the Weka website, and look
at their documentation.
In particular, the user guide for the Weka Explorer will help you get
started quickly. You can download a small dataset from here
to familiarize yourself with Weka before you crunch the dataset for
- Apply the following
classification algorithms on this problem. You have been given separate
training and test datasets. Train your classifier on the former, and
report the accuracy you obtain on the latter. Understand the parameters
of the various classifiers, and experiment with them to see how
performance is affected and what works best.
- Decision Trees: The J48
classifier available in weka.classifiers.trees is a variant of the
popular C4.5 decision tree algorithm.
- Naïve Bayes: Available in
- Rule Learners: A variety of
rule learners are provided in weka.classifiers.rules. Choose one of
- Try to improve the predictive
accuracy you obtained above. There are many approaches you could try.
You could construct new attributes from the existing ones and augment
the examples with them. You could try alternative classification
techniques, or modify one that you used above (the source code for Weka
can be downloaded from their website). You could also go to the raw
clickstream data provided on the KDD Cup website (this is separate from
the data aggregated by session) and create new attributes directly from
it. Note that this dataset is very large. A bunch of potentially useful
pointers and free software can be found at KDnuggets.
- Turn in a report of at most 4
pages (letter size, 1 inch margins, 12pt font) describing what you did,
the improvements you tried and why, and the accuracies you obtained
with the various versions. Also describe any insights you obtained about the
given prediction task.
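When reporting accuracies, it can help to compare each classifier against a trivial baseline that always predicts the most common training label. The sketch below is only an illustration in Python and is not part of Weka or of the required workflow. The function names and the toy labels ("leave", "continue") are assumptions made up for the example, not values taken from the Gazelle data.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of test examples whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return accuracy([majority_label] * len(test_labels), test_labels)


# Toy labels invented for illustration only (hypothetical, not real data).
train_labels = ["leave", "leave", "continue", "leave"]
test_labels = ["leave", "continue", "leave", "continue", "leave"]
model_predictions = ["leave", "continue", "continue", "continue", "leave"]

print("classifier accuracy:", accuracy(model_predictions, test_labels))
print("majority baseline:  ", majority_baseline(train_labels, test_labels))
```

A classifier whose test accuracy is not clearly above the majority baseline has probably learned little beyond the class distribution, which is worth noting in the report.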
Good luck, and have fun!