Assignment 1: Clickstream Mining and Rule Induction

Please submit both code and writeup online by 12:01am PST on Saturday, April 23, 2016. Please provide all code (sufficiently commented) so that we can run it ourselves. Submit your writeup as a PDF and limit to four pages with reasonable fonts and margins.

1.0 The project is based on a task posed in KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com, which sells legware products. Please browse the KDD Cup website to understand the domain, and read the organizer's report. Your task is Question 1 from the KDD Cup: Given a set of page views, will the visitor view another page on the site or will he leave?

1.1 Download the discretized (no numerical attributes), cleaned-up training and test datasets from here. The data is in Attribute-Relation File Format (ARFF), which is documented here.

1.2 Implement the ID3 decision tree learner, as described in Chapter 8 of Duda et al. or Chapter 3 of Mitchell. Another very good description of the algorithm can be found in the original ID3 paper [1], which is pretty readable (pdf). You may program in C/C#/C++, Python or Java. Your program should assume input in the above format. For initial debugging, it is recommended that you construct a very simple data set (e.g., based on a boolean formula) and test your program on it.1.3 Implement split stopping using chi-square test. The test-statistic is:

\[ \sum_{i=1}^v \frac{(p_i-p'_i)^2}{p'_i} + \frac{(n_i-n'_i)^2}{n'_i} \]

where \(p_i\) is the number of positive/true instances (i.e. customers who will stay on the website) in the i-th child, \(p'_i\) is the expected number of positive instances in the i-th child, \(n_i\) is the number of negative instances in the i-th child and \(n_i\) is the expected number of negative instances in the i-th child. The degree of freedom is v-1. For more detailed explanation, see 'page 93' of the original ID3 paper [1]. Duda et al. 8.3.3 also provides some details on chi-square test. (If you forgot all you learned in Stat 101, check out Appendix A.6 in Duda et al.)

1.4 Use your algorithm to train a decision tree classifier and report accuracy on the test set. Compare accuracies by varying the confidence level in determining split stopping. Use 99%, 95% and 0% (i.e., you always grow the full tree).

**Problem writeup: **

- A high-level description on how your code works.
- The accuracies you obtain under various settings.
- Explain which options work well and why.
- Try to interpret your best performing tree. What path from root to leaf seems to explain the largest fraction of the true-labeled training data? False-labeled? Express these paths as English sentences; do they make sense?
- If all your accuracies are low, tell us what you have tried to improve the accuracies and what you suspect is failing.

**Extra-credit:** Try to improve the predictive accuracy you obtained above. There are many approaches you could try. You could construct new attributes from the existing ones and augment the examples with them. You could try alternative classification techniques, or modify one that you used above. You could also go to the raw clickstream data provided on the KDD Cup website (this is separate from the data aggregated by session) and create new attributes directly from it. Note that this dataset is very large. A bunch of potentially useful pointers and free software can be found at KDnuggets.

[1] J. R. Quinlan, 'Induction of decision trees.' Machine Learning 1 (1) (1986): 81-106 (pdf)

Mitchell 10.1 Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of *n* boolean attributes. If ID3 learns a balanced decision tree of depth *d*, it will contain \( 2^d-1\) distinct decisions nodes, and therefore will have made \( 2^d-1 \) distinct choices while constructing its output hypothesis. How many rules will be formed if this tree is re-expressed as a disjunctive set of rules? How many preconditions will each rule possess? How many distinct choices would a sequential covering algorithm have to make to learn this same set of rules? Which system do you suspect would be more prone to overfitting if both were given the same training data?