Assignment 1: Clickstream Mining and Rule Induction

Please submit both code and writeup online by 12:01am PST on Monday, April 24, 2017. Please provide all code (sufficiently commented) so that we can run it ourselves. Submit your writeup as a PDF and limit to four pages with reasonable fonts and margins.

1.0 The project is based on a task posed in KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com, which sells legware products. Please browse the KDD Cup website to understand the domain, and read the organizer's report. Your task is Question 1 from the KDD Cup: Given a set of page views, will the visitor view another page on the site or will he leave?

1.1 Download the discretized (no numerical attributes), cleaned-up training and test datasets from here. The data is in Attribute-Relation File Format (ARFF), which is documented here.

1.2 Implement the ID3 decision tree learner, as described in Chapter 8 of Duda et al. or Chapter 3 of Mitchell. Another very good description of the algorithm can be found in the original ID3 paper [1], which is pretty readable (pdf). You may program in C/C++, Python, Java or MATLAB. Your program should assume input in the above format. For initial debugging, it is recommended that you construct a very simple data set (e.g., based on a boolean formula) and test your program on it.1.3 Implement split stopping using chi-square test. The test-statistic is:

\[ \sum_{i=1}^v \frac{(p_i-p'_i)^2}{p'_i} + \frac{(n_i-n'_i)^2}{n'_i} \]

where \(p_i\) is the number of positive/true instances (i.e. customers who will stay on the website) in the i-th child, \(p'_i\) is the expected number of positive instances in the i-th child, \(n_i\) is the number of negative instances in the i-th child and \(n_i\) is the expected number of negative instances in the i-th child. The degree of freedom is v-1. For more detailed explanation, see 'page 93' of the original ID3 paper [1]. Duda et al. 8.3.3 also provides some details on chi-square test. (If you forgot all you learned in Stat 101, check out Appendix A.6 in Duda et al.)

1.4 Use your algorithm to train a decision tree classifier and report accuracy on the test set. Compare accuracies by varying the confidence level in determining split stopping. Use 99%, 95% and 0% (i.e., you always grow the full tree).

**Problem writeup: **

- A high-level description on how your code works.
- The accuracies you obtain under various settings.
- Explain which options work well and why.
- Try to interpret your best performing tree. What path from root to leaf seems to explain the largest fraction of the true-labeled training data? False-labeled? Express these paths as English sentences; do they make sense?
- If all your accuracies are low, tell us what you have tried to improve the accuracies and what you suspect is failing.

**Extra-credit:** Try to improve the predictive accuracy you obtained above. There are many approaches you could try. You could construct new attributes from the existing ones and augment the examples with them. You could try alternative classification techniques, or modify one that you used above. You could also go to the raw clickstream data provided on the KDD Cup website (this is separate from the data aggregated by session) and create new attributes directly from it. Note that this dataset is very large. A bunch of potentially useful pointers and free software can be found at KDnuggets.

[1] J. R. Quinlan, 'Induction of decision trees.' Machine Learning 1 (1) (1986): 81-106 (pdf)

2.1 (Mitchell 10.5) Apply inverse resolution in propositional form to the clauses $C = A \lor B, C_1 = A \lor B \lor G$. Give at least two possible results for $C_2$.

2.2 (Mitchell 10.6) Apply inverse resolution to the clauses $C = R(B,x) \lor P(x,A), C_1 = S(B,y) \lor R(z,x)$. Give at least four possible results for $C_2$. Here $A$ and $B$ are constants, $x$ and $y$ are variables.