Due: Monday, October 23, 2017 at 12:01am PST.
The project is based on a task posed in KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com, which sold legwear products. Please browse the KDD Cup website to understand the domain, and read the organizer's report.
Your task is Question 1 from the KDD Cup: Given a set of page views, will the visitor view another page on the site or will he leave?
Download the discretized (no numerical attributes), cleaned-up training and test datasets. The data is in Attribute-Relation File Format (ARFF).
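If you prefer not to write your own ARFF parser, scipy can read ARFF files; below is a minimal sketch (the filename is a placeholder, substitute whatever you named the downloaded training/test files):

```python
from scipy.io import arff

# Load the discretized training data; 'training.arff' is a placeholder filename.
data, meta = arff.loadarff('training.arff')

# Nominal attribute values come back as byte strings (e.g., b'?' for unknowns),
# so decode them into ordinary Python strings. All attributes are nominal here
# since the provided data is discretized.
examples = [{name: row[name].decode() for name in meta.names()} for row in data]
```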
Implement the ID3 decision tree learner, as described in Chapter 8 of Duda et al. or Chapter 3 of Mitchell. Another very good description of the algorithm can be found in the original ID3 paper [1], which is pretty readable.
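As a reference point, here is a minimal sketch of the entropy and information-gain computation at the core of ID3; the function and variable names are just illustrative, and your own representation of examples may differ:

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attribute):
    """Gain from splitting `examples` (a list of dicts) on `attribute`."""
    parent_entropy = entropy(labels)
    partitions = defaultdict(list)
    for example, label in zip(examples, labels):
        partitions[example[attribute]].append(label)
    # Weighted entropy of the children after the split.
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return parent_entropy - remainder
```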
For initial debugging, it is recommended that you construct a very simple data set (e.g., based on a boolean formula) and test your program on it.
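For instance, you might enumerate all assignments of a small boolean formula such as (A and B) or C (purely an illustration) and check that the fully grown tree classifies every example correctly:

```python
from itertools import product

# Toy debugging data: label = (A and B) or C, enumerated over all boolean inputs.
toy_examples = [{'A': a, 'B': b, 'C': c} for a, b, c in product([False, True], repeat=3)]
toy_labels = [(ex['A'] and ex['B']) or ex['C'] for ex in toy_examples]

# A correct ID3 implementation (with no split stopping) should classify
# all of these training examples perfectly.
```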
Implement split stopping using the chi-square test. The test statistic is:
\[ \sum_{i=1}^{v} \left( \frac{(p_i - p'_i)^2}{p'_i} + \frac{(n_i - n'_i)^2}{n'_i} \right) \]
where \(p_i\) is the number of positive/true examples (i.e., customers who will stay on the website) in the \(i\)-th child, \(p'_i\) is the expected number of positive examples in the \(i\)-th child, \(n_i\) is the number of negative examples in the \(i\)-th child, and \(n'_i\) is the expected number of negative examples in the \(i\)-th child. The number of degrees of freedom is \(v-1\), where \(v\) is the number of possible values of the attribute on which the node is split.
For a more detailed explanation, see page 93 of the original ID3 paper [1]. Section 8.3.3 of Duda et al. also provides some details on the chi-square test. If you have forgotten everything you learned in Stat 101, check out Appendix A.6 in Duda et al. Note that the values in the header of Table A.2 in Duda et al. actually refer to the p-value, which is 1 minus the confidence level.
You may find the function scipy.stats.chi2.isf in Python or chi2inv in MATLAB useful for computing chi-square critical values.
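To make the pieces concrete, here is one possible sketch of the stopping test, assuming you have already tallied the \((p_i, n_i)\) counts for each child of a candidate split; the function names and the representation of the counts are assumptions for illustration, not part of the assignment:

```python
from scipy.stats import chi2

def chi_square_statistic(child_counts):
    """child_counts: list of (p_i, n_i) pairs, one per child of the candidate split."""
    p = sum(pi for pi, ni in child_counts)      # positive examples at the parent node
    n = sum(ni for pi, ni in child_counts)      # negative examples at the parent node
    stat = 0.0
    for pi, ni in child_counts:
        ti = pi + ni                            # examples sent to this child
        p_expected = p * ti / (p + n)           # p'_i
        n_expected = n * ti / (p + n)           # n'_i
        if p_expected > 0:
            stat += (pi - p_expected) ** 2 / p_expected
        if n_expected > 0:
            stat += (ni - n_expected) ** 2 / n_expected
    return stat

def should_split(child_counts, confidence):
    """Split only if the statistic exceeds the critical value at the given confidence level."""
    if confidence == 0.0:                       # 0% confidence: always split (full tree)
        return True
    dof = len(child_counts) - 1                 # v - 1 degrees of freedom
    critical = chi2.isf(1.0 - confidence, dof)  # isf takes the p-value (1 - confidence)
    return chi_square_statistic(child_counts) > critical
```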
Use your algorithm to train a decision tree classifier at each of the following confidence levels: 0% (i.e., the algorithm grows the full tree), 50%, 80%, 95% and 99%. Evaluate the trees' accuracy, precision and recall on the training data and on test data (separately).
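For the evaluation, all three metrics can be computed directly from confusion-matrix counts; a small sketch (names illustrative), treating "will view another page" as the positive class:

```python
def evaluate(predictions, truths):
    """Accuracy, precision, and recall for boolean predictions against true labels."""
    tp = sum(1 for p, t in zip(predictions, truths) if p and t)
    fp = sum(1 for p, t in zip(predictions, truths) if p and not t)
    fn = sum(1 for p, t in zip(predictions, truths) if not p and t)
    tn = sum(1 for p, t in zip(predictions, truths) if not p and not t)
    accuracy = (tp + tn) / len(truths)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```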
1.1 Write a paragraph describing how your code works.
1.2 On the training set, what is the accuracy of the tree trained at each confidence level? What are the precision and recall?
1.3 On the test set, what is the accuracy of the tree trained at each confidence level? What are the precision and recall?
1.4 How many decision nodes are there in the tree trained at each confidence level?
1.5 Which level of confidence produces the best results and why? What is your criterion for "the best results"?
1.6 Try to interpret the tree with the highest accuracy (on the test set). What are the first 4 decisions (or fewer if the tree has fewer than 4 levels) applied to the largest fraction of the positively-labeled training examples? Negatively-labeled? Express these paths as English sentences. Do they make sense?
1.7 If all your accuracies are low, how have you tried to improve them and what do you suspect is failing?
Extra-credit: Try to improve the predictive accuracy you obtained above. There are many approaches you could try. You could change how the best attribute to split on is picked, or handle unknown values differently. You could construct new attributes from the existing ones and augment the examples with them. You could try alternative classification techniques, or modify one that you used above. You could also go to the raw clickstream data provided on the KDD Cup website (this is separate from the data aggregated by session) and create new attributes directly from it. Note that this dataset is very large. A number of potentially useful pointers and free software can be found at KDnuggets.
[1] J. R. Quinlan. "Induction of decision trees". Machine Learning 1 (1) (1986): 81-106.
2.1 (Mitchell 10.1) Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of $n$ boolean attributes. If ID3 learns a balanced decision tree of height $d$, it will contain $2^d-1$ distinct decision nodes, and therefore will have made $2^d-1$ distinct choices while constructing its output hypothesis.
2.2 Suppose FOIL is considering adding a literal to a clause using a binary predicate $P$.