## CSEP 546: Data Mining (Autumn 2017) Assignment 1: Decision Trees and Rule Induction

Due: Monday, October 23, 2017 at 12:01am PST.

• Please submit both code and writeup at the submission dropbox.
• Provide all source code (sufficiently commented) so that we can run it ourselves.
• For the programming question you may use C/C++/C#, Java, Python or MATLAB. However, Python and MATLAB are highly recommended, as they are higher-level languages.
• You may use libraries for parsing the data, but you should include any non-standard libraries used in your submission. You may use standard libraries such as "math.h", "scipy" and "numpy" without including them in your submission.
• The writeup should be a typed PDF limited to four pages with reasonable fonts and margins. Submit your writeup as a separate file (i.e., not inside an archive file containing your code).
• Check the discussion board for answers to questions you may have about the assignment or post new questions.
• Follow the discussion board and announcements page for updates and clarifications.

### Problem 1: Clickstream Mining with Decision Trees

The project is based on a task posed in KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com, which sold legwear products. Please browse the KDD Cup website to understand the domain, and read the organizers' report.
Your task is Question 1 from the KDD Cup: Given a set of page views, will the visitor view another page on the site or will he leave?

• Download the discretized (no numerical attributes), cleaned-up training and test datasets. The data is in Attribute-Relation File Format (ARFF).

• Implement the ID3 decision tree learner, as described in Chapter 8 of Duda et al. or Chapter 3 of Mitchell. Another very good description of the algorithm can be found in the original ID3 paper [1], which is pretty readable.

• As some attributes in the dataset can take many values, your program should use the GainRatio (and not the Information Gain) for picking the best attribute.
• Your program should handle unknown values by assuming the most common value of the attribute among the other examples directed to that node that have the same target value. At test time you cannot use the target values for anything other than evaluation, so unknown values in test examples should instead be replaced with the most common value among the training examples directed to that node. Note that this must be done at every node; it is not a pre-processing step.
• Your program should assume input in ARFF.
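
As a starting point, the gain ratio is the information gain of a split divided by the split information of the candidate attribute. A minimal sketch in Python (function and variable names are our own; this is not a complete learner):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(examples, labels, attr):
    """Information gain of splitting on attribute index `attr`,
    divided by the split information of that attribute."""
    total = len(examples)
    children = {}  # attribute value -> labels of examples with that value
    for ex, lab in zip(examples, labels):
        children.setdefault(ex[attr], []).append(lab)
    remainder = sum(len(part) / total * entropy(part)
                    for part in children.values())
    gain = entropy(labels) - remainder
    split_info = -sum((len(part) / total) * math.log2(len(part) / total)
                      for part in children.values())
    # If every example shares one value, split_info is 0; return 0.0 so
    # such an attribute is never preferred.
    return gain / split_info if split_info > 0 else 0.0
```

The `split_info > 0` guard matters in practice: an attribute with a single observed value carries no information, and dividing by a zero split information would otherwise crash the learner.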

For initial debugging, it is recommended that you construct a very simple data set (e.g., based on a boolean formula) and test your program on it.
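
For example, a minimal ARFF file encoding XOR (the relation and attribute names here are purely illustrative) might look like:

```
% Tiny debugging dataset: class = a XOR b
@relation xor
@attribute a {0,1}
@attribute b {0,1}
@attribute class {0,1}
@data
0,0,0
0,1,1
1,0,1
1,1,0
```

XOR is a useful edge case: neither attribute alone reduces entropy at the root, yet a correct learner should still grow a depth-2 tree that classifies all four examples.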

• Implement split stopping using chi-square test. The test-statistic is:
$\sum_{i=1}^v \left( \frac{(p_i-p'_i)^2}{p'_i} + \frac{(n_i-n'_i)^2}{n'_i} \right)$
where $p_i$ is the number of positive/true examples (i.e., customers who will stay on the website) in the $i$-th child, $p'_i$ is the expected number of positive examples in the $i$-th child, $n_i$ is the number of negative examples in the $i$-th child, and $n'_i$ is the expected number of negative examples in the $i$-th child. Both terms are summed over all $v$ children. The number of degrees of freedom is $v-1$, where $v$ is the number of possible values of the attribute on which the node is split.

For a more detailed explanation, see page 93 of the original ID3 paper [1]. Duda et al., Section 8.3.3, also provides some details on the chi-square test. If you have forgotten all you learned in Stat 101, check out Appendix A.6 in Duda et al. Note that the values in the header of Table A.2 in Duda et al. actually refer to the p-value, which is 1 minus the confidence level.

You may find the functions scipy.stats.chi2.isf in Python or chi2inv in MATLAB useful for computing the critical values of chi-square.
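
Putting the statistic and the critical value together, the stopping test might be sketched as follows (the function name and the convention that `confidence=0.95` denotes the 95% level are our own assumptions):

```python
# Sketch of chi-square split stopping; requires scipy.
from scipy.stats import chi2

def should_stop(pos_counts, neg_counts, confidence):
    """Return True if the candidate split is NOT significant at the given
    confidence level, i.e., growth should stop at this node.

    pos_counts[i] / neg_counts[i] are the observed positive/negative counts
    in the i-th child; expected counts assume each child preserves the
    parent's class ratio. With confidence = 0.0 the critical value is 0,
    so the test never stops growth and the full tree is grown.
    """
    p, n = sum(pos_counts), sum(neg_counts)
    total = p + n
    statistic = 0.0
    for p_i, n_i in zip(pos_counts, neg_counts):
        child_size = p_i + n_i
        p_exp = p * child_size / total   # expected positives in child i
        n_exp = n * child_size / total   # expected negatives in child i
        if p_exp > 0:
            statistic += (p_i - p_exp) ** 2 / p_exp
        if n_exp > 0:
            statistic += (n_i - n_exp) ** 2 / n_exp
    dof = len(pos_counts) - 1            # v - 1 degrees of freedom
    critical = chi2.isf(1.0 - confidence, dof)
    return statistic < critical
```

Note that `chi2.isf` takes the upper-tail probability, which is exactly the p-value (1 minus the confidence level) discussed above.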

• Use your algorithm to train a decision tree classifier at each of the following confidence levels: 0% (i.e., the algorithm grows the full tree), 50%, 80%, 95% and 99%. Evaluate the trees' accuracy, precision and recall on the training data and on test data (separately).
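
For the evaluation step, a small helper along these lines may be useful; the positive label should be whatever value in the ARFF file marks "visitor views another page" (the default `"1"` below is only an illustrative assumption):

```python
def evaluate(predictions, actuals, positive_label="1"):
    """Accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(predictions, actuals))
    tp = sum(p == positive_label == a for p, a in pairs)  # true positives
    fp = sum(p == positive_label != a for p, a in pairs)  # false positives
    fn = sum(p != positive_label == a for p, a in pairs)  # false negatives
    accuracy = sum(p == a for p, a in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

Run it once on the training set and once on the test set for each confidence level, keeping the two sets of numbers separate in your writeup.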

#### Problem writeup:

1.1 Write a paragraph describing how your code works.

1.2 On the training set, what is the accuracy of the tree trained at each confidence level? What are the precision and recall?

1.3 On the test set, what is the accuracy of the tree trained at each confidence level? What are the precision and recall?

1.4 How many decision nodes are there in the tree trained at each confidence level?

1.5 Which level of confidence produces the best results and why? What is your criterion for "the best results"?

1.6 Try to interpret the tree with the highest accuracy (on the test set). What are the first 4 decisions (or fewer, if the tree has fewer than 4 levels) applied to the largest fraction of the positively-labeled training examples? Negatively-labeled? Express these paths as English sentences. Do they make sense?

1.7 If all your accuracies are low, how have you tried to improve them and what do you suspect is failing?

Extra-credit: Try to improve the predictive accuracy you obtained above. There are many approaches you could try. You could pick the best attribute to split on differently, or handle unknown values differently. You could construct new attributes from the existing ones and augment the examples with them. You could try alternative classification techniques, or modify one that you used above. You could also go to the raw clickstream data provided on the KDD Cup website (this is separate from the data aggregated by session) and create new attributes directly from it. Note that this dataset is very large. A bunch of potentially useful pointers and free software can be found at KDnuggets.

[1] J. R. Quinlan. "Induction of decision trees". Machine Learning 1 (1) (1986): 81-106.

### Problem 2: Rule Induction

2.1 (Mitchell 10.1) Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of n boolean attributes. If ID3 learns a balanced decision tree of height $d$, it will contain $2^d-1$ distinct decision nodes, and therefore will have made $2^d-1$ distinct choices while constructing its output hypothesis.

1. How many rules will be formed if this tree is re-expressed as a disjunctive set of rules?
2. How many preconditions will each rule possess?
3. How many distinct choices would a sequential covering algorithm have to make to learn this same set of rules?
4. Which system do you suspect would be more prone to overfitting if both were given the same training data? Why?

2.2 Suppose FOIL is considering adding a literal to a clause using a binary predicate $P$.

1. If the clause currently contains $n$ different variables, how many distinct literals can be generated? Two literals are considered identical if they differ only in the names of the new variables that they contain.
2. Why do you think FOIL doesn’t allow literals that contain only new variables, i.e. no previously used variables?