CSEP 546: Data Mining (Autumn 2017)
Assignment 1: Decision Trees and Rule Induction

Due: Monday, October 23, 2017 at 12:01am PST.

Problem 1: Clickstream Mining with Decision Trees

The project is based on a task posed in KDD Cup 2000. It involves mining clickstream data collected from Gazelle.com, which sold legwear products. Please browse the KDD Cup website to understand the domain, and read the organizers' report.
Your task is Question 1 from the KDD Cup: Given a set of page views, will the visitor view another page on the site or will he leave?
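
The questions below refer to trees trained at different confidence levels. Assuming the confidence level controls a chi-squared test used to decide when to stop splitting in an ID3-style learner (in the spirit of Quinlan [1]), the sketch below illustrates one way such a test could look; the function name, its signature, and the use of scipy are assumptions for illustration, not part of any provided code.

    # Sketch of a chi-squared split-stopping test for an ID3-style tree.
    # Assumes categorical (e.g. binary) class labels; all names are illustrative.
    from collections import Counter
    from scipy.stats import chi2

    def split_is_significant(parent_labels, child_label_lists, confidence):
        """Return True if the candidate split should be kept at the given
        confidence level (e.g. 0.95); False means stop splitting here."""
        classes = sorted(set(parent_labels))
        parent_counts = Counter(parent_labels)
        n = len(parent_labels)

        stat = 0.0
        for child in child_label_lists:            # one list of labels per branch
            if not child:
                continue
            observed = Counter(child)
            for c in classes:
                expected = parent_counts[c] * len(child) / n
                if expected > 0:
                    stat += (observed[c] - expected) ** 2 / expected

        dof = (len(child_label_lists) - 1) * (len(classes) - 1)
        if dof <= 0:
            return False
        return stat > chi2.ppf(confidence, dof)    # compare to the critical value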

Problem writeup:

1.1 Write a paragraph describing how your code works.

1.2 On the training set, what is the accuracy of the tree trained at each confidence level? What are the precision and recall? (A sketch of how these metrics can be computed appears after this question list.)

1.3 On the test set, what is the accuracy of the tree trained at each confidence level? What are the precision and recall?

1.4 How many decision nodes are there in the tree trained at each confidence level?

1.5 Which level of confidence produces the best results and why? What is your criterion for "the best results"?

1.6 Try to interpret the tree with the highest accuracy (on the test set). What are the first 4 decisions (or fewer, if the tree has fewer than 4 levels) applied to the largest fraction of the positively-labeled training examples? What about the negatively-labeled examples? Express these paths as English sentences. Do they make sense?

1.7 If all your accuracies are low, how have you tried to improve them and what do you suspect is failing?
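
As a reference for 1.2 and 1.3, here is a minimal sketch of the requested metrics, assuming the positive class is "the visitor views another page"; the function name and signature are illustrative, not part of any provided code.

    # Sketch of accuracy, precision, and recall for a binary classifier.
    # `positive` marks which label is treated as the positive class (an assumption).
    def accuracy_precision_recall(y_true, y_pred, positive=1):
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        correct = sum(t == p for t, p in zip(y_true, y_pred))

        accuracy = correct / len(y_true)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return accuracy, precision, recall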

Extra credit: Try to improve the predictive accuracy you obtained above. There are many approaches you could try. You could pick the best attribute to split on differently, or handle unknown values differently. You could construct new attributes from the existing ones and augment the examples with them. You could try alternative classification techniques, or modify the one you used above. You could also go to the raw clickstream data provided on the KDD Cup website (this is separate from the data aggregated by session) and create new attributes directly from it; note that this dataset is very large. Many potentially useful pointers and free software packages can be found at KDnuggets.

[1] J. R. Quinlan. "Induction of decision trees". Machine Learning 1 (1) (1986): 81-106.

Problem 2: Rule Induction

2.1 (Mitchell 10.1) Consider a sequential covering algorithm such as CN2 and a simultaneous covering algorithm such as ID3. Both algorithms are to be used to learn a target concept defined over instances represented by conjunctions of $n$ boolean attributes. If ID3 learns a balanced decision tree of height $d$, it will contain $2^d-1$ distinct decision nodes, and therefore will have made $2^d-1$ distinct choices while constructing its output hypothesis. (A small worked check of this count follows the sub-questions below.)

  1. How many rules will be formed if this tree is re-expressed as a disjunctive set of rules?
  2. How many preconditions will each rule possess?
  3. How many distinct choices would a sequential covering algorithm have to make to learn this same set of rules?
  4. Which system do you suspect would be more prone to overfitting if both were given the same training data? Why?
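
As a concrete check of the count in the problem statement: a balanced tree of height $d = 2$ has one decision node at the root and one at each of its two children, for a total of $2^d - 1 = 2^2 - 1 = 3$ decision nodes, each corresponding to one choice ID3 made while growing the tree.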

2.2 Suppose FOIL is considering adding a literal to a clause using a binary predicate $P$.

  1. If the clause currently contains $n$ different variables, how many distinct literals can be generated? Two literals are considered identical if they differ only in the names of the new variables that they contain.
  2. Why do you think FOIL doesn’t allow literals that contain only new variables (i.e., no previously used variables)?