CSEP 546 - Data Mining - Spring 2003 - Homework 1

Due Date: 16 April 2003, 6:30pm (the third week of class)

Turn-in procedure: Email your homework file(s) to parag@cs.washington.edu before class on 16 April. Word, PostScript, PDF, HTML, and plain text are all acceptable formats.

Please use the subject "CSEP546: HW1 Submission", and include your name and student ID in the text of the message.

You can also submit a hard copy of your work at the beginning of class on 16 April.

Notes: All homeworks are to be done individually. H&K refers to the Data Mining text by Han and Kamber.

  1. H&K - 2.3 a), b), c)

  2. H&K - 2.4

  3. H&K - 2.12 a)

  4. Suppose there are two continuous attributes X and Y, the instance space is the square {(X, Y): 0 < X < 4, 0 < Y < 4}, and examples are uniformly distributed within it. Suppose that the class is positive if X > 4 - Y, and negative otherwise. (A sketch for empirically checking error rates follows this question.)
    1. Draw a decision tree with seven internal nodes that approximates this concept as closely as possible. (The only tests allowed in nodes are of the form "X < A ?" or "Y < B ?", for any constants A and B.)
    2. What is the error rate of this decision tree?
    3. What is the error rate of the decision tree(s) with 15 internal nodes that best approximate(s) this concept? Why?
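
     If you want to sanity-check the error rate of a candidate tree
     empirically, a Monte Carlo sketch along the following lines may help
     (Python; tree_predict is a hypothetical placeholder that you would
     replace with the nested tests from your own tree):

        import random

        def true_label(x, y):
            # Target concept: positive iff X > 4 - Y,
            # i.e. the point lies above the line X + Y = 4.
            return x > 4 - y

        def tree_predict(x, y):
            # Placeholder one-node tree; substitute your own nested
            # "X < A ?" / "Y < B ?" tests here.
            return x >= 2

        def estimate_error(n=1_000_000):
            # Sample points uniformly from the square (0,4) x (0,4) and
            # count how often the tree disagrees with the true concept.
            errors = 0
            for _ in range(n):
                x, y = random.uniform(0, 4), random.uniform(0, 4)
                if tree_predict(x, y) != true_label(x, y):
                    errors += 1
            return errors / n

        print(estimate_error())  # about 0.25 for the one-node placeholder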

  5. If there are three equally probable classes, what is the highest possible information gain of a Boolean attribute? Why?
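
     (For reference, entropy and information gain here follow the standard
     definitions. The Python helper below is only for experimenting with
     candidate splits; the function names and arguments are illustrative.)

        from math import log2

        def entropy(probs):
            # H = -sum_i p_i log2 p_i, in bits; terms with p = 0 contribute 0.
            return -sum(p * log2(p) for p in probs if p > 0)

        # Prior entropy with three equally probable classes: log2(3) ~ 1.585 bits.
        print(entropy([1/3, 1/3, 1/3]))

        def gain(w, left_probs, right_probs):
            # Information gain of a Boolean split that sends a fraction w of
            # the examples to the left branch: prior entropy minus the
            # weighted average entropy of the class distributions
            # (left_probs, right_probs) within the two branches.
            prior = entropy([1/3, 1/3, 1/3])
            return prior - (w * entropy(left_probs) + (1 - w) * entropy(right_probs))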

  6. This question is about reduced-error pruning of decision trees. Suppose the validation-set error rate of the subtree rooted at node N is greater than the error rate obtained by converting N into a leaf. Is it possible that, in the optimal pruned tree, N is not a leaf? Why?
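
     (For reference, one common bottom-up formulation of reduced-error
     pruning is sketched below; the tree representation and field names are
     illustrative, not from H&K.)

        from dataclasses import dataclass
        from typing import Callable, Optional

        @dataclass
        class Node:
            # A leaf predicts `label`; an internal node routes an example to
            # `left` when `test(example)` is true, otherwise to `right`.
            label: str                       # majority class at this node
            test: Optional[Callable] = None  # e.g. lambda ex: ex["X"] < 2.0
            left: Optional["Node"] = None
            right: Optional["Node"] = None

        def predict(node, ex):
            if node.test is None:
                return node.label
            return predict(node.left if node.test(ex) else node.right, ex)

        def errors(node, examples):
            return sum(1 for ex in examples if predict(node, ex) != ex["label"])

        def reduced_error_prune(node, val):
            # Prune the children on their share of the validation set first,
            # then convert this node into a leaf if doing so does not
            # increase validation-set error.
            if node.test is None:
                return node
            node.left = reduced_error_prune(node.left, [ex for ex in val if node.test(ex)])
            node.right = reduced_error_prune(node.right, [ex for ex in val if not node.test(ex)])
            as_leaf = Node(label=node.label)
            if errors(as_leaf, val) <= errors(node, val):
                return as_leaf
            return node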