CSEP 546 - Data Mining - Autumn 2004 - Homework 1

Due Date: 19 October 2004, 6:30pm (the third week of class)

Turn-in procedure: Email your homework file(s) to parag@cs.washington.edu before class on 19 October. Word, PostScript, PDF, HTML, and plain text are all fine.

Please use the subject "CSEP546: HW1 Submission", and in the text part of the message include your name and student id.

You can also submit a hardcopy of your work at the beginning of the class on October 19th.

Notes: All homeworks are to be done individually. H&K refers to the course text, Data Mining: Concepts and Techniques, by Han and Kamber.

  1. H&K - 2.4

  2. H&K - 2.6
    Note: By "Design a data warehouse" we mean: "Select a schema (e.g, star, snowflake) and design the fact table and dimension tables (i.e., choose the dimensions and measures to include, etc.). Justify your choices, discussing the relevant issues and alternatives."

  3. H&K - 2.7 b)

  4. Suppose that a learning algorithm is trying to find a consistent hypothesis when the (Boolean) classes of examples are actually completely random (i.e., the class of an example is 0 or 1 with equal probability). If there are n Boolean attributes and the training set is composed of m examples drawn uniformly without replacement from the set of 2^n possible examples, what is the probability of finding a contradiction (i.e., the same example with different classes) in the data?
    Note: By "without replacement" we mean that the same example with the same class cannot appear more than once in the data, but it could appear once with class 0 and once with class 1.

  5. H&K - 7.6 a) b)

  6. Consider the following decision tree, where W, X and Y are the attributes tested at the nodes (all Boolean), A, B, C, D are the classes predicted at the leaves, and the left and right branches of each node correspond respectively to the values True and False. Each leaf is labeled with its accuracy on the validation set, each internal node is labeled with the accuracy on the validation set it would have if was a leaf, and each branch is labeled with the fraction of examples from the parent node that follow that branch. Which subtrees would you prune?