# CSE 592 - Data Mining - Spring 2001 - Homework 1

## Due Date: 11 April 2001 6:30pm (The third week of class)

Turn-in procedure: Please email grimes@cs.washington.edu before class on 11 April. I would highly prefer not to get Word documents. Postscript, PDF, HTML, or plain text would be much easier for me to read (on my Linux machine). Please email me if you don't know how to generate any of these formats from Word. Note that zipping (with either WinZip/PKZip or gzip) Postscript files is a really good idea.

Please use the subject "CSE592: HW1 Submission", and in the text part of the message include your name and student id.

Notes: All homeworks are to be done individually. H&K refers to the Data Mining text by Han and Kamber. Mitchell refers to the Machine Learning text by Mitchell. The questions from Mitchell are included below because of the difficulty some people have had obtaining the book.

1. H&K - 2.4

2. H&K - 2.6 Note: By "Design a data warehouse" we mean: "Select a schema (e.g., star, snowflake) and design the fact table and dimension tables (i.e., choose the dimensions and measures to include, etc.). Justify your choices, discussing the relevant issues and alternatives."

3. H&K - 2.7(b)

4. Mitchell - 3.1 : Give decision trees to represent the following boolean functions:
• (a) A AND !B
• (b) A OR [B AND C]
• (c) A XOR B
• (d) [A AND B] OR [C AND D]
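For question 4, it may help to remember that a decision tree over boolean attributes is just a sequence of nested attribute tests with classifications at the leaves. The sketch below illustrates this for A AND B, which is deliberately not one of the functions listed above:

```python
# A decision tree for the boolean function A AND B (a warm-up
# illustration only, not one of the homework functions).
# Each internal node tests a single attribute; each leaf is a
# classification (True = +, False = -).
def tree_a_and_b(A, B):
    if A:                 # root node: test attribute A
        if B:             # A=True branch: test attribute B
            return True   # leaf: +
        return False      # leaf: -
    return False          # leaf: - (B is never examined on this path)
```

Note that when A is false the tree classifies without ever testing B; a decision tree need not examine every attribute on every path.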

5. Mitchell - 3.2 : Consider the following set of training examples:

| Instance | Classification | A1 | A2 |
|----------|----------------|----|----|
| 1        | +              | T  | T  |
| 2        | +              | T  | T  |
| 3        | -              | T  | F  |
| 4        | +              | F  | F  |
| 5        | -              | F  | T  |
| 6        | -              | F  | T  |
• (a) What is the entropy of this collection of training examples with respect to the target function classification?
• (b) What is the information gain of A2 relative to these training examples?
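For question 5, once you have worked out the entropy and information-gain values by hand, a few lines of code can serve as a sanity check. This is only a sketch of the standard definitions (entropy in bits; gain as entropy minus the weighted entropy of each split), not a substitute for showing your work:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, attr_values):
    """Information gain from splitting `labels` on `attr_values`."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr_values):
        subset = [y for y, a in zip(labels, attr_values) if a == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# The six training examples from question 5 (Mitchell 3.2).
labels = ['+', '+', '-', '+', '-', '-']
a1     = ['T', 'T', 'T', 'F', 'F', 'F']
a2     = ['T', 'T', 'F', 'F', 'T', 'T']

print(entropy(labels))      # entropy of the collection
print(info_gain(labels, a2))  # information gain of A2
```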

6. Why is the validation set used for pruning a decision tree usually smaller than the one used to grow the tree?