CSE 592 - Data Mining - Spring 2001 - Homework 2

Due Date: 9 May 2001 6:30pm (The seventh week of class)

Turn-in procedure: Please email grimes@cs.washington.edu before class on 9 May.
This time I will be accepting Word documents, in addition to Postscript, PDF, HTML, or plain text files. Note that zipping up Postscript files is a good idea if you have large figures or rasterized images.

Please use the subject "CSE592: HW2 Submission", and in the text part of the message include your name and student id.

Notes: Mitchell refers to the Machine Learning text by Mitchell.

What is the asymptotic worst-case time complexity of the sequential covering (or "separate and conquer") algorithm for rule induction, as a function of the number of examples n and the number of attributes d? Assume that greedy search is used, and that error rate is the evaluation measure used.
If the result of resolving the clause A OR B OR !D with the clause X is the clause A OR !D OR E, what could X be? Give all possible answers. (! is used to indicate Boolean NOT)
Suppose that A, B and C are symptoms of an illness D, independent of each other given D, and that E, F and G are conditions that make someone more likely to develop D, independent among themselves. Draw a Bayesian network representing these relationships.
Consider the following Bayesian network, in which the variables X, Y and Z are Boolean:
- a) Is Y independent of Z?
- b) Is X independent of Z given Y?
- c) Suppose you are given the following training set, where "?" denotes a missing value:
```
       X  Y  Z
       -------
       1  0  1
       0  1  ?
       0  0  1
       0  0  ?
    
```
  Show the sequence of parameter estimates and filled-in values produced by the EM algorithm.
Draw a neural network equivalent to the decision tree in Figure 3.1 of Mitchell (pg 53). Specify the architecture (connections among neurons, inputs and outputs) and weights. Assume that: Humidity is represented as a Boolean variable with value 1 if the humidity is high and value 0 if the humidity is normal; Wind is represented as a Boolean variable with value 1 if the wind is strong and value 0 if the wind is weak; and Outlook is represented as three Boolean variables O1, O2 and O3, such that O1 = 1 when the outlook is sunny and 0 otherwise, O2 = 1 when the outlook is overcast and 0 otherwise, and O3 = 1 if the outlook is rainy and 0 otherwise.
Consider a population in a genetic algorithm composed of three hypotheses h1, h2 and h3, whose fitnesses are respectively a1 = 10, a2 = 0 and a3 = 1. Suppose that the substitution rate (fraction of the population to be replaced by crossover at each step) is one third, the mutation rate is zero, and rank selection is used. Compute each hypothesis' probability of surviving unchanged into the next generation.