CSEP 546 - Data Mining - Autumn 2004 - Homework 2

Due Date: 16 Nov 2004 6:30pm

Turn-in procedure: Please email parag@cs.washington.edu before class on Nov 16. Any of the Word, PostScript, PDF, HTML, or plain-text formats is fine.

Please use the subject "CSEP546: HW2 Submission", and include your name and student ID in the body of the message.

You can also submit a hard copy of your work at the beginning of class on Nov 16.

Note: All homework is to be done individually.

  1. What is the asymptotic worst-case time complexity of the sequential covering (or "separate and conquer") algorithm for rule induction, as a function of the number of examples n and the number of attributes d? Assume that greedy search is used and that error rate is the evaluation measure.
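
     For reference, the generic separate-and-conquer loop can be sketched as
     follows. This is an illustrative Python sketch, not part of the question:
     the data layout (examples as dicts with a "label" key) and the helper
     names are assumptions made for concreteness.

        def error_rate(examples):
            """Fraction of the covered examples that are not positive."""
            positives = sum(1 for e in examples if e["label"] == 1)
            return 1 - positives / len(examples)

        def learn_one_rule(examples, attributes):
            """Greedily add the attribute=value test that most reduces the
            error rate over the covered set; stop when no test helps."""
            rule = {}                                  # attribute -> value
            covered = examples
            while error_rate(covered) > 0:
                best_err, best_test, best_cov = error_rate(covered), None, covered
                for a in attributes:                   # attribute names, excluding "label"
                    if a in rule:
                        continue
                    for v in {e[a] for e in covered}:
                        subset = [e for e in covered if e[a] == v]
                        err = error_rate(subset)
                        if err < best_err:
                            best_err, best_test, best_cov = err, (a, v), subset
                if best_test is None:                  # no test improves the rule
                    break
                rule[best_test[0]] = best_test[1]
                covered = best_cov
            return rule, covered

        def sequential_covering(examples, attributes):
            """Separate and conquer: learn one rule, remove the examples it
            covers, and repeat until no positive examples remain."""
            rules = []
            while any(e["label"] == 1 for e in examples):
                rule, covered = learn_one_rule(examples, attributes)
                rules.append(rule)
                examples = [e for e in examples if e not in covered]
            return rules

     Counting how many times each of these nested loops can execute, in terms
     of n and d, is exactly what the question asks for.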

  2. Planets around distant stars can sometimes be detected by the tiny wobble they cause in the star's trajectory. Suppose that 2% of all stars have planets orbiting them, 75% of stars with planets show a wobble in their trajectory, and 10% of stars without planets show a wobble due to other causes. If a star shows a wobble in its trajectory, what is the probability that it has planets orbiting it?
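
     This is a direct application of Bayes' theorem. Writing L for "has
     planets" and W for "shows a wobble", the quantity asked for is

        P(L \mid W) = \frac{P(W \mid L)\, P(L)}
                           {P(W \mid L)\, P(L) + P(W \mid \neg L)\, P(\neg L)}

     where the three percentages given above supply P(L), P(W \mid L), and
     P(W \mid \neg L), respectively.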

  3. Consider a Bayesian network with two Boolean nodes A and B, where A is the parent of B. Suppose you have a training set composed of the following examples in the form (A,B), with "?" indicating a missing value: (0,0), (0, ?), (0, 1), (0, ?), (1, 1), (1, ?). Show the sequence of filled-in values and parameter estimates produced by the EM algorithm, assuming the parameters are initialized ignoring missing values.
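
     A minimal sketch of the EM loop for this structure, in Python. The soft
     (expected-count) filling shown here is one standard convention, and the
     variable names are illustrative rather than prescribed by the question.

        def em_two_node(data, iterations=10):
            """EM for a two-node Boolean network A -> B.

            data: list of (a, b) pairs, with b = None when missing.
            Parameters: pA = P(A=1) and pB[a] = P(B=1 | A=a).
            """
            # Initialize ignoring the missing values, as the problem specifies.
            pA = sum(a for a, _ in data) / len(data)
            pB = {}
            for a_val in (0, 1):
                observed = [b for a, b in data if a == a_val and b is not None]
                pB[a_val] = sum(observed) / len(observed)
            for _ in range(iterations):
                # E-step: the expected value of a missing B is P(B=1 | A=a),
                # which is just pB[a] since B has no observed children.
                filled = [(a, b if b is not None else pB[a]) for a, b in data]
                # M-step: re-estimate the parameters from the expected counts.
                pA = sum(a for a, _ in filled) / len(filled)
                for a_val in (0, 1):
                    rows = [b for a, b in filled if a == a_val]
                    pB[a_val] = sum(rows) / len(rows)
                print(filled, pA, pB)   # one step of the requested sequence
            return pA, pB

        em_two_node([(0, 0), (0, None), (0, 1), (0, None), (1, 1), (1, None)])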

  4. Mitchell, Exercise 4.5.

  5. Consider a population composed of three hypotheses, h1, h2, and h3, whose fitnesses are f1 = 10, f2 = 0, and f3 = 1, respectively. Suppose the substitution rate is 33.3%, the mutation rate is 0%, and rank selection is used. Compute each hypothesis's probability of surviving unchanged into the next generation.
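
     As a reminder of the selection scheme (following the convention in the
     course text; other rank-selection variants exist): the hypotheses are
     sorted by fitness, and each is selected with probability proportional to
     its rank r rather than to its fitness, with r = 1 for the least fit of
     the n hypotheses:

        P(\text{select } h_{(r)}) = \frac{r}{\sum_{i=1}^{n} i} = \frac{2r}{n(n+1)}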

  6. Decision stumps are decision trees with only one internal node (the root). Which ensemble method would you use to improve the accuracy of decision stumps: bagging or boosting? Why?
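
     To make the question concrete, a decision stump over numeric features
     can be learned as below (an illustrative Python sketch; the data layout
     and the exhaustive search are assumptions made for concreteness). The
     stump simply picks the single feature/threshold test with the lowest
     training error.

        def fit_stump(X, y):
            """Exhaustively pick the best single-split classifier.

            X: list of numeric feature vectors; y: list of +1/-1 labels.
            Returns (training error, feature index, threshold, polarity).
            """
            best = None
            for j in range(len(X[0])):                 # each feature
                for t in sorted({x[j] for x in X}):    # each candidate threshold
                    for sign in (+1, -1):              # each polarity
                        predictions = [sign if x[j] <= t else -sign for x in X]
                        error = sum(p != label
                                    for p, label in zip(predictions, y)) / len(y)
                        if best is None or error < best[0]:
                            best = (error, j, t, sign)
            return best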