CSE 592 - Data Mining - Spring 2001 - Homework 3

Due Date: 30 May 2001 6:30pm (The tenth (and last) week of class)

Turn-in procedure: Please email grimes@cs.washington.edu before class on 30 May.
I will be accepting Word documents, in addition to Postscript, PDF, HTML, or plain text files. Note that zipping up Postscript files is a good idea if you have large figures or rasterized images.

Please use the subject "CSE592: HW3 Submission", and in the text part of the message include your name and student id.

  1. Which algorithm has higher variance: bagging or boosting? And which one has higher bias? Why?

  2. Let the instance space be the rectangle {(x,y): 0 < x < 40, 0 < y < 30}, and let + and - represent positive and negative training examples, respectively. Suppose you are given the following training set:
                                 -      +
                  y       -      +      -      +
                                 -      +
                       0     10     20     30     40 
    • a) Draw the Voronoi diagram corresponding to this training set, using Euclidean distance.
    • b) If this Voronoi diagram represents the true concept, and examples are drawn with uniform probability from the rectangle {(x,y): 0 < x < 40, 0 < y < 30}, what is the probability of drawing a positive example?

  3. Suppose we have the following training set for y as a function of x, where examples are in the form (x,y): {(0,0), (1,2), (2,3), (3,3), (4,0), (5,1), (6,0)}. Suppose we model y=f(x) using locally weighted linear regression, minimizing the squared error over the 2 nearest neighbors. Draw a graph of the function y=f(x) implicitly induced.

  4. Let H be the set of hypotheses of the form x = x_0 v x = x_1, with x_0, x_1 \in X (i.e., x_0 and x_1 are arbitrary members of the instance space). What is the VC dimension of H?

  5. Question 6.3 in Han & Kamber, with modified part a):
    • a) Find all frequent itemsets using Apriori.
    • b) Same as Han & Kamber.