University of Washington Computer Science & Engineering
 CSE 527, Au '04: Miscellaneous exercises

If you'd like some practice with some of the material this quarter, here are a few suggested exercises. You do not need to turn in any of these, but I'm happy to look at/discuss your solutions if that's useful. Some of these problems are very open-ended.
  1. Clustering: In lecture, I was vague about the specifics of implementing hierarchical clustering. Give a more detailed algorithm for implementing complete-, average-, or centroid-link hierarchical clustering and analyze its asymptotic running time as a function of the number n of data points being clustered. Assume your input is the n by n matrix of pairwise distances between data points. You may assume that the distance matrix is symmetric (and that it satisfies the triangle inequality, if that helps). (A naive complete-link baseline is sketched after this list as a starting point.)
  2. Clustering: Average- and centroid-link clustering seem rather similar on the surface. Give a small example where they differ. Hint: the centroid of a set of points stays the same if the points are rotated about it.
  3. Clustering: Work through the derivation of the k-means-like clustering algorithm for univariate data (points on the real line, k >= 2 clusters, different means but a common variance) as an example of EM. (Recall that there is a slight difference between them: k-means makes a specific assignment of each point to one cluster in each iteration, whereas EM uses weighted averages based on membership probabilities. In short, k-means uses "classification EM".) A sketch of the soft EM updates appears after this list for comparison.
  4. Clustering: Can you find a univariate example where k-means (or the model-based EM clustering algorithm) will converge to a local optimum that is inferior to the global optimum?
  5. Clustering: One could imagine doing a "model-based hierarchical clustering": as in standard hierarchical clustering, at each step you merge the two most similar clusters, but in this case, "similarity" is somehow defined to capture goodness of fit to the probabilistic model. Outline a reasonable theory and algorithm for doing this. (One issue to worry about is that small clusters, esp. clusters of size 1, will have degenerate variance estimates.)
  6. Clustering: I was also vague in lecture about the initialization step for model-based clustering. One could do a random initialization, as in k-means, but this tends to be expensive, since (a) the initial clustering may be very far from an optimal one, hence requiring many iterations to converge, and (b) it's more likely to converge to a sub-optimal solution, hence requiring random restarts. The whole thing is exacerbated if you want to do the BIC analysis, which requires doing clustering for many different numbers of clusters and different choices of model types (equal-volume spherical, axis-aligned ellipsoids, etc.). Explore using the solution to the previous problem for this, or invent a better way.
  7. Maximum Likelihood: Let x1, x2, ..., xn be n samples of a normal random variable X with mean theta1 and variance theta2. In class I showed that the maximum likelihood estimates of theta1 and theta2 when both are unknown give a biased estimate of theta2. What is the MLE of theta2 if theta1=mu is assumed to be known? Is it biased?
  8. Maximum Likelihood: Suppose X is a random variable uniformly distributed between 0 and theta>0 for some unknown theta. Based on a sample x1, x2, ..., xn of X, what is the maximum likelihood estimator of theta?
  9. Motif Finding: Pick 10-20 genes from one prokaryotic organism, say E. coli, and run one of the motif finding tools we've discussed (MEME, AlignACE, ...) on the 200 base-pair region upstream of each. Does it find anything interesting, perhaps a TATA box? Repeat with 10-20 human genes. For a somewhat more ambitious exercise, try Footprinter on 10-20 orthologous genes.
  10. Phylogeny: Hand-simulate Fitch's algorithm on a small example to calculate its parsimony score. (A small implementation for checking your work appears after this list.)
  11. RNA Folding: There are some good, curated databases of RNA structures & alignments, e.g. for tRNAs and rRNAs, and the "seed alignments" in Rfam. Pick a few of these sequences, fold them with an RNA structure prediction program such as the Vienna RNA package, and compare to the hand-curated structure.
  12. Linear and Quadratic Discriminant Analysis: Suppose "positive" examples of some trait are drawn from a 2-D Gaussian distribution with fixed mean and variance/covariance, and that "negative" examples are drawn from another Gaussian with a different mean, but the same variance/covariance. Consider a single point x drawn at random from one or the other of these distributions. A natural way to "classify" the point as to whether it is a positive or negative example is to look at the relative likelihoods of the point being drawn from Gaussian 1 vs Gaussian 2. There will be points that are equally likely under the two models. Show that the set of such points is a straight line, and that all points on one side of the line are more likely under one of the two models, and all points on the other side are more likely under the other model. (This is "linear discriminant analysis" or LDA.) Repeat this assuming the two Gaussians do not have the same variance/covariance matrix. In this case, the decision boundary will be quadratic (i.e., this is QDA). Generalization to higher dimension should be straightforward. (A small numerical illustration of the equal-covariance case appears after this list.)
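
For exercise 1, here is a minimal, deliberately naive sketch of complete-link agglomerative clustering driven by a precomputed distance matrix. The function name and the brute-force strategy are my own; treat it as a baseline to analyze and improve, not a finished answer.

import numpy as np

def complete_link(D, k):
    """Naive complete-link agglomerative clustering.

    D : symmetric (n x n) array of pairwise distances.
    k : number of clusters to stop at.
    Returns a list of clusters, each a list of point indices.
    Each merge scans all cluster pairs, so the whole run is O(n^3).
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: cluster distance is the largest
                # pairwise distance between their members.
                d = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Tiny usage example with four points on a line.
pts = np.array([0.0, 1.0, 5.0, 6.0])
D = np.abs(pts[:, None] - pts[None, :])
print(complete_link(D, 2))   # expect [[0, 1], [2, 3]]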
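
For exercise 3, here is a compact sketch of the "soft" EM updates for a univariate mixture of k Gaussians with one shared variance, useful as a check against your hand derivation. The code and its crude initialization are mine; it glosses over convergence testing and restarts.

import numpy as np

def em_common_variance(x, k, iters=100):
    """EM for a univariate Gaussian mixture with a shared variance.

    x : 1-D array of data points, k : number of components.
    Returns (mixing weights, means, common variance).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    pi = np.full(k, 1.0 / k)                     # mixing weights
    mu = np.random.choice(x, k, replace=False)   # crude initialization
    var = np.var(x)
    for _ in range(iters):
        # E-step: membership probabilities (responsibilities).
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimates of the parameters.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum() / n   # one shared variance
    return pi, mu, var

# Usage: two well-separated groups on the real line.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 50), rng.normal(6, 1, 50)])
print(em_common_variance(data, 2))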
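
For exercise 10, here is a short sketch of Fitch's small-parsimony algorithm for a single character on a rooted binary tree, which you can use to check a hand simulation. The tree encoding (nested pairs with leaf states at the tips) is just my own convenience.

def fitch(tree):
    """Fitch's algorithm for one character.

    tree : either a leaf state (e.g. 'A') or a pair (left, right).
    Returns (candidate state set at this node, parsimony score of the subtree).
    """
    if not isinstance(tree, tuple):              # leaf
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:                                   # non-empty intersection: no new change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

# Usage: five leaves with states A, A, G, G, T.
tree = ((('A', 'A'), ('G', 'G')), 'T')
states, score = fitch(tree)
print(states, score)   # parsimony score 2 for this example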
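
For exercise 12, here is a small numerical sketch (not a proof) that classifies points by comparing log densities under two 2-D Gaussians; the parameter values are arbitrary choices of mine. With a shared covariance, the log-likelihood difference is an affine function of x, which you can see by evaluating it on a few points.

import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a 2-D Gaussian at point x."""
    d = x - mean
    inv = np.linalg.inv(cov)
    return -0.5 * d @ inv @ d - 0.5 * np.log(np.linalg.det(2 * np.pi * cov))

# Two Gaussians with the SAME covariance (the LDA case).
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
cov = np.array([[2.0, 0.5], [0.5, 1.0]])

def delta(x):
    # Positive -> more likely under Gaussian 1; zero on the decision boundary.
    return log_gauss(x, mu1, cov) - log_gauss(x, mu2, cov)

# With equal covariances, delta is affine in x, so its zero set
# (the decision boundary) is a straight line; the midpoint of the
# two means lies on it.
for pt in [np.array([0.0, 0.0]), np.array([1.5, 0.5]), np.array([3.0, 1.0])]:
    print(pt, round(delta(pt), 3))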

