If you'd like some practice with some of the material this quarter,
here are a few suggested exercises. You do not need to turn in
any of these, but I'm happy to look at/discuss your solutions
if that's useful. Some of these problems are very open-ended.
Clustering: In lecture, I was vague about the
specifics of implementing hierarchical clustering. Give a more
detailed algorithm for implementing complete-, average-, or
centroid-link hierarchical clustering and analyze its
asymptotic running time as a function of the number n of
data points being clustered. Assume your input is the n
by n matrix of pairwise distances between data points.
You may assume that the distance matrix is symmetrical (and that
it satisfies the triangle inequality, if that helps).
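If you want a concrete version to analyze, here is a minimal Python sketch of complete-link clustering working straight from the distance matrix; it is deliberately naive (a smarter data structure can do better), and the function name and the list-of-lists representation of D are just my choices for illustration.

    def complete_link(D):
        """Naive complete-link agglomerative clustering from an n x n distance matrix.

        D is a list of lists (symmetric, zero diagonal).  Returns the merge
        history as (cluster_id_a, cluster_id_b, merge_height) triples.
        """
        clusters = {i: [i] for i in range(len(D))}   # active clusters, keyed by an integer id
        merges = []
        next_id = len(D)
        while len(clusters) > 1:
            best = None
            # scan every pair of active clusters for the smallest complete-link distance,
            # i.e. the smallest value of the *largest* inter-point distance
            for a in clusters:
                for b in clusters:
                    if a < b:
                        d = max(D[i][j] for i in clusters[a] for j in clusters[b])
                        if best is None or d < best[0]:
                            best = (d, a, b)
            d, a, b = best
            merges.append((a, b, d))
            clusters[next_id] = clusters.pop(a) + clusters.pop(b)
            next_id += 1
        return merges

The outer loop runs n - 1 times and each pass re-examines every pair of active clusters; counting the work hidden inside the max() gives the kind of asymptotic bound the exercise asks for.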
Clustering: Average- and centroid-link clustering
seem rather similar on the surface. Give a small example where
they differ. Hint: the centroid of a set of points stays the
same if the points are rotated about it.
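While experimenting, it may help to compute the two inter-cluster distances directly; here is a small pure-Python sketch using Euclidean distance (the names and the example points are mine).

    from math import dist   # Euclidean distance (Python 3.8+)

    def average_link(A, B):
        """Average-link distance: mean of all pairwise distances between A and B."""
        return sum(dist(a, b) for a in A for b in B) / (len(A) * len(B))

    def centroid_link(A, B):
        """Centroid-link distance: distance between the coordinate-wise means of A and B."""
        centroid = lambda C: tuple(sum(coord) / len(C) for coord in zip(*C))
        return dist(centroid(A), centroid(B))

    # Try it on whatever small point sets you come up with, e.g.:
    A = [(0.0, 0.0), (0.0, 2.0)]
    B = [(3.0, 0.0), (3.0, 2.0)]
    print(average_link(A, B), centroid_link(A, B))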
Clustering: Work through the derivation of the
k-means-like clustering algorithm for univariate data (points on
the real line, k >= 2 clusters, different means but a common
variance) as an example of EM. (Recall that there is a slight
difference between them; k-means makes a specific assignment of
each point to one cluster in each iteration, whereas EM uses
weighted averages based on membership probabilities. In short,
k-means uses "classification EM".)
Clustering: Can you find a univariate example where
k-means (or the model-based EM clustering algorithm) will
converge to a local optimum that is inferior to the global
optimum?
Clustering: One could imagine doing a "model-based
hierarchical clustering": as in standard hierarchical
clustering, at each step you merge the two most similar
clusters, but in this case, "similarity" is somehow defined to
capture goodness of fit to the probabilistic model. Outline a
reasonable theory and algorithm for doing this. (One issue to
worry about is that small clusters, esp. clusters of size 1,
will have degenerate variance estimates.)
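One possible direction, sketched below for univariate data only: score a candidate merge by how much total log-likelihood it loses under a one-Gaussian-per-cluster model, with an arbitrary variance floor so singleton clusters don't produce degenerate estimates. Treat this as an illustration of the kind of criterion you might define, not as the intended answer.

    import math

    VAR_FLOOR = 1e-3   # arbitrary floor so size-1 clusters don't get a zero variance

    def cluster_loglik(xs):
        """Log-likelihood of one cluster under a single univariate Gaussian."""
        n = len(xs)
        mu = sum(xs) / n
        var = max(sum((x - mu) ** 2 for x in xs) / n, VAR_FLOOR)
        return -0.5 * n * math.log(2 * math.pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

    def merge_cost(a, b):
        """Drop in total log-likelihood caused by merging clusters a and b."""
        return cluster_loglik(a) + cluster_loglik(b) - cluster_loglik(a + b)

    def model_based_hierarchical(xs):
        """Greedily merge the pair of clusters whose merge costs the least likelihood."""
        clusters = [[x] for x in xs]
        merges = []
        while len(clusters) > 1:
            i, j = min(((i, j) for i in range(len(clusters))
                                for j in range(i + 1, len(clusters))),
                       key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [clusters[i] + clusters[j]]
        return merges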
Clustering: I was also vague in lecture about the
initialization step for model-based clustering. One could do a
random initialization, as in k-means, but this tends to be
expensive, since (a) the initial clustering may be very far from
an optimal one, hence requiring many iterations to converge, and
(b) it's more likely to converge to a sub-optimal solution,
hence requiring random restarts. And the whole thing is
exacerbated if you want to do the BIC analysis, which requires
doing clustering for many different numbers of clusters and
different choices of model types (equal-volume spherical,
axis-aligned ellipsoids, etc.). Explore using the solution to
the previous problem for this, or invent a better way.
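For reference when you get to the BIC analysis, the criterion itself is just a penalized log-likelihood; a tiny sketch, using the larger-is-better sign convention (the parameter count depends on which model type you fit):

    import math

    def bic(loglik, num_params, n):
        """BIC = 2 * loglik - num_params * log(n); larger is better in this convention.

        num_params depends on the model type, e.g. k univariate clusters with a
        common variance have k means + (k - 1) mixing weights + 1 variance.
        """
        return 2 * loglik - num_params * math.log(n)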
Maximum Likelihood: Let x1, x2, ..., xn be n samples
of a normal random variable X with mean theta1 and variance
theta2. In class I showed that the maximum likelihood estimates
of theta1 and theta2 when both are unknown give a biased
estimate of theta2. What is the MLE of theta2 if theta1=mu is
assumed to be known? Is it biased?
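If you want an empirical sanity check on whatever estimator your derivation produces, a quick simulation like the following works; mu, theta2, n, and the estimator plugged in below are placeholders to swap for your own.

    import random

    mu, theta2, n, trials = 1.0, 4.0, 5, 100000   # placeholder test values
    total = 0.0
    for _ in range(trials):
        xs = [random.gauss(mu, theta2 ** 0.5) for _ in range(n)]
        # plug in your candidate estimator here; this one averages squared deviations from mu
        total += sum((x - mu) ** 2 for x in xs) / n
    print("average estimate:", total / trials, "   true theta2:", theta2)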
Maximum Likelihood: Suppose X is a random variable
uniformly distributed between 0 and theta > 0 for some unknown
theta. Based on a sample x1, x2, ..., xn of X, what is the
maximum likelihood estimator of theta?
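It can help to stare at the likelihood as a function of theta before deriving anything; here is a small sketch that evaluates it on a grid of candidate values for a simulated sample (theta_true and the grid are arbitrary).

    import random

    theta_true, n = 3.0, 10                     # arbitrary simulation settings
    xs = [random.uniform(0, theta_true) for _ in range(n)]

    def likelihood(theta):
        """Product of Uniform(0, theta) densities; zero if any observation exceeds theta."""
        if theta <= 0 or any(x > theta for x in xs):
            return 0.0
        return theta ** (-n)

    for theta in [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]:
        print(theta, likelihood(theta))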
Motif Finding: Pick 10-20 genes from one prokaryotic
organism, say E. coli, and run one of the motif finding
tools we've discussed (MEME, AlignACE, ...) on the 200
base-pair region upstream of each. Does it find anything
interesting, perhaps a TATA box? Repeat with 10-20 human genes.
For a somewhat more ambitious exercise, try Footprinter on 10-20
orthologous genes.
Phylogeny: Hand-simulate Fitch's algorithm on a small
example to calculate its parsimony score.
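If you want to check your hand simulation afterwards, here is a minimal sketch of the bottom-up pass of Fitch's algorithm for a single character; the nested-tuple tree encoding is just a convenience I made up.

    def fitch(tree):
        """Bottom-up pass of Fitch's small-parsimony algorithm for one character.

        tree is either a leaf state (a string such as "A") or a pair (left, right).
        Returns (candidate_state_set, parsimony_score).
        """
        if isinstance(tree, str):                 # leaf: state set is the observed state
            return {tree}, 0
        left, right = tree
        lset, lscore = fitch(left)
        rset, rscore = fitch(right)
        if lset & rset:                           # nonempty intersection: no new substitution
            return lset & rset, lscore + rscore
        return lset | rset, lscore + rscore + 1   # union, and charge one substitution

    # Example: a five-leaf tree with states A, C, A, G, A; the score should be 2.
    print(fitch(((("A", "C"), "A"), ("G", "A"))))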
RNA Folding: There are some good, curated databases of
RNA structures & alignments, e.g. for tRNAs and rRNAs, and
the "seed alignments" in Rfam. Pick a few of these sequences,
fold them with an RNA structure prediction program such as the
Vienna RNA package, and compare to the hand-curated structure.
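For the comparison step, one simple option is to compare the sets of base pairs encoded by the predicted and curated dot-bracket strings; here is a sketch, assuming pseudoknot-free structures of equal length (the example strings are made up).

    def base_pairs(dotbracket):
        """Set of (i, j) base pairs encoded by a pseudoknot-free dot-bracket string."""
        stack, pairs = [], set()
        for i, c in enumerate(dotbracket):
            if c == "(":
                stack.append(i)
            elif c == ")":
                pairs.add((stack.pop(), i))
        return pairs

    def compare(predicted, reference):
        """Fraction of reference pairs recovered, and fraction of predicted pairs correct."""
        p, r = base_pairs(predicted), base_pairs(reference)
        common = len(p & r)
        return common / len(r), common / len(p)

    # Example with two made-up structures of the same length:
    print(compare("(((...))).", "((.....))."))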
Linear and Quadratic Discriminant Analysis: Suppose
"positive" examples of some trait are drawn from a 2-D Gaussian
distribution with fixed mean and variance/covariance, and that
"negative" examples are drawn from another Gaussian with a
different mean, but the same variance/covariance. Consider a
single point x drawn at random from one or the other of these
distributions. A natural way to "classify" the point as to
whether it is a positive or negative example is to look at the
relative likelihoods of the point being drawn from Gaussian 1 vs
Gaussian 2. There will be points that are equally likely under
the two models. Show that the set of such points is a straight
line, and that all points on one side of the line are more
likely under one of the two models, and all points on the other
side are more likely under the other model. (This is "linear
discriminant analysis" or LDA.) Repeat this assuming the two
Gaussians do not have the same variance/covariance
matrix. In this case, the decision boundary will be quadratic
(i.e., this is QDA). Generalization to higher dimension should
be straightforward.
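A numerical sanity check on the LDA claim (not a substitute for the proof): with a shared covariance, the log-likelihood ratio below should be an affine function of x, so its zero set is a line. The means and covariance matrix are arbitrary test values.

    import numpy as np

    # Arbitrary test parameters: two 2-D Gaussians sharing one covariance matrix.
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
    sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    sigma_inv = np.linalg.inv(sigma)

    def log_likelihood_ratio(x):
        """log N(x; mu1, sigma) - log N(x; mu2, sigma); the normalizing constants cancel."""
        d1 = (x - mu1) @ sigma_inv @ (x - mu1)
        d2 = (x - mu2) @ sigma_inv @ (x - mu2)
        return 0.5 * (d2 - d1)

    # The decision boundary is where this ratio is zero; because the quadratic terms
    # cancel when the covariances match, the function is affine in x.
    for x in [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.0])]:
        print(x, log_likelihood_ratio(x))

Repeating the check with two different covariance matrices (the QDA case) shows the quadratic terms no longer cancel.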