October 18, 2001

Lecturer: Larry Ruzzo

Notes: Tobias Mann

- Literature on clustering can be traced back to Aristotle.
- Linnaeus developed a taxonomy using qualitative features.
- __Numerical Taxonomy__, by Sokal and Sneath (published 1963), stressed using numerical measurements and techniques for taxonomy.
- There was a lot of work in the 60's and 70's, less in the 80's. Microarray data and web data brought a resurgence in clustering research in the 90's.


- Clustering genes, using experimental conditions as features, can reveal groups of similar genes.
- Clustering conditions, using genes as features, can reveal distinctions correlated with gene expression (e.g., distinguishing different cancer types).
- Clustering genes and then conditions can reveal mutual correlations.


- Model-based clustering uses data to estimate model parameters. Choosing an appropriate model for a dataset is difficult. Common models include mixtures of Gaussian distributions and mixtures of t-distributions.
- Hierarchical clustering uses a distance function to agglomerate groups of genes. This variety of clustering doesn't typically have an underlying statistical model. There are many variations of hierarchical clustering.

- Start with each point in its own cluster
- Find the most similar pair of clusters, and combine them
- Repeat step 2 until finished.

This algorithm raises an immediate issue: how to determine when it should terminate. Popular choices include stopping when all of the data has been incorporated into a single tree, and stopping when there is a qualitative change in the distance between candidate clusters (for instance, when you go from merging clusters that are all close together to merging clusters that are comparatively distant).
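The three steps above can be sketched directly. This is a minimal illustration, assuming a precomputed pairwise distance matrix, single-link merging, and a fixed target number of clusters as the stopping rule; all names here are illustrative, not from the lecture.

```python
def single_link(dist, a, b):
    """Single-link distance between clusters a and b (lists of point indices)."""
    return min(dist[i][j] for i in a for j in b)

def agglomerate(dist, k):
    """Greedily merge the closest pair of clusters until only k remain."""
    clusters = [[i] for i in range(len(dist))]  # step 1: each point alone
    while len(clusters) > k:
        # step 2: find the most similar pair of clusters and combine them
        pairs = [(single_link(dist, clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters  # step 3: repeated until the stopping rule fires

# Four 1-D points; the two tight pairs end up in separate clusters.
pts = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pts] for a in pts]
print(agglomerate(dist, 2))  # -> [[0, 1], [2, 3]]
```

Note the naive all-pairs search inside the loop: this is where the O(N^2) to O(N^3) running time discussed later comes from, and it is what cleverer implementations speed up.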

Suppose you have two clusters, X (with points x_{i}, 1 ≤ i ≤ N) and Y (with points y_{j}, 1 ≤ j ≤ M). Let d(x_{i}, y_{j}) be a distance function between points in the data set. The following are ways to compute the distance between two clusters as a function of the distances between the points in the two clusters.

- **Single Link:** the minimum over all i and j of d(x_{i}, y_{j})
- **Complete Link:** the maximum over all i and j of d(x_{i}, y_{j})
- **Average Link:** the mean over all i and j of d(x_{i}, y_{j})
- **Centroid Link:** the distance function applied to the centroid of each cluster: d(mean(x_{i}), mean(y_{j}))
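The four linkage rules translate directly into code. This is a hedged sketch for two clusters of 1-D points with d(x, y) = |x - y|; the function names are mine, not standard library calls.

```python
from statistics import mean

def d(x, y):
    """Point-to-point distance (absolute difference for 1-D points)."""
    return abs(x - y)

def single_link(X, Y):    # minimum over all i, j of d(x_i, y_j)
    return min(d(x, y) for x in X for y in Y)

def complete_link(X, Y):  # maximum over all i, j of d(x_i, y_j)
    return max(d(x, y) for x in X for y in Y)

def average_link(X, Y):   # mean over all i, j of d(x_i, y_j)
    return mean(d(x, y) for x in X for y in Y)

def centroid_link(X, Y):  # d applied to the centroid of each cluster
    return d(mean(X), mean(Y))

X, Y = [0.0, 2.0], [5.0, 9.0]
print(single_link(X, Y))    # -> 3.0
print(complete_link(X, Y))  # -> 9.0
print(average_link(X, Y))   # -> 6.0
print(centroid_link(X, Y))  # -> 6.0
```

Note that average link and centroid link agree on this example but differ in general; centroid link compares only the two means, while average link averages every cross-pair distance.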


**Notes on Hierarchical Clustering Algorithm Behavior**


- Single link usually yields sprawling clusters
- Complete link usually yields compact clusters
- Other linkages fall between these extremes
- Average link and complete link are hard to distinguish
- Single link does badly in practice
- These algorithms can be unstable when points are added to the dataset
- These algorithms run in approximately O(N^{2}) to O(N^{3}) time, but there are many implementation strategies to speed up computations

Pros:

- Simple and intuitive
- Fast
- Works well on simple data
- No preconception about structure
- Deterministic

Cons:

- No model, so it is unclear what structure has been found
- Algorithms are greedy and can get trapped in local minima
