October 18, 2001
Lecturer: Larry Ruzzo
Notes: Tobias Mann
traced back to Aristotle
developed a taxonomy, using qualitative features
Taxonomy, by Sokal and Sneath, stressed using numerical measurements
and techniques for taxonomy. Published 1963.
of work in the 60ís and 70ís, less in the 80ís. Microarray data and web
data brought a new resurgence in clustering research in the 90ís.
Clustering Applied to Microarray Data
genes by using experimental conditions as features can reveal groups of
conditions by using genes as features can reveal distinctions correlated
with gene expression (i.e. like distinguishing different cancer types)
genes then features can reveal mutual correlations
Two Kinds of Clustering
based clustering uses data to estimate model parameters. Choosing an
appropriate model for a dataset is difficult. Common models include
mixture of Gaussian distributions and mixture of t-distributions.
Clustering uses a distance function to agglomerate groups of genes. This
variety of clustering doesnít typically have an underlying statistical
model. There are many variations of hierarchical clustering.
Hierarchical Clustering Basic Algorithm
- Start with each point in its own cluster
- Find the most similar pair of clusters, and combine them
- Repeat step 2 until finished.
This algorithm raises an
immediate issue, which is how to determine when the algorithm should terminate.
Popular choices include stopping when all data is incorporated into a tree, and
stopping when there is a qualitative change in the distance between candidate
clusters (for instance, when you go from merging clusters which are all close
together to merging clusters which are comparatively quite distant).
Cluster Distance Functions
you have two clusters, X (with points xi, 1<i<N), and Y (with
points yj,1<j<M). Let d(xi,yj) be a
distance function between points in the data set. The following are ways to
compute distances between two clusters as a function of the distances between
the points in the two clusters.
- Single Link: The single link
distance is the minimum over all i and j of d(xi,yj)
- Complete Link: The complete link
distance is the maximum over all i and j of d(xi,yj)
- Average Link: The average link
distance is the mean over all i and j of d(xi,yj)
- Centroid Link: The centroid link
distance is the distance function applied to the centroid of each cluster:
Notes on Hierarchical Clustering Algorithm Behavior
link usually yields sprawling clusters
link usually yields compact clusters
are in between these extremes
link and complete link are hard to distinguish
link does badly in practice
algorithms can be unstable when points are added to the dataset
algorithms run in approximately O(N2) to O(N3) time,
but there are many implementation strategies to speed up computations.
Hierarchical Algorithm Pros and Cons
well on simple data
preconception about structure
model, unclear as to what structure is found
are greedy and can get trapped in local minima
††††††††††† ††††††††††† †††††††††††