Lecture 5
Clustering
October 18, 2001
Lecturer: Larry Ruzzo
Notes: Tobias Mann
Clustering History
- Clustering literature can be traced back to Aristotle.
- Linnaeus developed a taxonomy using qualitative features.
- Numerical Taxonomy, by Sokal and Sneath (published 1963), stressed using numerical measurements and techniques for taxonomy.
- There was lots of work in the 60’s and 70’s, and less in the 80’s. Microarray data and web data brought a resurgence in clustering research in the 90’s.
Clustering Applied to Microarray Data
- Clustering genes, using experimental conditions as features, can reveal groups of similar genes.
- Clustering conditions, using genes as features, can reveal distinctions correlated with gene expression (e.g., distinguishing different cancer types); both directions are sketched below.
- Clustering genes and then conditions can reveal mutual correlations.
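To make the gene/condition duality concrete, here is a minimal sketch using SciPy’s hierarchical clustering on a hypothetical expression matrix (one row per gene, one column per condition); the random data, the cluster counts, and the choice of average linkage are illustrative assumptions, not anything from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 8))  # hypothetical: 100 genes x 8 conditions

# Clustering genes: each row (gene) is a point; conditions are its features.
gene_tree = linkage(expr, method="average")
gene_groups = fcluster(gene_tree, t=4, criterion="maxclust")

# Clustering conditions: transpose, so each column (condition) is a point
# and the genes serve as its features.
cond_tree = linkage(expr.T, method="average")
cond_groups = fcluster(cond_tree, t=2, criterion="maxclust")
```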
Two Kinds of Clustering
- Model-based clustering uses the data to estimate model parameters. Choosing an appropriate model for a dataset is difficult. Common models include mixtures of Gaussian distributions and mixtures of t-distributions (see the sketch after this list).
- Hierarchical clustering uses a distance function to agglomerate groups of genes. This variety of clustering doesn’t typically have an underlying statistical model. There are many variations of hierarchical clustering.
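As a sketch of the model-based flavor, the following fits a two-component mixture of Gaussians with scikit-learn; the synthetic data and the component count are assumptions made purely for illustration. The fitted parameters, rather than a tree, are what define the clusters, which is the key contrast with the hierarchical approach.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two hypothetical Gaussian clusters in two dimensions.
data = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(5.0, 1.0, size=(50, 2))])

# Fitting estimates the model parameters (means, covariances, weights).
gmm = GaussianMixture(n_components=2).fit(data)
labels = gmm.predict(data)  # hard cluster assignments
print(gmm.means_)           # estimated component means
```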
Hierarchical Clustering Basic Algorithm
1. Start with each point in its own cluster.
2. Find the most similar pair of clusters, and combine them.
3. Repeat step 2 until finished.
This algorithm raises an immediate issue: how do we determine when it should terminate? Popular choices include stopping when all of the data has been incorporated into a single tree, and stopping when there is a qualitative change in the distance between candidate clusters (for instance, when you go from merging clusters that are all close together to merging clusters that are comparatively quite distant).
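Here is a from-scratch sketch of that basic loop in Python, assuming Euclidean points and single-link cluster distance; the stopping rule used here (merge until k clusters remain) is just one of the termination choices mentioned above.

```python
import math

def single_link(c1, c2):
    # Cluster distance = minimum pairwise distance between points.
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    # Step 1: start with each point in its own cluster.
    clusters = [[p] for p in points]
    # Steps 2-3: repeatedly combine the most similar pair of clusters.
    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected
    return clusters

# Two obvious groups on the line merge as expected:
print(agglomerate([(0.0,), (0.1,), (5.0,), (5.1,)], k=2))
```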
Cluster Distance Functions
Suppose you have two clusters, X (with points x_i, 1 ≤ i ≤ N) and Y (with points y_j, 1 ≤ j ≤ M). Let d(x_i, y_j) be a distance function between points in the data set. The following are ways to compute the distance between two clusters as a function of the distances between the points in the two clusters; each definition is transcribed in the sketch after the list.
- Single link: the single-link distance is the minimum over all i and j of d(x_i, y_j).
- Complete link: the complete-link distance is the maximum over all i and j of d(x_i, y_j).
- Average link: the average-link distance is the mean over all i and j of d(x_i, y_j).
- Centroid link: the centroid-link distance is the distance function applied to the centroid of each cluster: d(mean(x_i), mean(y_j)).
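These four definitions translate directly into code; here is a sketch assuming Euclidean distance for d, with X and Y holding one point per row.

```python
import numpy as np

def d(x, y):
    return np.linalg.norm(x - y)  # Euclidean point distance (an assumption)

def single_link(X, Y):
    return min(d(x, y) for x in X for y in Y)        # min over all i, j

def complete_link(X, Y):
    return max(d(x, y) for x in X for y in Y)        # max over all i, j

def average_link(X, Y):
    return np.mean([d(x, y) for x in X for y in Y])  # mean over all i, j

def centroid_link(X, Y):
    return d(X.mean(axis=0), Y.mean(axis=0))         # d(mean(x_i), mean(y_j))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[4.0, 0.0], [5.0, 0.0]])
print(single_link(X, Y), complete_link(X, Y),
      average_link(X, Y), centroid_link(X, Y))       # 3.0 5.0 4.0 4.0
```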
Notes on Hierarchical Clustering Algorithm Behavior
- Single link usually yields sprawling clusters.
- Complete link usually yields compact clusters.
- Other linkages fall between these extremes.
- Average link and complete link are hard to distinguish in practice.
- Single link does badly in practice.
- These algorithms can be unstable when points are added to the dataset.
- These algorithms run in approximately O(N^2) to O(N^3) time, but there are many implementation strategies to speed up the computations.
Hierarchical Algorithm Pros and Cons
Pros:
- Simple and intuitive
- Fast
- Works well on simple data
- No preconception about structure
- Deterministic
Cons:
- No underlying model, so it is unclear what structure has been found
- The algorithms are greedy and can get trapped in local minima