Lecture 5
Clustering
October 18, 2001
Lecturer: Larry Ruzzo
Notes: Tobias Mann
Clustering History
- Clustering literature can be traced back to Aristotle.
- Linnaeus developed a taxonomy using qualitative features.
- Numerical Taxonomy, by Sokal and Sneath (published 1963), stressed using numerical measurements and techniques for taxonomy.
- There was lots of work in the 60’s and 70’s, and less in the 80’s. Microarray data and web data brought a resurgence in clustering research in the 90’s.
Clustering Applied to Microarray Data
- Clustering genes, using experimental conditions as features, can reveal groups of similar genes.
- Clustering conditions, using genes as features, can reveal distinctions correlated with gene expression (e.g., distinguishing different cancer types); both directions are sketched below.
- Clustering genes and then conditions can reveal mutual correlations.
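To make the gene/condition duality concrete, here is a minimal sketch using SciPy’s hierarchical clustering on a hypothetical expression matrix (one row per gene, one column per condition); the random data, the cluster counts, and the choice of average linkage are illustrative assumptions, not anything from the lecture.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 8))  # hypothetical: 100 genes x 8 conditions

# Clustering genes: each row (gene) is a point; conditions are its features.
gene_tree = linkage(expr, method="average")
gene_groups = fcluster(gene_tree, t=4, criterion="maxclust")

# Clustering conditions: transpose, so each column (condition) is a point
# and the genes serve as its features.
cond_tree = linkage(expr.T, method="average")
cond_groups = fcluster(cond_tree, t=2, criterion="maxclust")
```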
Two Kinds of Clustering
- Model-based clustering uses the data to estimate model parameters. Choosing an appropriate model for a dataset is difficult. Common models include mixtures of Gaussian distributions and mixtures of t-distributions (see the sketch after this list).
- Hierarchical clustering uses a distance function to agglomerate groups of genes. This variety of clustering doesn’t typically have an underlying statistical model. There are many variations of hierarchical clustering.
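As a sketch of the model-based flavor, the following fits a two-component mixture of Gaussians with scikit-learn; the synthetic data and the component count are assumptions made purely for illustration. The fitted parameters, rather than a tree, are what define the clusters, which is the key contrast with the hierarchical approach.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two hypothetical Gaussian clusters in two dimensions.
data = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                  rng.normal(5.0, 1.0, size=(50, 2))])

# Fitting estimates the model parameters (means, covariances, weights).
gmm = GaussianMixture(n_components=2).fit(data)
labels = gmm.predict(data)  # hard cluster assignments
print(gmm.means_)           # estimated component means
```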
Hierarchical Clustering Basic Algorithm
1. Start with each point in its own cluster.
2. Find the most similar pair of clusters, and combine them.
3. Repeat step 2 until finished.
This algorithm raises an immediate issue: how do we determine when it should terminate? Popular choices include stopping when all of the data has been incorporated into a single tree, and stopping when there is a qualitative change in the distance between candidate clusters (for instance, when you go from merging clusters that are all close together to merging clusters that are comparatively quite distant).
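Here is a from-scratch sketch of that basic loop in Python, assuming Euclidean points and single-link cluster distance; the stopping rule used here (merge until k clusters remain) is just one of the termination choices mentioned above.

```python
import math

def single_link(c1, c2):
    # Cluster distance = minimum pairwise distance between points.
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    # Step 1: start with each point in its own cluster.
    clusters = [[p] for p in points]
    # Steps 2-3: repeatedly combine the most similar pair of clusters.
    while len(clusters) > k:
        i, j = min(((a, b) for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                   key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i is unaffected
    return clusters

# Two obvious groups on the line merge as expected:
print(agglomerate([(0.0,), (0.1,), (5.0,), (5.1,)], k=2))
```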
Cluster Distance Functions
Suppose you have two clusters, X (with points x_i, 1 ≤ i ≤ N) and Y (with points y_j, 1 ≤ j ≤ M). Let d(x_i, y_j) be a distance function between points in the data set. The following are ways to compute the distance between two clusters as a function of the distances between the points in the two clusters; each definition is transcribed in the sketch after the list.
- Single link: the single-link distance is the minimum over all i and j of d(x_i, y_j).
- Complete link: the complete-link distance is the maximum over all i and j of d(x_i, y_j).
- Average link: the average-link distance is the mean over all i and j of d(x_i, y_j).
- Centroid link: the centroid-link distance is the distance function applied to the centroid of each cluster: d(mean(x_i), mean(y_j)).
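These four definitions translate directly into code; here is a sketch assuming Euclidean distance for d, with X and Y holding one point per row.

```python
import numpy as np

def d(x, y):
    return np.linalg.norm(x - y)  # Euclidean point distance (an assumption)

def single_link(X, Y):
    return min(d(x, y) for x in X for y in Y)        # min over all i, j

def complete_link(X, Y):
    return max(d(x, y) for x in X for y in Y)        # max over all i, j

def average_link(X, Y):
    return np.mean([d(x, y) for x in X for y in Y])  # mean over all i, j

def centroid_link(X, Y):
    return d(X.mean(axis=0), Y.mean(axis=0))         # d(mean(x_i), mean(y_j))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[4.0, 0.0], [5.0, 0.0]])
print(single_link(X, Y), complete_link(X, Y),
      average_link(X, Y), centroid_link(X, Y))       # 3.0 5.0 4.0 4.0
```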
Notes on Hierarchical Clustering Algorithm Behavior
- Single link usually yields sprawling clusters.
- Complete link usually yields compact clusters.
- Other linkages fall between these extremes.
- Average link and complete link are hard to distinguish in practice.
- Single link does badly in practice.
- These algorithms can be unstable when points are added to the dataset.
- These algorithms run in approximately O(N^2) to O(N^3) time, but there are many implementation strategies to speed up the computations.
Hierarchical Algorithm Pros and Cons
Pros:
- Simple and intuitive
- Fast
- Works well on simple data
- No preconception about structure
- Deterministic
Cons:
- No underlying model, so it is unclear what structure has been found
- The algorithms are greedy and can get trapped in local minima