# Clustering

October 18, 2001

Lecturer: Larry Ruzzo

Notes: Tobias Mann

## Clustering History

• Literature traced back to Aristotle
• Linnaeus developed a taxonomy, using qualitative features
• Numerical Taxonomy, by Sokal and Sneath, stressed using numerical measurements and techniques for taxonomy. Published 1963.
• Lots of work in the 60’s and 70’s, less in the 80’s. Microarray data and web data brought a new resurgence in clustering research in the 90’s.

## Clustering Applied to Microarray Data

• Clustering genes by using experimental conditions as features can reveal groups of similar genes
• Clustering conditions by using genes as features can reveal distinctions correlated with gene expression (e.g., distinguishing different cancer types)
• Clustering genes and then conditions can reveal mutual correlations

## Two Kinds of Clustering

• Model based clustering uses data to estimate model parameters. Choosing an appropriate model for a dataset is difficult. Common models include mixture of Gaussian distributions and mixture of t-distributions.
• Hierarchical Clustering uses a distance function to agglomerate groups of genes. This variety of clustering doesn’t typically have an underlying statistical model. There are many variations of hierarchical clustering.

## Hierarchical Clustering Basic Algorithm

1. Start with each data point in its own cluster.
2. Find the most similar pair of clusters, and combine them.
3. Repeat step 2 until finished.

This algorithm raises an immediate issue, which is how to determine when the algorithm should terminate. Popular choices include stopping when all data is incorporated into a tree, and stopping when there is a qualitative change in the distance between candidate clusters (for instance, when you go from merging clusters which are all close together to merging clusters which are comparatively quite distant).
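The steps above can be sketched in a few lines of Python. This is an illustrative, unoptimized sketch (the function names and the choice of single-link distance are assumptions for the example, and it stops when a target number of clusters remains rather than using a qualitative stopping rule):

```python
import math

def point_dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(cx, cy):
    """Single-link distance: minimum pairwise distance between clusters."""
    return min(point_dist(p, q) for p in cx for q in cy)

def agglomerate(points, k):
    """Merge the closest pair of clusters until k clusters remain."""
    clusters = [[p] for p in points]   # step 1: one cluster per point
    while len(clusters) > k:           # termination: k clusters left
        # step 2: find the most similar (closest) pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] += clusters.pop(j)  # combine them
    return clusters

# Two well-separated groups collapse into two clusters.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
print(agglomerate(pts, 2))
```

The naive search over all cluster pairs at each step is what gives the O(N³)-ish behavior mentioned below; keeping a priority queue of pairwise distances is one common speedup.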

## Cluster Distance Functions

Suppose you have two clusters, X (with points x_i, 1 ≤ i ≤ N) and Y (with points y_j, 1 ≤ j ≤ M). Let d(x_i, y_j) be a distance function between points in the data set. The following are ways to compute the distance between two clusters as a function of the distances between the points in the two clusters.

• Single Link: The single link distance is the minimum over all i and j of d(x_i, y_j)
• Complete Link: The complete link distance is the maximum over all i and j of d(x_i, y_j)
• Average Link: The average link distance is the mean over all i and j of d(x_i, y_j)
• Centroid Link: The centroid link distance is the distance function applied to the two cluster centroids: d(mean(X), mean(Y)), where mean(X) is the mean of the x_i and mean(Y) the mean of the y_j.
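The four linkage definitions translate directly into code. A minimal sketch, using Euclidean distance as the point-level metric d (any metric would do; the function names here are illustrative):

```python
import math

def d(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(X, Y):
    """Minimum over all pairwise distances."""
    return min(d(x, y) for x in X for y in Y)

def complete_link(X, Y):
    """Maximum over all pairwise distances."""
    return max(d(x, y) for x in X for y in Y)

def average_link(X, Y):
    """Mean over all pairwise distances."""
    return sum(d(x, y) for x in X for y in Y) / (len(X) * len(Y))

def centroid(C):
    """Coordinate-wise mean of a cluster's points."""
    return tuple(sum(coords) / len(C) for coords in zip(*C))

def centroid_link(X, Y):
    """Distance between the two cluster centroids."""
    return d(centroid(X), centroid(Y))

X = [(0.0, 0.0), (0.0, 2.0)]
Y = [(3.0, 0.0), (3.0, 2.0)]
print(single_link(X, Y))    # 3.0: the closest cross-cluster pair
print(complete_link(X, Y))  # sqrt(13): the farthest cross-cluster pair
```

On this example the average and centroid linkages fall between the single- and complete-link values, matching the "in between these extremes" behavior noted below.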

## Notes on Hierarchical Clustering Algorithm Behavior

• Single link usually yields sprawling clusters
• Complete link usually yields compact clusters
• Others are in between these extremes
• These algorithms can be unstable when points are added to the dataset
• These algorithms run in roughly O(N²) to O(N³) time, depending on the linkage and implementation, but there are many implementation strategies to speed up the computation.

## Hierarchical Algorithm Pros and Cons

Pros:

• Simple and intuitive
• Fast
• Works well on simple data
• Deterministic

Cons:

• No underlying model, so it is unclear what structure is actually being found
• The algorithms are greedy and can get trapped in local minima