# Lecture 8

October 30, 2001
Lecturer: Larry Ruzzo
Notes: Peter Couperus

# More Clustering

## Finding Cliques from Last Time

Recall from last time, we wanted to find a solution to the following problem. Given a graph $G$, the adjacency of each pair of vertices in $G$ is toggled with some probability to give us a new graph $G'$. Our only knowledge is the structure of $G'$. The problem is to reconstruct with some "certainty" where the cliques (maximal complete subgraphs) of the original graph were. Ideally, one would look for principal submatrices (submatrices formed by choosing the same subset of rows and columns, so they are symmetric about the diagonal) of the adjacency matrix for $G'$ which are "dense in 1s" and "sparse in 0s". If the vertices happen to be ordered so the cliques have vertex sets which are contiguous with respect to this ordering, then this problem is "fairly easy": we need only look at square blocks around the diagonal, and see when these are "dense in 1s" and "sparse in 0s". If not (which is more likely the case), we would need to examine all principal submatrices, which is equivalent to looking at the square blocks around the diagonal for each possible permutation of the vertex ordering. Since there are lots of permutations, this isn't feasible in practice.
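As an illustration, the "easy" contiguous case amounts to scanning square blocks along the diagonal of the adjacency matrix. A minimal sketch (the block size, density threshold, and function name are illustrative assumptions, not from lecture):

```python
def dense_diagonal_blocks(adj, size, threshold=0.8):
    """Scan square blocks along the diagonal of an adjacency matrix
    (a list of lists of 0/1) and report those "dense in 1s"."""
    n = len(adj)
    hits = []
    for start in range(n - size + 1):
        # count 1s among the off-diagonal entries of the block
        ones = sum(adj[i][j]
                   for i in range(start, start + size)
                   for j in range(start, start + size) if i != j)
        density = ones / (size * (size - 1))
        if density >= threshold:
            hits.append((start, density))
    return hits

# Toy example: vertices 0-3 formed a clique, but one edge was
# "toggled" away by noise; vertices 4-6 are sparsely connected.
adj = [[0] * 7 for _ in range(7)]
for i in range(4):
    for j in range(4):
        if i != j:
            adj[i][j] = 1
adj[1][3] = adj[3][1] = 0  # the toggled edge
print(dense_diagonal_blocks(adj, size=4))  # the noisy clique still shows up
```

Handling an arbitrary vertex ordering would require re-running this scan over permutations of the rows and columns, which is exactly the infeasible part.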

## Other Clustering Heuristics

Recall from last time, we discussed one non-hierarchical clustering method, the "core-based" method. In this class, we discuss two additional (unstructured) clustering heuristics, $k$-means and Self-Organizing Maps. We also start to discuss Model Based Clustering.

### K-Means

The basic idea for $k$-means is to begin with our data points segregated into $k$ clusters, and through each pass "migrate" points to more appropriate clusters.
• $k$-means Clustering
• Input: A collection of $n$ points in $d$-dimensional space, and $k$ ($k$ is the number of clusters desired).
• Output: A set of $k$ clusters.
1. Initialize. Can be done in a variety of ways. Can be done naively by randomly distributing the points among $k$ sets.
2. For each cluster $C_i$, calculate the centroid for $C_i$, meaning, calculate the average of each of the coordinates among the points in $C_i$.
3. With the centroids calculated from (2), for each point, decide which centroid it is closest to, and reassign the point to the cluster corresponding to this centroid.
4. Repeat from (2) until some iteration count is reached, or until some sort of convergence is reached.
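The steps above can be sketched as follows. This is a minimal sketch, with two assumptions among the variants mentioned: initialization picks $k$ random data points as the starting centroids (rather than randomly partitioning the points), and "convergence" means no centroid moved on the last pass.

```python
import random

def k_means(points, k, iterations=100):
    """Minimal k-means sketch; points is a list of d-dimensional tuples."""
    d = len(points[0])
    # 1. Initialize: pick k distinct data points as starting centroids
    #    (one of several reasonable initializations).
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # 3. Assign each point to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((p[j] - centroids[i][j]) ** 2
                                      for j in range(d)))
            clusters[i].append(p)
        # 2. Recompute each centroid as the coordinate-wise average;
        #    a cluster that lost all its points keeps its old centroid.
        new_centroids = [
            tuple(sum(p[j] for p in c) / len(c) for j in range(d))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # 4. Stop when no centroid moved (or the iteration cap is hit).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, cents = k_means(pts, 2)
```

On well-separated data like this, the heuristic recovers the two groups regardless of which points the initialization happens to pick; on messier data, different random initializations can give different final partitions, as noted below.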
Some comments about the $k$-means heuristic:
• After the first pass, the data points are fairly well aggregated about the different centroids.
• The resulting clusters may be heavily dependent on the initialization step. Namely, if a different collection of random points is chosen at the beginning, we may end up with a different clustering partition at the end.
• In effect, this heuristic strives to minimize the total intra-cluster distance.
Meaning, minimize $\sum_i d(x_i, c(x_i))$ ($k$ a constant), where $d(x_i, c(x_i))$ is the distance from $x_i$ to the centroid of the cluster it is in ($c(x_i)$ is the centroid of the cluster containing data point $x_i$). In essence, this is intuitively clear: we'd want to be able to move the centroids as close as possible to the data points.

• This solution is not hierarchical. It is more of a global optimization problem: minimizing the sum above over all possible collections of $k$ clusters. Since there are scads of collections of $k$ clusters (and minimizing sums of this type may be NP-hard?), we can attempt to get as close as our patience allows by setting the iteration limit.
• The clusters we get depend on $k$; specifically, as $k$ increases, the diameter of the clusters decreases. So, when $k$ reaches the number of data points, we can put each data point in its own cluster, so the above sum is 0.
• It is possible that we end up with fewer than $k$ clusters by the time termination occurs. Perhaps, during one pass, for each data point in a specific cluster, there is a different centroid which that point is closer to. Then, after this pass, the cluster that originally held these points will no longer have any points in it.

### Self-Organizing Maps

The other clustering method described is a bit more technical. It doesn't seem as much a "clustering" problem as a "find the best centroids" problem. Comments were made that in light of this, seemingly "similar" data were forced into different categories for $k$ large enough. The basic setup for the self-organizing map heuristic is pretty flexible, though.
• Self-Organizing Maps
• Input: Functions $\alpha$ and $h$, an integer $k$, a collection of data points, and some underlying topology (with $k$ points).
• Output: $k$ centroids. Specifically, a centroid corresponding to each point in the topology.
A couple of comments/requirements on the setup:

• $\alpha(t)$ represents (roughly) how much a centroid is allowed to be "moved" on pass $t$.
• $h$ represents (roughly) how much we want to move centroids "connected in the topology" on pass $t$.
• In lecture, the underlying topology used for illustrative purposes was a grid.
With all of that in mind, the following describes the SOM process:
1. Set $t = 0$. Initialize. Again, there are many ways to do this. It could be done naively by picking $k$ random data points as the centroids and putting them at "appropriate" places in the topological structure (where more arguments could be made as to how good a way this is to do this step).
2. For each data point $x$ (currently in cluster $C_j$ with centroid $c_j$), find the closest centroid $c_k$ to $x$, and assign $x$ to the corresponding cluster $C_k$. Update the centroid (old_center to new_center) for each cluster as follows:
If $x$ was in $C_k$ originally (that is, $j = k$), leave the centroid alone. Otherwise, new_center = old_center + $\alpha(t)\,h(d)\,$($x$ − old_center), where $d$ = the topological distance from cluster $C_j$ to cluster $C_k$ (on a 2D grid, this was the taxi-cab distance between the two points on the grid).
3. Increment $t$. If $t$ is "large enough", stop. Otherwise goto (2).
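The SOM process above can be sketched on a 2D grid topology. This is a sketch under assumptions: the particular forms of $\alpha(t)$ (a shrinking step size) and $h(d)$ (decaying with taxi-cab distance) are made up here, and every centroid near the winner is nudged toward the data point, which is one of the common variants rather than exactly the lecture's rule.

```python
import random

def som(points, grid_w, grid_h, passes=50):
    """Self-organizing map sketch on a grid_w x grid_h grid topology.
    The alpha(t) and h(d) below are assumed forms, not lecture's."""
    d = len(points[0])
    # Initialize each grid node's centroid to a random data point.
    nodes = {(i, j): list(random.choice(points))
             for i in range(grid_w) for j in range(grid_h)}
    for t in range(passes):
        alpha = 0.5 * (1 - t / passes)  # step size shrinks over passes
        for p in points:
            # Find the winning node: the one with the closest centroid.
            win = min(nodes, key=lambda n: sum((p[c] - nodes[n][c]) ** 2
                                               for c in range(d)))
            for n, center in nodes.items():
                # Taxi-cab distance in the grid topology.
                dist = abs(n[0] - win[0]) + abs(n[1] - win[1])
                h = 1.0 / (1 + dist)  # topologically nearby nodes move more
                for c in range(d):
                    center[c] += alpha * h * (p[c] - center[c])
    return nodes
```

Because each update is a convex combination of the old centroid and the data point, the centroids stay inside the bounding box of the data; the grid structure drags topologically adjacent centroids toward each other, which is exactly the premature-conglomeration concern raised below.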
There are many variants on this heuristic. There is a lot of flexibility in the initialization stage, in the underlying topology we use, the number of clusters we desire, and the functions $\alpha$ and $h$. Comments were made that the specific choices of $\alpha$ and $h$ didn't seem crucial. Suggestions were made that perhaps, instead of restricting the topology from the outset of the algorithm, one could build a directed graph based on the centroids chosen in the initialization stage, and then impose whatever topological constraints you want on this digraph instead (planarity, $k$-regularity, etc.). The argument for this idea included: algorithms for these classes of digraphs have been more thoroughly researched, and by imposing the topology before we actually have the data, we may be prematurely forcing conglomeration or segmentation of clusters.

### Model Based Clustering

The last type of clustering discussed (and which will be gone into in more depth in the future) was model based clustering. Where all of the previous clustering methods discussed made no stringent assumptions about the input data, model based clustering makes some sort of assumption. Specifically, the input data may be assumed to have some well-known statistical/mathematical distribution, be it normal, binomial, Poisson, or whatever. Further, it may be known ahead of time that the data arose from different stochastic (random) processes, each with some hypothesized distribution. Then, under these assumptions, the goal is to determine with some probabilistic certainty the mean, median, variance, and whatever other statistical parameters are desired for each of the processes. Then, with these parameters "determined", given some data point, we wish to determine the probability that it arose from each of the processes. This gives us a form of Soft Clustering, where data points are not partitioned into sets, but instead associated with the different processes with some probability. This is in contrast to the Hard Clustering performed by all of the methods described thus far, where the data points are partitioned into some collection of sets. The method that will be used (and discussed) is the EM algorithm, which performs much the same as $k$-means under the assumption that the processes have spherical symmetry and constant variance.
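The "soft" assignment can be illustrated for one-dimensional normal processes. Assuming the parameters of each process have already been "determined", a point's cluster memberships are the posterior probabilities that it arose from each process (all parameter values below are made up for illustration):

```python
import math

def responsibilities(x, means, sds, priors):
    """Soft clustering: probability that point x arose from each of
    several 1-D normal processes with the given (hypothetical)
    means, standard deviations, and prior mixing weights."""
    def density(x, mu, sd):
        # normal probability density at x
        return math.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))
    # Bayes' rule: weight each process's density by its prior, normalize.
    weighted = [pi * density(x, mu, sd)
                for mu, sd, pi in zip(means, sds, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# A point halfway between two equal-variance, equal-prior processes
# belongs to each with probability 1/2; a point nearer one process
# is associated mostly, but not entirely, with it.
print(responsibilities(5.0, means=[0.0, 10.0], sds=[2.0, 2.0], priors=[0.5, 0.5]))
print(responsibilities(2.0, means=[0.0, 10.0], sds=[2.0, 2.0], priors=[0.5, 0.5]))
```

A hard-clustering method would have to put the halfway point entirely in one set; the soft memberships above are exactly what the E-step of the EM algorithm computes once the parameters are fixed.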