Lecture 8
October 30, 2001
Lecturer: Larry Ruzzo
Notes: Peter Couperus
Recall from last time, we wanted to find a solution to the
following problem. Given a graph G, the adjacency of
each pair of vertices in G is toggled with some probability p
to give us a new graph G'. Our only knowledge
is the structure of G'.
The problem is to reconstruct with ``some
certainty'' where the cliques (maximal complete subgraphs) of the
original graph G were. Ideally, one would look for
principal submatrices
(submatrices formed by choosing the same subset of rows and columns,
so they are symmetric about the diagonal)
of the adjacency matrix for G' which are ``dense in 1s'' and
``sparse in 0s''. If the vertices happen to be ordered so the
cliques have vertex sets which are contiguous with respect to this
ordering, then this problem is ``fairly easy'': we need only look
at square blocks around the diagonal, and see when these are
``dense in 1s'' and ``sparse in 0s''. If not (which
is more likely the case), we would need to examine all principal
submatrices, which is equivalent to looking at the square blocks
around the diagonal for each possible permutation of the vertex ordering.
Since there are n! orderings of the n vertices, this isn't feasible in practice.
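As a small illustration of the ``easy'' case, a Python sketch that scans
the square blocks around the diagonal of a 0/1 adjacency matrix and
reports the ones that are dense in 1s might look like this (the block
size b and the density threshold are hypothetical parameters, not from
the lecture):

    import numpy as np

    def dense_diagonal_blocks(A, b, threshold=0.8):
        """Return starting indices of b-by-b diagonal blocks of the
        0/1 adjacency matrix A whose fraction of 1s is >= threshold."""
        n = A.shape[0]
        hits = []
        for i in range(n - b + 1):
            block = A[i:i+b, i:i+b]
            # ``dense in 1s'' automatically means ``sparse in 0s''
            if block.mean() >= threshold:
                hits.append(i)
        return hits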
Recall from last time, we discussed one non-hierarchical clustering
method, the ``core-based'' method. In this class, we discuss two
additional (unstructured) clustering heuristics, k-means and
Self-Organizing Maps. We also start to discuss
Model Based Clustering.
The basic idea for k-means is to begin with our data points segregated
into k clusters, and through each pass ``migrate'' points to more
appropriate clusters.
- k-means Clustering
- Input: A collection of n points in d-dimensional space,
and an integer k (k is the
number of clusters desired).
- Output: A set of k clusters.
1. Initialize. Can be done in a variety of ways. Can be
done naively by randomly distributing the points among k sets.
2. For each cluster C_j, calculate the centroid for C_j,
meaning, calculate the average of each of the coordinates among
the points in C_j.
3. With the centroids calculated from (2), for each point, decide
which centroid it is closest to, and reassign the point to the cluster
corresponding to this centroid.
4. Repeat from (2) until some iteration count is reached, or until
the assignments converge.
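To make the loop concrete, here is a minimal Python sketch of the steps
above (the random per-point initialization, Euclidean distance, and the
empty-cluster fallback are choices of this sketch, not specified in the
lecture):

    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        """Minimal k-means: points is an (n, d) array, k the number
        of clusters. Returns final labels and centroids."""
        rng = np.random.default_rng(seed)
        # Step 1: initialize naively, assigning points to clusters at random.
        labels = rng.integers(k, size=len(points))
        for _ in range(iters):
            # Step 2: centroid of each cluster = coordinate-wise average.
            # (If a cluster empties out, re-seed it with a random point.)
            centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j)
                                  else points[rng.integers(len(points))]
                                  for j in range(k)])
            # Step 3: reassign each point to its closest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                                   axis=2)
            new_labels = dists.argmin(axis=1)
            # Step 4: stop once the assignments converge.
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return labels, centroids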
Some comments about the k-means heuristic: the result can depend heavily
on the initialization, the number of clusters k must be fixed in advance,
and the iteration is only guaranteed to settle into a locally (not
globally) best clustering.
The other clustering method described is a bit more technical.
It doesn't seem as much like a ``clustering'' problem as it
seems like a ``find the best centroids'' problem. Comments were
made that in light of this, seemingly ``similar'' data were forced
into different categories for k large enough. The basic setup
for the self-organizing map heuristic is pretty flexible though.
- Self-Organizing Maps
- Input: Functions alpha(t) and tau(d), an integer k,
a collection of data points, and some underlying topology (with k points).
- Output: k centroids. Specifically, a centroid corresponding
to each point in the topology.
A couple of comments/requirements on the setup: alpha(t) should take
values between 0 and 1 and decrease as the iteration count t grows, and
tau(d) should decrease as the topological distance d grows, so that
updates shrink over time and cells near the winner are pulled more
strongly than distant ones.
With all of that in mind, the following describes the SOM process:
1. Set t = 0.
Initialize. Again, many ways to do this. Could be done naively
by picking k random data points as the centroids and putting them
at ``appropriate'' places in the topological structure (where, again,
arguments could be made as to good ways to do this step).
2. For each data point p (currently in cluster C_i with centroid
c_i),
find the closest centroid c_j to p,
and assign p to the corresponding cluster C_j.
Update the centroid (old_center to new_center) for each cluster as follows:
If p was in C_i originally, and i = j (the assignment is unchanged),
leave the centroids alone.
Otherwise,
new_center = old_center + alpha(t) * tau(d) * (p - old_center), where
d = the topological distance from the cluster being updated to the
winning cluster C_j
(on a 2D grid, this was the taxi-cab distance between the
two points on the grid).
3. Increment t. If t is ``large enough'', stop. Otherwise go to (2).
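The following is a minimal Python sketch of one SOM variant, using a
1-D line of k cells as the topology (so the taxi-cab distance between
cells i and j is just |i - j|). The particular alpha and tau below are
illustrative choices, and, unlike the lecture's rule, this sketch simply
updates every centroid on every point, weighted by topological distance
to the winner:

    import numpy as np

    def som(points, k, iters=50, seed=0):
        """Minimal SOM on a 1-D line of k cells."""
        rng = np.random.default_rng(seed)
        # Initialize naively: k random data points become the centroids.
        centers = points[rng.choice(len(points), size=k,
                                    replace=False)].astype(float)
        alpha = lambda t: 0.5 * (1.0 - t / iters)  # learning rate, decreasing in t
        tau = lambda d: np.exp(-d)                 # neighborhood weight, decreasing in d
        for t in range(iters):
            for p in points:
                # Find the winning cell: the closest centroid c_j to p.
                j = int(np.linalg.norm(centers - p, axis=1).argmin())
                for i in range(k):
                    d = abs(i - j)                 # topological distance on the line
                    centers[i] += alpha(t) * tau(d) * (p - centers[i])
        return centers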
There are many variants on this heuristic. There is a lot of flexibility
in the initialization stage, in the underlying topology we use,
the number of clusters we desire, and the functions alpha and tau.
Comments were made that the specific choice of alpha and tau
didn't seem crucial. Suggestions were made that instead of
restricting the topology from the outset of the algorithm, one could
build a directed graph based on the centroids chosen in
the initialization stage, and then impose whatever topological constraints
you want on this digraph instead (planarity, regularity of a fixed
degree, etc.).
The argument for this idea included:
algorithms for these classes of digraphs have been more thoroughly researched,
and by imposing the topology before we actually have the data, we may be
prematurely forcing conglomeration or segmentation of clusters.
The last type of clustering discussed (and which will be gone into more
in depth in the future) was model based clustering. Where all of the
previous clustering methods discussed made no stringent assumptions
on the input data, model based clustering will make some sort of assumptions.
Specifically, the input data may be assumed to have some well known
statistical/mathematical distribution, be it normal, binomial, Poisson,
or whatever. Further, it may be known ahead of time that the data arose
from k different stochastic (random) processes, each with some hypothesized
distribution.
Then, under these assumptions, the goal is to determine
with some probabilistic certainty the mean, median, variance, and
whatever other statistical parameters are desired for each of the k
processes. Then, with these parameters ``determined'', given
some data point, we wish to determine the probability it arose
from each of the processes. This will give us a form of Soft
Clustering, where data points are not partitioned into sets, but
instead associated with different processes with some probability.
This is in contrast to the Hard Clustering performed by all
of the methods described thus far: the data points are partitioned
into some collection of sets.
The method that will be used (and discussed) is the EM-algorithm,
which will perform much the same as k-means under the assumption that
the processes have spherical symmetry and constant variance.
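As a preview, here is a minimal Python sketch of EM in exactly that
setting, a mixture of k spherical Gaussians sharing one fixed variance
(the function name, the fixed sigma, and the random initialization are
choices of this sketch, not details from the lecture); the
``responsibilities'' it returns are the soft-clustering membership
probabilities described above:

    import numpy as np

    def em_spherical(points, k, sigma=1.0, iters=50, seed=0):
        """Minimal EM for k spherical Gaussians with shared, fixed
        variance sigma^2 -- the setting where EM behaves much like
        k-means. Returns soft memberships and the fitted means."""
        rng = np.random.default_rng(seed)
        means = points[rng.choice(len(points), size=k,
                                  replace=False)].astype(float)
        weights = np.full(k, 1.0 / k)            # mixing proportions
        for _ in range(iters):
            # E-step: probability each point arose from each process.
            sq = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            resp = weights * np.exp(-sq / (2 * sigma ** 2))
            resp /= resp.sum(axis=1, keepdims=True)  # normalize per point
            # M-step: each mean becomes a responsibility-weighted average.
            means = ((resp[:, :, None] * points[:, None, :]).sum(axis=0)
                     / resp.sum(axis=0)[:, None])
            weights = resp.mean(axis=0)
        return resp, means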