CSE 527
Lecture 7
Non-Hierarchical Clustering

Lecturer: Larry Ruzzo
Notes: Daniel Grossman (grossman@cs)

October 25, 2001

Brief Discussion of Graph Theory

A graph is defined as a collection of $n$ vertices (or nodes) and $m$ edges connecting them. Formally, $G = (V,E)$, where $E$ is a set of unordered pairs from $V$. Figure 1 shows an example graph: edge $\{1,2\} \in E$ but edge $\{1,3\} \not\in E$.

Figure 1: Example Graph

There are multiple ways of representing a graph in computer memory, from which we select the adjacency matrix for our discussion. In an $n \times n$ adjacency matrix $A$, $A_{ij} = 1$ if edge $\{i,j\}$ exists in $G$ and $A_{ij} = 0$ otherwise. The main diagonal is sometimes populated with $0$s and sometimes with $1$s; it makes no difference as long as the graph has no self-loops (unit-length cycles). Here is the adjacency matrix corresponding to our example graph (notice that it is symmetric by definition):

A = \left[ \begin{array}{c|ccccc}
    & 1 & 2 & 3 & 4 & 5\\ \hline
  1 & 1 & 1 & 0 & 0 & 0\\
  2 & 1 & 1 & 1 & 0 & 0\\
  3 & 0 & 1 & 1 & 1 & 1\\
  4 & 0 & 0 & 1 & 1 & 1\\
  5 & 0 & 0 & 1 & 1 & 1
\end{array} \right]


The clique is an important concept in graph theory: a set of vertices every two of which are adjacent (a graph whose entire vertex set forms a clique is called a complete graph). In computational biology we use cliques as a method of abstracting pairwise relationships such as protein-protein interaction or gene similarity. In the latter case, we might want to establish an edge between the vertices representing two genes if those two genes have, say, similar expression profiles over several timepoints of an experiment. A clique in such a graph would be a set of genes that are all pairwise similar, a natural candidate for a cluster. A goal might be to find the largest clique in the graph.

Finding $k$-Cliques

A useful question to ask, then, is: does the graph contain a $k$-clique? That is, is there a set of at least $k$ vertices that are completely connected to each other? Here is a naive algorithm for solving this problem:

Examine all subsets of size $k$ and determine if any is a clique.

From a logistical standpoint, the computer has an easy time checking whether a given vertex subset is a clique: in the adjacency matrix, a clique appears simply as a submatrix containing only $1$s. The problem with the algorithm is that the number of subsets of size $k$ in the graph is

{n \choose k} = \frac{n!}{k!(n-k)!} > \left(\frac{n}{k}\right)^k

which is far too many to check one by one. For example, in a graph of 100 vertices, if we would like to search for 10-cliques, we have

{100 \choose 10} > \left(\frac{100}{10}\right)^{10} = 10^{10}

subgraphs to examine! Considering that typical gene expression datasets contain thousands of genes (nodes) and that investigators are interested in cliques of several dozen, this algorithm would require an impractical amount of time.
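For concreteness, the brute-force algorithm can be sketched as follows. This is a deliberately naive implementation, usable only for tiny graphs:

```python
from itertools import combinations

def has_k_clique(adj, k):
    """Brute force: examine every size-k vertex subset and check whether
    all pairs within it are connected (adj is a symmetric 0/1 matrix)."""
    n = len(adj)
    for subset in combinations(range(n), k):
        # A subset is a clique iff every pair within it has an edge.
        if all(adj[u][v] == 1 for u, v in combinations(subset, 2)):
            return True
    return False

# Example with cliques {0,1} and {2,3,4}:
adj = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
print(has_k_clique(adj, 3))  # True: {2,3,4}
print(has_k_clique(adj, 4))  # False
```

The outer loop visits all ${n \choose k}$ subsets, which is exactly the blow-up computed above.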

Here is a more specific example of the $k$-clique problem: Given graph $G$ of size $n$, is there a clique of size $n/20$ (i.e. $5\%$ of the graph)? Since the clique size is a function of the problem input size, our naive algorithm would require more than $20^{(n/20)}$ subgraph examinations, meaning that the amount of work we would have to do is exponential in the size of the input.

In fact, no known algorithm solves this problem in less than exponential time. The problem is NP-complete, meaning that it lies in a category of problems for which no polynomial-time algorithms (algorithms requiring only an amount of work polynomially related to the input size) are known. If any problem in this category were shown to be solvable by a polynomial-time (P-time) algorithm, then that algorithm could be adapted to solve all the rest of the NP-complete problems. Since computer scientists have been searching unsuccessfully for such P-time algorithms for several decades, we must take a different approach to answer our biological clustering questions.

Solving Clustering Problems

While the general theoretical problem of identifying cliques may be intractable, we can solve a simpler problem. Suppose the true state of whatever similarity/interaction concept we are trying to discover is represented by a clique graph $H$, i.e. a union of disjoint cliques (see Figure 2). Define $G$, the graph of data that we actually measure, to be the true graph $H$ plus noise: each edge and non-edge of $H$ is flipped (a present edge deleted, or an absent edge added) independently with probability $\alpha$, for some constant $\alpha < 1/2$.

Figure 2: Clique Graph

Given the true graph $H$ it would be trivial to find the cliques in our data. Unfortunately, our experimental methods measure $G$. So, can we estimate $H$ given $G$? Is there an algorithm to reconstruct $H$? In general, we've already shown the answer to be no. A happy medium that is often acceptable in practice is constructing a clique graph $H'$ such that

\left\vert H' \oplus G\right\vert \le \left\vert H \oplus G\right\vert

with probability $\ge 1 -\delta$. That is, with high probability we can construct a clique graph $H'$ that disagrees with the observed graph $G$ on no more edges than the true clique graph $H$ does. Most importantly, we can do this partial reconstruction of the true graph $H$ in polynomial time.


Let us consider a simple example problem. We have the following adjacency matrix $A$ for the true graph $H$:

A = \left[ \begin{array}{ccccc}
  1 & 1 & 0 & 0 & 0\\
  1 & 1 & 0 & 0 & 0\\
  0 & 0 & 1 & 1 & 1\\
  0 & 0 & 1 & 1 & 1\\
  0 & 0 & 1 & 1 & 1
\end{array} \right]

The parameters of our algorithm are the noise rate $\alpha$, the failure probability $\delta$, and a lower bound $\gamma$ on the size of any clique in $H$ as a fraction of $n$. Because $\alpha < 1/2$, we expect to see more ones than zeroes in the clique regions of $G$'s adjacency matrix, and vice versa in the non-clique regions. Of course, we would never expect to be able to recover cliques as small as those in $A$ above, which is why we impose the $\gamma$ limit: we do not stipulate that our algorithm must be able to find very small cliques, which could easily be corrupted in more than half their entries simply by chance.
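The noise model is easy to simulate. Here is a minimal sketch (the function name and the fixed seed are our own conventions):

```python
import random

def add_noise(H, alpha, seed=0):
    """Simulate the noise model: flip each off-diagonal entry of the true
    clique graph's adjacency matrix independently with probability alpha,
    keeping the result symmetric."""
    rng = random.Random(seed)
    n = len(H)
    G = [row[:] for row in H]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < alpha:
                G[i][j] = G[j][i] = 1 - H[i][j]  # flip the symmetric pair
    return G

# True graph with cliques {0,1} and {2,3,4}, corrupted with alpha = 0.25:
H = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
G = add_noise(H, 0.25)
```

With $\alpha$ well below $1/2$, $G$ still mostly agrees with $H$, which is what the reconstruction guarantee exploits.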


Before we continue on to a discussion of our algorithm for this simpler problem, we pause to discuss why we bother to work on this problem at all. Would we realistically expect to find easy-to-resolve clique graphs in the real world, even in the absence of noise? The answer, of course, is that in biological problems, the relationships we are trying to learn are so complex that such clean clique graphs would indeed not exist. However, solving our simpler problem will help us find heuristics, answers, and insights for solving the real-life systems, as discussed in Ben-Dor, Shamir & Yakhini [JCB '99], the paper in which the current algorithm was presented.

The bottom line of the ``simpler problem'' discussion is that Ben-Dor et al. arrive at an $O(n^2 (\log n)^l)$ algorithm for any fixed choice of $\alpha$, $\delta$, and $\gamma$, where the constant $l$ depends heavily on those fixed values. For example, for $\gamma = 0.1$ and $\alpha = 0.25$, $l$ turns out to be 600. So, while Ben-Dor's algorithm is polynomial in the size of the input, it is impractical. However, the theorem governing the validity of the algorithm's output is correct. Moreover, time bounds in theoretical papers are often conservative, meaning that the algorithm could perhaps work in practice for some non-trivial problems.

Brief Look at the Algorithm's Inner Workings

Omitting several other components of the approach, we focus on the central insight of the Ben-Dor algorithm: the core. Suppose $H$ has only two cliques, one containing 90% of the vertices, and the other, 10%. We define a core as a set of vertices all located in one clique. Now consider some vertex $x$ not in the core. What do we expect of $x$? If it is in the same clique as the core, then (in the noisy graph $G$) the expected fraction of edges between $x$ and the core is $1 - \alpha$. Conversely, if $x$ is not in the same clique as the core, then the expected fraction of edges connecting the two is $\alpha$. See Figure 3.

Figure 3: Core Graph

So, given a core, we classify each vertex $x$ based on what fraction of its possible edges to the core actually exists. If more than $1/2$ of $x$'s possible edges to the core are present in $G$, we decide that $x$ is in the same clique; if fewer than $1/2$ are present, we conclude that the vertex lies outside the clique. (Note that if our core is small to begin with, this shared-edges test does not have much statistical power.)
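The classification rule is simple to state in code. A minimal sketch, assuming a 0/1 adjacency matrix for the (possibly noisy) graph:

```python
def same_clique_as_core(adj, core, x):
    """Classify vertex x against a core (a set of vertices known to lie in
    one clique): x is judged to belong to the core's clique exactly when
    more than half of its possible edges to the core are present."""
    present = sum(adj[x][c] for c in core)
    return present / len(core) > 0.5

# Noise-free illustration with cliques {0,1} and {2,3,4}:
adj = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
print(same_clique_as_core(adj, [2, 3], 4))  # True
print(same_clique_as_core(adj, [2, 3], 0))  # False
```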

Suppose we start with a core of $c \log n$ vertices, and let $\alpha = 0.1$. What is the chance of misclassifying a vertex? A vertex is misclassified only if at least half of its $c \log n$ possible edges to the core were flipped by the noise, and by a Chernoff bound the probability of that event is at most $n^{-\Theta(c)}$. Therefore, by picking an appropriate value of $c$, we can ensure that we misclassify only a constant number of vertices in expectation.
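The effect of core size on the error probability can be checked empirically. A small Monte Carlo sketch (our own illustration, with arbitrary trial counts):

```python
import random

def misclassify_prob(core_size, alpha, trials=10000, seed=1):
    """Estimate the chance that a vertex belonging to the core's own clique
    is misclassified: each of its core_size edges survives the noise with
    probability 1 - alpha, and the vertex is misjudged when at most half
    of those edges survive."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        surviving = sum(rng.random() >= alpha for _ in range(core_size))
        if surviving <= core_size / 2:
            errors += 1
    return errors / trials

# With alpha = 0.1, errors become rapidly rarer as the core grows:
print(misclassify_prob(5, 0.1), misclassify_prob(40, 0.1))
```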

Generating the Core

How do we generate the core to begin with? There is the brute-force option of examining all ${n \choose \log n}$ candidate cores of size $\log n$, but this set of candidates is far too large. Instead, we perform the following sampling procedure, which seems to work quite well in practice:

1. Sample $d \log n$ vertices from $G$ (this probably picks vertices from both cliques)
2. Sample $d' \log \log n$ vertices from the first sample
3. Partition these into cliques
4. By brute force on these small vertex sets, find potential members of each core
5. Build these small cores into larger cores using the original vertex samples
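The procedure above omits many details (in particular, how the brute-force partition step works). Purely as an illustration of its shape, here is a sketch for the two-clique case; the constants `d` and `d2` and the scoring rule are our own illustrative choices:

```python
import math
import random
from itertools import combinations

def find_cores(G, d=4, d2=3, seed=0):
    """Sketch of the sampling procedure for the two-clique case."""
    rng = random.Random(seed)
    n = len(G)
    # Step 1: sample about d*log(n) vertices from G.
    sample = rng.sample(range(n), min(n, int(d * math.log(n)) + 1))
    # Step 2: sample about d2*log(log(n)) vertices from the first sample.
    tiny = rng.sample(sample, min(len(sample),
                                  int(d2 * math.log(math.log(n) + 2)) + 2))
    # Steps 3-4: brute force over all bipartitions of the tiny set, scoring
    # each by edges inside the parts plus non-edges across them.
    best, best_score = None, -1
    for r in range(1, len(tiny)):
        for side in combinations(tiny, r):
            a, b = set(side), set(tiny) - set(side)
            score = (sum(G[u][v] for u, v in combinations(sorted(a), 2))
                     + sum(G[u][v] for u, v in combinations(sorted(b), 2))
                     + sum(1 - G[u][v] for u in a for v in b))
            if score > best_score:
                best, best_score = (a, b), score
    # Step 5: grow the two tentative cores over the larger sample, attaching
    # each vertex to the side it shares the larger fraction of edges with.
    cores = [set(best[0]), set(best[1])]
    for x in sample:
        if x in cores[0] or x in cores[1]:
            continue
        frac = [sum(G[x][c] for c in core) / len(core) for core in cores]
        cores[frac[1] > frac[0]].add(x)
    return cores

# Noise-free two-clique graph on 12 vertices ({0..5} and {6..11}):
G = [[1 if (i < 6) == (j < 6) else 0 for j in range(12)] for i in range(12)]
cores = find_cores(G)
```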

The Practical Algorithm

Does the above approach lead to a real, practical heuristic algorithm? Yes. It is based on the key concept of affinity to a core.

First we pick a threshold $t$ such that we add the vertex currently under consideration to the component currently under consideration ($C$) if the similarity of the vertex to the component is greater than $t$. Later on we might want to remove early-placed members of the component if they no longer have high affinity to the resulting cluster.

We define $a(x)$ to be the average affinity of vertex $x$ to the members of the current cluster. The algorithm then proceeds as follows:

Repeat until every vertex is clustered (freezing each cluster's membership when its loop finishes):
    Pick the vertex with the highest affinity to the members of $C$ and add it to $C$.
    Repeat until no more changes (or until we feel like stopping):
        For each vertex $x \not\in C$: if $a(x) > t$, add $x$ to $C$.
        For each vertex $x \in C$: if $a(x) < t$, remove $x$ from $C$.

At the end of the algorithm we perform one final pass during which every point is compared to all of the clusters and moved to the cluster with which it has the highest affinity. This corrects for local optima that may have been encountered along the way.
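The loop above, together with the final reassignment pass, might be sketched as follows. This is our own simplified illustration, not the paper's exact procedure: `S` is a symmetric affinity matrix, `t` the threshold, and the iteration cap is an added safeguard against oscillation.

```python
def cluster(S, t):
    """Sketch of the greedy affinity clustering loop: grow one cluster C
    at a time, adding vertices whose average affinity a(x) to C exceeds t
    and removing members whose affinity drops below t, then freeze C."""
    n = len(S)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Seed C with the unassigned vertex of highest total affinity.
        seed = max(unassigned, key=lambda x: sum(S[x][y] for y in unassigned))
        C = {seed}

        def a(x):
            return sum(S[x][y] for y in C) / len(C)

        for _ in range(100):  # iteration cap guards against oscillation
            changed = False
            for x in sorted(unassigned - C, key=a, reverse=True):
                if a(x) > t:                  # add high-affinity outsiders
                    C.add(x)
                    changed = True
            for x in sorted(C):
                if len(C) > 1 and a(x) < t:   # drop low-affinity members
                    C.remove(x)
                    changed = True
            if not changed:
                break
        clusters.append(C)
        unassigned -= C
    # Final pass: move every point to its highest-affinity cluster.
    for x in range(n):
        for c in clusters:
            c.discard(x)
        best = max(clusters,
                   key=lambda c: sum(S[x][y] for y in c) / len(c) if c else 0.0)
        best.add(x)
    return [c for c in clusters if c]

# Two well-separated affinity blocks:
S = [[1.0 if (i < 3) == (j < 3) else 0.0 for j in range(6)] for i in range(6)]
print(cluster(S, 0.5))  # two clusters: {0,1,2} and {3,4,5}
```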

What does the practical algorithm have to do with the theoretical algorithm? Not much, aside from the concept of affinity to cores. We do not want to perform the brute-force core enumeration, so instead we just start somewhere and reserve the right to remove points from clusters as the algorithm progresses.
