Lecturer: Larry Ruzzo
Notes: Daniel Grossman (grossman@cs)
October 25, 2001
Examine all subsets of size $k$ and determine if any is a clique.
From a logistical standpoint, the computer should have an easy time
checking whether a given subgraph is a clique: in the adjacency matrix a clique
is represented simply by a subarray containing only 1s. The problem
with the algorithm is that the number of subsets of size $k$ in the
graph is $\binom{n}{k}$.
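To make the naive procedure concrete, here is a minimal Python sketch; the 0/1 adjacency-matrix representation and the function names are illustrative choices, not from the original notes.

    from itertools import combinations

    def is_clique(adj, verts):
        # True iff every pair in verts is connected; in the adjacency
        # matrix a clique shows up as a subarray containing only 1s.
        return all(adj[u][v] == 1 for u, v in combinations(verts, 2))

    def has_k_clique(adj, k):
        # The naive algorithm: examine all C(n, k) subsets of size k.
        n = len(adj)
        return any(is_clique(adj, s) for s in combinations(range(n), k))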
Here is a more specific example of the $k$-clique problem: Given a graph $G$
of size $n$, is there a clique of size $n/10$ (i.e. $10\%$ of the
graph)? Since the clique size is a function of the problem input
size, our naive algorithm would require more than $2^{n/10}$ subgraph
examinations, meaning that the amount of work we would have to do is
exponential in the size of the input.
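One way to see the blow-up (a standard binomial-coefficient bound, not spelled out in the original notes): using $\binom{n}{k} \ge (n/k)^k$ with $k = n/10$,

\[
\binom{n}{n/10} \ge 10^{n/10} > 2^{n/10},
\]

so for $n = 1000$ genes there are already more than $2^{100} \approx 10^{30}$ candidate subsets to examine.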
In fact, no known algorithm for solving this problem can be proven to require an amount of work that is less than exponentially related to the size of the input. The problem is NP-complete, meaning that it lies in a category of problems for which no polynomial-time (i.e. requiring only an amount of work that is polynomially related to the problem input size) algorithms are known. If any problem in this category were shown to be solvable via a P-time (polynomial-time) algorithm, then that algorithm could be used to solve all the rest of the NP-complete problems. Since computer scientists have been looking unsuccessfully for such P-time algorithms for several decades, it is clear that we must take a different approach to answer our biological clustering questions.
Given the true clique graph $H$ it would be trivial to find the cliques in
our data. Unfortunately, our experimental methods measure a corrupted
version $G$ of $H$. So, can we estimate $H$ given $G$? Is there an
algorithm to reconstruct $H$ exactly? In general, we've already shown
the answer to be no. A happy medium that is often acceptable in
practice is constructing a clique graph $H'$ that differs from $H$ on
only a small fraction of the vertex pairs.
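The corruption model here is that each vertex pair of the true clique graph is flipped independently with some probability $\alpha$. A minimal Python sketch of that process (the function and parameter names are mine):

    import random

    def corrupt(H, alpha, seed=None):
        # Flip each vertex pair of the true clique graph H independently
        # with probability alpha, yielding the observed graph G.
        rng = random.Random(seed)
        n = len(H)
        G = [row[:] for row in H]
        for u in range(n):
            for v in range(u + 1, n):
                if rng.random() < alpha:
                    G[u][v] = G[v][u] = 1 - G[u][v]
        return G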
The bottom line of the ``simpler problem'' discussion is that Ben-Dor
et al. arrive at an $O(n^2 \log^c n)$ algorithm for any fixed
choice of the noise rate $\alpha$ and minimum cluster-size fraction
$\gamma$, where the constants depend heavily on those fixed values.
For example, for reasonable settings of $\alpha$ and $\gamma$, the
exponent $c$ turns out to be 600. So, while Ben-Dor's algorithm is
polynomial in the size of the input, it is impractical. However, the
theorem governing the validity of the algorithm's output is correct.
Moreover, time bounds in theoretical papers are often conservative,
meaning that the algorithm could perhaps work in practice for some
non-trivial problems.
So, given a core, we classify each vertex $v$ based on what fraction
of its possible edges to the core actually exist. If $v$ has greater
than half of its possible edges connected to the core, then we decide that $v$
is in the same clique. If fewer than half of the possible edges exist
in $G$, we conclude that the vertex lies outside the clique. (Note
that if our core is small to begin with, our shared-edges test does
not have much statistical power.)
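A sketch of that majority test in Python, reusing the adjacency-matrix convention from the earlier sketch (the function name is mine):

    def in_clique(adj, v, core):
        # Classify v by the fraction of its possible edges into the
        # core that actually exist: more than half means "same clique".
        members = [u for u in core if u != v]
        present = sum(adj[v][u] for u in members)
        return present > len(members) / 2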
Suppose we start with a core of size $c$ vertices. What is the
chance of a vertex classification error if the noise rate is
$\alpha < 1/2$? It is equivalent to the chance that more than half of
$c$ independent coin flips, each coming up heads with probability
$\alpha$, land heads, which is exponentially small in $c$.
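Quantitatively (a standard Hoeffding bound, not spelled out in the original notes), the misclassification probability is at most

\[
\Pr\left[\mathrm{Bin}(c, \alpha) > \frac{c}{2}\right] \le e^{-2c(1/2 - \alpha)^2},
\]

which shrinks exponentially as the core size $c$ grows.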
The theoretical algorithm, at a high level (a code sketch follows the list):

1. Sample a small set of vertices from $G$ (this probably picks vertices from both cliques).
2. Sample an even smaller set of vertices from the first sample.
3. Partition these into cliques.
4. By brute force on these small vertex sets, find potential members of each core.
5. Build these small cores into larger size cores using the original vertex samples.
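A toy rendering of those steps in Python, reusing is_clique and in_clique from the sketches above; the sample sizes and all names here are my own illustrative assumptions, not the paper's:

    import math
    import random
    from itertools import combinations

    def sample_cores(adj, seed=0):
        # Sample a small vertex set, then an even smaller one; every
        # clique found by brute force in the tiny set becomes a
        # candidate seed, grown against the larger sample via the
        # majority test.
        rng = random.Random(seed)
        n = len(adj)
        big = rng.sample(range(n), min(n, max(4, 2 * int(math.log2(n)))))
        tiny = rng.sample(big, min(len(big), 4))
        seeds = [set(s) for r in range(2, len(tiny) + 1)
                 for s in combinations(tiny, r) if is_clique(adj, s)]
        cores = []
        for sd in seeds:
            core = set(sd)
            for v in big:
                if v not in core and in_clique(adj, v, core):
                    core.add(v)
            cores.append(core)
        return cores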
First we pick a threshold $t$ such that we add the current vertex
under consideration to the current component under consideration ($C$)
if the similarity of the vertex to the component is greater than $t$.
Later on we might want to remove early-placed members of the component
if they no longer have high affinity to the resulting cluster.
We define $a(x)$ to be the average affinity of node $x$ to the
current cluster. Now ...
Repeat until everything is clustered (freezing the cluster results each loop):
{
    Pick the vertex $x$ with the highest affinity to members of $C$ and add it to $C$.
    Repeat until no more changes (or until we feel like stopping):
    {
        For each vertex $x \notin C$: if $a(x) > t$ then add $x$ to $C$
        For each $x \in C$: if $a(x) < t$ then remove $x$ from $C$
    }
}
At the end of the algorithm we perform one final pass during which
every point is compared to all of the clusters and moved to the
cluster with which it has the highest affinity. This corrects for
local optima that may have been encountered along the way.
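Putting the whole practical loop together, here is a compact Python sketch in the spirit of this heuristic; the similarity matrix sim, the seeding rule, and the iteration cap are my own assumptions filling gaps the prose leaves open, not Ben-Dor et al.'s code:

    def average_affinity(sim, x, cluster):
        # a(x): average similarity of x to the members of the cluster.
        members = [u for u in cluster if u != x]
        return sum(sim[x][u] for u in members) / len(members) if members else 0.0

    def cast(sim, t):
        # Grow one cluster at a time, adding high-affinity vertices and
        # removing low-affinity ones, then run the final reassignment pass.
        n = len(sim)
        unassigned, clusters = set(range(n)), []
        while unassigned:
            # Assumed seeding rule: start C from the unassigned vertex
            # with the highest total similarity to the rest.
            seed = max(unassigned, key=lambda v: sum(sim[v][u] for u in unassigned))
            C = {seed}
            changed, rounds = True, 0
            while changed and rounds < 100:  # "until we feel like stopping"
                changed, rounds = False, rounds + 1
                for x in list(unassigned - C):
                    if average_affinity(sim, x, C) > t:
                        C.add(x)
                        changed = True
                for x in list(C):
                    if len(C) > 1 and average_affinity(sim, x, C) < t:
                        C.remove(x)
                        changed = True
            unassigned -= C
            clusters.append(C)  # freeze this cluster's results
        # Final pass: move every point to its highest-affinity cluster.
        for x in range(n):
            best = max(clusters, key=lambda c: average_affinity(sim, x, c))
            for c in clusters:
                c.discard(x)
            best.add(x)
        return clusters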
What does the practical algorithm have to do with the theoretical algorithm? Not much, aside from the concept of affinity to cores. We do not want to perform the brute-force core enumeration, so instead we just start somewhere and reserve the right to remove points from clusters as the algorithm progresses.