Lecture 7

Non-Hierarchical Clustering

**Lecturer: Larry Ruzzo
Notes: Daniel Grossman (grossman@cs)**

**October 25, 2001**

`
Examine all subsets of size and determine if any is a clique.
`

From a logistical standpoint, the computer should have an easy time checking whether a graph is a clique: in the adjacency matrix a clique is represented simply by a subarray containing only s. The problem with the algorithm is that the number of subsets of size in the graph is

which is far too many to check one by one. For example, in a graph of 100 vertices, if we would like to search for 10-cliques, we have

subgraphs to examine! Considering that typical gene expression datasets contain

Here is a more specific example of the -clique problem: Given graph of size , is there a clique of size (i.e. of the graph)? Since the clique size is a function of the problem input size, our naive algorithm would require more than subgraph examinations, meaning that the amount of work we would have to do is exponential in the size of the input.

In fact, no known algorithm for solving this problem can be proven to
require an amount of work that is *less* than exponentially
related to the size of the input. The problem is **NP-complete**,
meaning that it lies in a category of problems for which no **polynomial-time** (i.e. requiring only an amount of work that is
polynomially related to the problem input size) algorithms are known.
If any problem in this category were shown to be solvable via a P-time
(polynomial-time) algorithm, then that algorithm could be used to
solve all the rest of the NP-complete problems. Since computer
scientists have been looking unsuccessfully for such P-time algorithms
for several decades, it is clear that we must take a different
approach to answer our biological clustering questions.

Given the true graph it would be trivial to find the cliques in
our data. Unfortunately, our experimental methods measure . So,
can we estimate given ? Is there an algorithm to reconstruct
? In general, we've already shown the answer to be no. A happy
medium that is often acceptable in practice is constructing a clique
graph such that

with probability . That is, we can create a clique graph that shares more edges with the observed graph than does the true clique graph. Most importantly, we can do this partial reconstruction of the true graph in polynomial time.

We have the following parameters for our algorithm:

- = size of the graph
- = an upper bound on the error rate in the data
- = the probability of failure of our algorithm
- = the size of the smallest clique we are looking to be able to reconstruct

The bottom line of the ``simpler problem'' discussion is that Ben-Dor
*et al.* arrive at an
algorithm for any fixed
choice of
, where the constants depend
heavily on those fixed values. For example, for
turns out to be 600. So, while Ben-Dor's algorithm is
polynomial in the size of the input, it is impractical. However, the
theorem governing the validity of the algorithm's output is correct.
Moreover, time bounds in theoretical papers are often conservative,
meaning that the algorithm could perhaps work in practice for some
non-trivial problems.

So, given a core, we classify all vertices based on what fraction of its possible edges to the core actually exist. If has greater than of its edges connected to the core, then we decide that is in the same clique. If less than of the possible edges exist in , we conclude that the vertex lies outside the clique. (Note that if our core is small to begin with, our shared-edges test does not have much statistical power.)

Suppose we start with a core of size vertices. What is the chance of a vertex classification error if ? It is equivalent to

- the probability of having connectedness to the core when we expect , or
- the probability of having connectedness to the core when we expect .

`
Sample vertices from (probably picks vertices
from both cliques)
Sample
vertices from the first sample
Partition these into cliques
By brute force on these small vertex sets, find potential
members of each core
Build these small cores into larger size cores using the original vertex samples
`

First we pick a threshold such that we add the current vertex under consideration to the current component under consideration () if the similarity of the vertex to the componet is greater than . Later on we might want to remove early-placed members of the component is they no longer have high affinity to the resulting cluster.

We define to be the average **affinity** of node x to the
current cluster. Now ...

`
Repeat until everything clustered (freezing the cluster results each loop):
{
Pick the vertex with the highest to members of and add it
to .
Repeat until no more changes (or until we feel like stopping):
{
For each vertex , if then add to
For each , if then remove from
}
}
`

At the end of the algorithm we perform one final pass during which
every point is compared to all of the clusters and moved to the
cluster with which it has the highest affinity. This corrects for
local optima that may have been encountered along the way.

What does the practical algorithm have to do with the theoretical
algorithm? Not much, aside from the concept of *affinity to
cores*. We do not want to perform the brute force core enumeration,
so instead we just start somewhere and reserver the right to remove
points from clusters as the algorithm progresses.

Lecture 7

Non-Hierarchical Clustering

This document was generated using the
**LaTeX**2`HTML` translator Version 2K.1beta (1.47)

Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.

Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.

The command line arguments were:

**latex2html** `-split 0 -image_type gif lecture7notes.tex`

The translation was initiated by Jochen Jaeger on 2001-11-08

Jochen Jaeger 2001-11-08