# CSE 527 Lecture 7 Non-Hierarchical Clustering

Lecturer: Larry Ruzzo
Notes: Daniel Grossman (grossman@cs)

October 25, 2001

# Brief Discussion of Graph Theory

A graph is defined a collection of vertices (or nodes) and edges connecting them. Formally, , where is a set of unordered pairs from . Figure 1 shows an example graph: edge but edge .

There are multiple ways of representing a graph in computer memory, from which we select the adjacency matrix for our discussion. In an adjacency matrix , if edge exists in . The main diagonal is sometimes populated with s and sometimes with s; it makes no difference as long as the graph has no unit-length cycles. Here is the adjacency matrix corresponding to our example graph (notice that it is symmetric by definition):

#### Cliques

The clique is an important concept in graph theory. Also known as a complete graph, it is defined as a graph where every vertex is adjacent to every other. In computational biology we use cliques as a method of abstracting pairwise relationships such as protein-protein interaction or gene similarity. In the latter case, we might want to establish an edge between the vertices representing two genes if those two genes have, say, similar expression profiles over several timepoints of an experiment. If we were to find a clique in such a graph, we would have an important finding. A goal might be to find the largest clique in the graph.

#### Finding -Cliques

A useful question to then ask would be, is there a k-clique in the graph? I.e. is there at least one subgraph of at least size whose vertices are completely connected to each other? Here is an algorithm which would allow us to solve this problem:

Examine all subsets of size and determine if any is a clique.

From a logistical standpoint, the computer should have an easy time checking whether a graph is a clique: in the adjacency matrix a clique is represented simply by a subarray containing only s. The problem with the algorithm is that the number of subsets of size in the graph is

which is far too many to check one by one. For example, in a graph of 100 vertices, if we would like to search for 10-cliques, we have

subgraphs to examine! Considering that typical gene expression datasets contain thousands of genes (nodes) and that investigators are interested in cliques of several dozen, this algorithm would require an impractical amount of time.

Here is a more specific example of the -clique problem: Given graph of size , is there a clique of size (i.e. of the graph)? Since the clique size is a function of the problem input size, our naive algorithm would require more than subgraph examinations, meaning that the amount of work we would have to do is exponential in the size of the input.

In fact, no known algorithm for solving this problem can be proven to require an amount of work that is less than exponentially related to the size of the input. The problem is NP-complete, meaning that it lies in a category of problems for which no polynomial-time (i.e. requiring only an amount of work that is polynomially related to the problem input size) algorithms are known. If any problem in this category were shown to be solvable via a P-time (polynomial-time) algorithm, then that algorithm could be used to solve all the rest of the NP-complete problems. Since computer scientists have been looking unsuccessfully for such P-time algorithms for several decades, it is clear that we must take a different approach to answer our biological clustering questions.

# Solving Clustering Problems

While the general theoretical problem of identifying cliques may be intractable, we can solve a simpler problem. Given a graph that is a clique graph -- i.e. a union of disjoint cliques (see Figure 2) -- that represents the true state of whatever similarity/interaction concept we are trying to discover, let us define , the graph of data that we actually measure, to be the true graph () plus noise, where the noise is defined by a constant such that all edges and non-edges are switched on or turned off in with probability .

Given the true graph it would be trivial to find the cliques in our data. Unfortunately, our experimental methods measure . So, can we estimate given ? Is there an algorithm to reconstruct ? In general, we've already shown the answer to be no. A happy medium that is often acceptable in practice is constructing a clique graph such that

with probability . That is, we can create a clique graph that shares more edges with the observed graph than does the true clique graph. Most importantly, we can do this partial reconstruction of the true graph in polynomial time.

## Example

Let us consider a simple example problem. We have the following adjacency matrix for the true graph :

We have the following parameters for our algorithm:
• = size of the graph
• = an upper bound on the error rate in the data
• = the probability of failure of our algorithm
• = the size of the smallest clique we are looking to be able to reconstruct
Because , we expect to see more ones than zeroes in the clique regions of 's adjacency matrix, and vice versa in the non-clique regions. Of course, we would never expect to be able to recover cliques of such small size as those in above which is why we impose the limit. We do not stipulate that our algorithm must be able to find even very small cliques, which could easily be corrupted more than 50% simply by chance.

## Practicality

Before we continue on to a discussion of our algorithm for this simpler problem, we pause to discuss why we bother to work on this problem at all. Would we realistically expect to find easy to resolve clique graphs in the real world, even in the absence of noise? The answer, of course, is that in biological problems, the relationships we are trying to learn are so complex that such clique graphs would indeed not exist. However, solving our simpler problem will help us find heuristics, answers, and insights for solving the real-life systems, as discussed in Ben-Dor, Shamir, & Yakhini [JCB '99], the paper in which the current algorithm was presented.

The bottom line of the simpler problem'' discussion is that Ben-Dor et al. arrive at an algorithm for any fixed choice of , where the constants depend heavily on those fixed values. For example, for turns out to be 600. So, while Ben-Dor's algorithm is polynomial in the size of the input, it is impractical. However, the theorem governing the validity of the algorithm's output is correct. Moreover, time bounds in theoretical papers are often conservative, meaning that the algorithm could perhaps work in practice for some non-trivial problems.

## Brief Look at the Algorithm's Inner Workings

Omitting several key components of the approach, we focus on the key insight in the Ben-Dor algorithm, the core. Suppose has only two cliques, one containing 90% of the data, and the other, 10%. We define a core as a set of vertices all located in one clique. Now consider some other vertex not in the core. What do we expect of ? If it is in the same clique as the core, then (in the noisy graph ) the expected fraction of edges between and the core is . Conversely, if is not in the same clique as the core, then the expected fraction of edges connecting the two is . See Figure 3.

So, given a core, we classify all vertices based on what fraction of its possible edges to the core actually exist. If has greater than of its edges connected to the core, then we decide that is in the same clique. If less than of the possible edges exist in , we conclude that the vertex lies outside the clique. (Note that if our core is small to begin with, our shared-edges test does not have much statistical power.)

Suppose we start with a core of size vertices. What is the chance of a vertex classification error if ? It is equivalent to

• the probability of having connectedness to the core when we expect , or
• the probability of having connectedness to the core when we expect .
Therefore, by picking an appropriate value of , we can assure that we only misclassify some constant number of edges.

## Generating the Core

How do we generate the core to begin with? There is the brute force option by which we analyze all cores of size , but this set of cores is too large. Instead, we perform the following sampling procedure that seems to work quite well in practice:

Sample vertices from (probably picks vertices from both cliques)
Sample vertices from the first sample
Partition these into cliques
By brute force on these small vertex sets, find potential members of each core
Build these small cores into larger size cores using the original vertex samples

## The Practical Algorithm

Is there a real heuristic practical algorithm to which the above approach leads us? Yes. It is based on the key core attraction concept.

First we pick a threshold such that we add the current vertex under consideration to the current component under consideration () if the similarity of the vertex to the componet is greater than . Later on we might want to remove early-placed members of the component is they no longer have high affinity to the resulting cluster.

We define to be the average affinity of node x to the current cluster. Now ...

Repeat until everything clustered (freezing the cluster results each loop):
{
Pick the vertex with the highest to members of and add it to .
Repeat until no more changes (or until we feel like stopping):
{
For each vertex , if then add to
For each , if then remove from
}
}

At the end of the algorithm we perform one final pass during which every point is compared to all of the clusters and moved to the cluster with which it has the highest affinity. This corrects for local optima that may have been encountered along the way.

What does the practical algorithm have to do with the theoretical algorithm? Not much, aside from the concept of affinity to cores. We do not want to perform the brute force core enumeration, so instead we just start somewhere and reserver the right to remove points from clusters as the algorithm progresses.

CSE 527
Lecture 7
Non-Hierarchical Clustering

This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -split 0 -image_type gif lecture7notes.tex

The translation was initiated by Jochen Jaeger on 2001-11-08

Jochen Jaeger 2001-11-08