# Clustering (Continued)

### October 23, 2001 Lecturer: Larry Ruzzo Notes: Shabnam Erfani

Last lecture we discussed the hierarchical clustering algorithm. Before we continue, we review the basics of algorithm runtime performance analysis.

In the early days, the notion of algorithm efficiency was based on the time consumed by the algorithm implementation. This is not a good measure of performance, since it depends on the implementation quality, collection of data sets, and computing power, so it cannot be used for benchmarking and comparing algorithms. Instead, we use a mathematical model to estimate the performance efficiency of an algorithm.

# Asymptotic Analysis of Computational Complexity

This is a mathematical method for performance measurement in terms of:

• Problem size (e.g. number of points, number of measurements) n
• Basic computing operations, what a computer can do in a unit of work. Typical basic operations are: +, -, *, /, compare, index into a table
• Growth rate of number of operations as a function of n. We use a notation called big-O notation to indicate this. For example, an algorithm that has a growth rate of n2, is annotated as O(n2).

So once we conceive the algorithm, to get an idea about its performance we estimate the number of basic operations it does as a function of problem size.

Suppose we have a problem where we have three algorithms to solve it. Each algorithm has a different performance characteristic: n3, 100n2, 1000n. The question is, which one is better? The answer depends on n. If we plot these functions vs. n, we see that depending on which region of n we consider we may choose a different algorithm for better performance. In general though, for large n, the algorithm whose growth rate is lower on n will dominate. The notation can sometimes be misleading because it ignores the constant factor. For example, if we have n3 and (10100)n, at first glance we may think the second one is better (O(n3) vs. O(n)), but it turns out that n3 is better unless n is really large.

An advantage of using this method is that we do not need to be very precise about the basic operations that the algorithm performs.

There are a number of functions that are seen frequently in algorithm analysis:

• Log n: is a function that grows without bound, but slowly
• N*log n
• O(n) < O(nlog n) < O(n2)
• 2n, en, 10n, n! are functions that grow faster than polynomials
• O (functions above)   >> O(nk )for any k

# Clustering (Continued from last lecture)

The clustering algorithm we talked about last time was called Hierarchical Agglomerative Clustering (HAG).  This algorithm assumes that every point is a cluster by itself. The algorithm finds the closest clusters and merges them.

There is a similar algorithm called Hierarchical Divisive Clustering, where we start with a big cluster, and divide it into smaller clusters iteratively.

A Naďve Algorithm for HAG:

1. Find closest pair of clusters.
1. Need to know the distances between all pairs and find the one with minimum

i.      it takes O(n2) to calculate the distance, and O(n2) to find the min

1. Merge
2. Repeat (n-1 times)

Ţ This algorithm is O(n3)

We can improve this algorithm using single link for distance function as follows:

1. Calculate the distances between pairs of notes and store in an (n x n) matrix such that the value stored in row i, column j is the distance from cluster i to j, i.e.
1. dij= distance from element i to j.
2. This is a symmetric matrix, assuming the distance function is reflexive symmetric. Even if it is not, we can take the avg of dij and dji to make it symmetric.
3. O(n2) to perform this calculation
2. Calculate the min in each row of the matrix and store
3. Find the minimum of the row mins
1. This gives the closest pair (row and column) to merge
2. If we end up merging rows, most of the table remains unchanged
3. The distance to k (resulting from merging i and j) is min(dik, djk). We build a new row with the proper values filled in
4. Delete rows i and j and insert the new row overlaying i
1. To keep the symmetry in the matrix, we need to update the columns the same way
5. Compact the matrix by copying the last row/column into j (this step will save a bit on the amount of memory required by the algorithm)
6. Update the row mins by comparing to new column, and get the row min for the new row
7. Repeat (goto 3)

Note: This algorithm looks awfully like Kruskal’s algorithm for finding minimum spanning tree in a graph.

Comment in class: This algorithm is not cache-friendly and is not parallel.

Steps 1 and 2 above are done in O(n2)

Steps 3-6 are all O(n) and they are repeated n-1 times

Ţ This algorithm has runtime performance of O(n2)

Now let’s consider the same algorithm with complete link as the distance function. This means to calculate the distance, we look at the farthest points in the cluster. Otherwise, the algorithm remains the same. In other words, for single link we considered min(min(rows)), whereas for complete link we consider max(min(rows)). Once we merge the rows, the distance from k to the new row (merged i and j) is max(dik, djk). This however introduces a complication. What if the row min came from one of the columns that merged? This means we need to recalculate the min, which hinders our performance. The best algorithm that Dr. Ruzzo has come across has O(n2logn), but O(n2) may exist.

The algorithm uses a data structure called Heap:

• It stores n items
• Items can be inserted and deleted
• Min item can be found
• All these operations can be done in O(log n)

Using the Heap structure, the algorithm is changed to:

1. Calculate distance
1. O(log n)
2. Calculate min of each row and insert in Heap
1. O(nlog n)
3. Find min of all row mins
4. Delete rows i and j
5. Insert new row in the Heap
1. no update or compacting needed, Heap does it
6. Repeat

Ţ This algorithm performs in O(n2log n)

Note 1: this algorithm has n2 memory requirement, which is nasty. This is because we have a heap of size n that we use to insert and find mins quickly, and an array to correlate the min back to the row.

Note 2: if we have d dimensions, the performance becomes n2d +n2logn