Lecturer: Larry Ruzzo

Notes: Shabnam Erfani

Last lecture we discussed the hierarchical clustering algorithm. Before we continue, we review the basics of algorithm runtime performance analysis.

In the early days, the notion of algorithm efficiency was based on the time consumed by the algorithm implementation. This is not a good measure of performance, since it depends on the implementation quality, collection of data sets, and computing power, so it cannot be used for benchmarking and comparing algorithms. Instead, we use a mathematical model to estimate the performance efficiency of an algorithm.

This is a mathematical method for performance measurement in terms of:

- Problem size n (e.g. number of points, number of measurements)
- Basic computing operations: what a computer can do in a unit of work. Typical basic operations are: +, -, *, /, compare, index into a table
- Growth rate of the number of operations as a function of n. We use a notation called big-O notation to indicate this. For example, an algorithm whose number of operations grows like n^{2} is annotated as **O**(n^{2}).

So once we have conceived an algorithm, to get an idea of its performance we estimate the number of basic operations it performs as a function of the problem size.
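As a tiny illustration (this snippet is mine, not from the lecture), the following Python code counts the comparisons performed while finding the minimum of a list; the count is n-1, so the growth rate is **O**(n):

```python
# Count basic operations (here: comparisons) as a function of problem size n,
# using "find the minimum of a list" as the algorithm under analysis.

def find_min_count_ops(values):
    """Return (minimum, number of comparisons performed)."""
    comparisons = 0
    smallest = values[0]
    for v in values[1:]:
        comparisons += 1          # one basic operation: a comparison
        if v < smallest:
            smallest = v
    return smallest, comparisons

for n in (10, 100, 1000):
    _, ops = find_min_count_ops(list(range(n, 0, -1)))
    print(n, ops)                 # always n-1 comparisons: growth rate O(n)
```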

Suppose we have a problem and three algorithms to solve it, each with a different performance characteristic: n^{3}, 100n^{2}, and 1000n. Which one is better? The answer depends on n. If we plot these functions vs. n, we see that depending on which region of n we consider, we may choose a different algorithm for better performance. In general, though, for large n the algorithm whose growth rate in n is lower will dominate. The notation can sometimes be misleading because it ignores constant factors. For example, if we have n^{3} and (10^{100})n, at first glance we may think the second one is better (**O**(n^{3}) vs. **O**(n)), but it turns out that n^{3} is smaller unless n is really large: the linear algorithm does not win until n exceeds 10^{50}.

An advantage of using this method is that we do not need to be very precise about the basic operations that the algorithm performs.

There are a number of functions that are seen frequently in algorithm analysis:

- log n: a function that grows without bound, but slowly
- n log n: **O**(n) < **O**(n log n) < **O**(n^{2})
- 2^{n}, e^{n}, 10^{n}, n!: functions that grow faster than any polynomial; **O**(any of the above) >> **O**(n^{k}) for any k

The clustering algorithm we talked about last time was called Hierarchical Agglomerative Clustering (HAC). This algorithm starts with every point as a cluster by itself, then repeatedly finds the two closest clusters and merges them.

There is a similar algorithm called Hierarchical Divisive Clustering, where we start with a big cluster, and divide it into smaller clusters iteratively.

A Naïve Algorithm for HAC (a sketch in code follows):

- Find the closest pair of clusters.
- To do this, we need to know the distances between all pairs and find the one with the minimum: it takes **O**(n^{2}) to calculate the distances, and **O**(n^{2}) to find the min.
- Merge them.
- Repeat (n-1 times).

⇒ This algorithm is **O**(n^{3})
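Here is a minimal Python sketch of the naive loop (illustrative code, not the lecture's; the function names and the use of single link as the cluster distance are my choices). Each pass examines every pair of clusters, which costs **O**(n^{2}) point-pair distance evaluations, and there are n-1 merge passes, matching the **O**(n^{3}) bound above:

```python
def single_link(c1, c2, dist):
    """Cluster distance = closest pair of points across the two clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def naive_hac(points, dist):
    clusters = [[p] for p in points]   # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # find the closest pair of clusters by brute force
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_link(clusters[a], clusters[b], dist)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merges.append((list(clusters[a]), list(clusters[b])))
        clusters[a].extend(clusters[b])    # merge cluster b into cluster a
        del clusters[b]
    return merges

# e.g. naive_hac([1.0, 2.0, 9.0], lambda p, q: abs(p - q))
# merges [1.0] with [2.0] first, then merges [9.0] into the result
```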

We can improve this algorithm using single link as the distance function, as follows (a code sketch of these steps appears after the list):

1. Calculate the distances between all pairs of points and store them in an (n x n) matrix, such that the value stored in row i, column j is the distance from cluster i to cluster j, i.e. d_{ij} = distance from element i to j. This is a symmetric matrix, assuming the distance function is symmetric; even if it is not, we can take the average of d_{ij} and d_{ji} to make it symmetric. **O**(n^{2}) to perform this calculation.
2. Calculate the min in each row of the matrix and store it.
3. Find the minimum of the row mins. This gives the closest pair (row and column) to merge.
4. Merge. If we end up merging rows i and j, most of the table remains unchanged: the distance from any cluster k to the cluster resulting from the merge is min(d_{ik}, d_{jk}). We build a new row with the proper values filled in, delete rows i and j, and insert the new row overlaying i. To keep the symmetry of the matrix, we update the columns the same way.
5. Compact the matrix by copying the last row/column into j (this step saves a bit on the amount of memory required by the algorithm).
6. Update the row mins by comparing them against the new column, and compute the row min for the new row.
7. Repeat (go to step 3).
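Here is a Python sketch of steps 1-7 (illustrative; the function name and bookkeeping details are my choices, and instead of physically compacting the matrix as in step 5 it simply marks merged rows as dead, which does not change the asymptotics):

```python
def single_link_hac(D):
    """D: symmetric n x n distance matrix (list of lists). Returns merge order."""
    n = len(D)
    D = [row[:] for row in D]                     # work on a copy
    alive = set(range(n))                         # clusters still in play

    def row_min(r):                               # step 2 helper: min over live columns
        return min((D[r][k], k) for k in alive if k != r)

    mins = {r: row_min(r) for r in alive}         # steps 1-2 bookkeeping: O(n^2) total
    merges = []
    while len(alive) > 1:
        i = min(alive, key=lambda r: mins[r][0])  # step 3: min of the row mins, O(n)
        j = mins[i][1]
        merges.append((i, j))
        alive.discard(j)                          # step 4: merged cluster overlays row i
        del mins[j]
        for k in alive:
            if k == i:
                continue
            D[i][k] = D[k][i] = min(D[i][k], D[j][k])   # d_ik becomes min(d_ik, d_jk)
            val, col = mins[k]
            # step 6: with single link the merged column can only move closer,
            # so an O(1) comparison keeps row k's stored min valid
            if col in (i, j) or D[k][i] < val:
                mins[k] = (D[k][i], i)
        if len(alive) > 1:
            mins[i] = row_min(i)                  # refresh the merged row's min: O(n)
    return merges

# e.g. single_link_hac([[0, 1, 8], [1, 0, 7], [8, 7, 0]]) returns [(0, 1), (0, 2)]
```

The key point is the step-6 update: with single link, the merged column can only move closer, so each row's stored min stays valid after an O(1) comparison, keeping every merge at **O**(n).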

Note: This algorithm looks awfully like Kruskal’s algorithm for finding minimum spanning tree in a graph.

Comment in class: This algorithm is not cache-friendly, and it does not parallelize easily.

Steps 1 and 2 above are done in **O**(n^{2}). Steps 3-6 are all **O**(n), and they are repeated n-1 times.

⇒ This algorithm has runtime performance of **O**(n^{2})

Now let’s consider the same algorithm with complete link as the distance function. This means that to calculate the distance between two clusters, we look at the farthest pair of points across them; otherwise, the algorithm remains the same. We still merge the pair of clusters whose distance is smallest, but the distances themselves are maxima rather than minima over point pairs: once we merge rows i and j, the distance from any cluster k to the new row is max(d_{ik}, d_{jk}). This, however, introduces a complication. What if a row’s min came from one of the columns that merged? The merged entry can only get larger, so that row’s stored min may no longer be valid, and we need to recalculate it, which hinders our performance. The best algorithm that Dr. Ruzzo has come across is **O**(n^{2} log n), but **O**(n^{2}) may exist.
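A tiny numeric illustration (with values invented for this note) of the complication:

```python
# Row k's stored min was d_ki = 2.0, at column i. Suppose clusters i and j merge.
d_ki, d_kj, d_km = 2.0, 9.0, 5.0

merged_single   = min(d_ki, d_kj)   # 2.0 -- entry can only shrink; old row min survives
merged_complete = max(d_ki, d_kj)   # 9.0 -- old row min is destroyed

# With complete link the true new min of row k (d_km = 5.0) may sit in any
# untouched column, so the whole row must be rescanned: O(n) instead of O(1).
print(merged_single, merged_complete)
```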

The algorithm uses a data structure called a heap:

- It stores n items.
- Items can be inserted and deleted.
- The min item can be found.
- All these operations can be done in **O**(log n).
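For concreteness, here is how these operations look with Python’s heapq module (my choice of library, not the lecture’s; heapq has no arbitrary-delete operation, so the sketch uses lazy deletion, a standard workaround, for deletes):

```python
import heapq

heap = []
heapq.heappush(heap, (3.1, "row 0"))   # insert: O(log n)
heapq.heappush(heap, (1.7, "row 1"))
heapq.heappush(heap, (2.4, "row 2"))

stale = {"row 1"}                      # pretend row 1 was merged away (deleted)
while heap and heap[0][1] in stale:    # discard stale entries as they surface
    heapq.heappop(heap)
print(heap[0])                         # current min: (2.4, 'row 2')
```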

Using the heap structure, the algorithm is changed to the following (a code sketch appears after the runtime bound below):

1. Calculate the distances: **O**(n^{2}).
2. Calculate the min of each row and insert it into the heap: **O**(n log n).
3. Find the min of all row mins: **O**(log n).
4. Delete rows i and j.
5. Insert the new row into the heap.
6. No update or compacting needed; the heap takes care of it.
7. Repeat.

⇒ This algorithm performs in **O**(n^{2} log n)
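Putting the pieces together, here is a rough Python sketch of the heap-driven loop for complete link (the version-counter lazy deletion and all names are my assumptions; in the worst case the rescans in `refresh` can exceed the stated bound, so treat this as an illustration of the bookkeeping rather than a tight implementation of the **O**(n^{2} log n) result):

```python
import heapq

def complete_link_hac(D):
    """Sketch: complete-link HAC driven by one global heap of row minima.
    Lazy deletion via version counters stands in for the heap's delete."""
    n = len(D)
    D = [row[:] for row in D]
    alive = set(range(n))
    version = [0] * n                     # bumping a row's version invalidates
    mins = {}                             # its old entries sitting in the heap
    heap = []

    def refresh(r):                       # recompute row r's min, push to heap
        mins[r] = min((D[r][k], k) for k in alive if k != r)
        version[r] += 1
        heapq.heappush(heap, (mins[r][0], r, version[r]))

    for r in alive:                       # steps 1-2: O(n^2) + O(n log n)
        refresh(r)

    merges = []
    while len(alive) > 1:
        v, i, ver = heapq.heappop(heap)   # step 3: min of row mins, O(log n)
        if i not in alive or ver != version[i]:
            continue                      # stale entry -- lazy deletion
        j = mins[i][1]
        merges.append((i, j))
        alive.discard(j)                  # step 4: merged cluster overlays row i
        version[j] += 1                   # step 4: drop row j's heap entry
        for k in alive:
            if k == i:
                continue
            D[i][k] = D[k][i] = max(D[i][k], D[j][k])  # complete-link rule
            # the merged column only grew, so row k's stored min stays valid
            # unless it pointed at column i or j -- then rescan that row
            if mins[k][1] in (i, j):
                refresh(k)
        if len(alive) > 1:
            refresh(i)                    # step 5: push the new merged row's min
    return merges

# e.g. complete_link_hac([[0, 1, 8], [1, 0, 7], [8, 7, 0]]) returns [(0, 1), (0, 2)]
```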

Note 1: this algorithm has an n^{2} memory requirement, which is nasty. Besides the n x n distance matrix, we keep a heap of size n that we use to insert items and find mins quickly, and an array to correlate each min back to its row.

Note 2: if the points have d dimensions, the performance becomes **O**(n^{2}d + n^{2} log n), since each of the n^{2} pairwise distances takes **O**(d) to compute.