Last lecture we discussed the hierarchical clustering algorithm. Before we continue, we review the basics of algorithm runtime performance analysis.
In the early days, the notion of algorithm efficiency was based on the time consumed by a particular implementation. This is not a good measure of performance, since it depends on the quality of the implementation, the data sets used, and the available computing power, so it cannot be used for benchmarking and comparing algorithms. Instead, we use a mathematical model to estimate the efficiency of an algorithm.
This is a mathematical method for measuring performance in terms of the problem size and the number of basic operations the algorithm performs.
So once we have conceived an algorithm, to get an idea of its performance we estimate the number of basic operations it performs as a function of the problem size.
Suppose we have a problem and three algorithms to solve it, with running times n³, 100n², and 1000n. Which one is better? The answer depends on n. If we plot these functions against n, we see that depending on which range of n we consider, we may choose a different algorithm for better performance. In general, though, for large n the algorithm whose growth rate in n is lower will eventually be faster. The notation can sometimes be misleading because it ignores constant factors. For example, given n³ and (10^100)·n, at first glance we may think the second one is better (O(n³) vs. O(n)), but it turns out that n³ is better unless n is really large.
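To make the crossover concrete, here is a quick sketch (the function names and sample sizes are mine, for illustration) that evaluates the three running-time functions at a few problem sizes:

    # Evaluate the three example running-time functions at a few problem sizes
    # to see where the crossover happens.
    def t_cubic(n):  return n ** 3
    def t_quad(n):   return 100 * n ** 2
    def t_linear(n): return 1000 * n

    for n in (10, 100, 10_000):
        print(n, t_cubic(n), t_quad(n), t_linear(n))
    # n=10:      1,000        10,000      10,000   -> n^3 is the cheapest
    # n=100:     1,000,000    1,000,000   100,000  -> 1000n already wins
    # n=10,000:  10^12        10^10       10^7     -> 1000n wins by a huge margin

For small n the cubic algorithm is actually the cheapest, but as n grows the linear one dominates; asymptotic analysis is about this large-n behavior.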
An advantage of using this method is that we do not need to be very precise about the basic operations that the algorithm performs.
There are a number of functions that are seen frequently in algorithm analysis: constant, log n, n, n log n, n², n³, and exponential (e.g. 2ⁿ).
The clustering algorithm we talked about last time is called Hierarchical Agglomerative Clustering (HAG). The algorithm starts by treating every point as a cluster by itself; it then repeatedly finds the two closest clusters and merges them.
There is a similar algorithm called Hierarchical Divisive Clustering, where we start with one big cluster containing all the points and iteratively divide it into smaller clusters.
A Naïve Algorithm for HAG: repeatedly compute the distance between every pair of clusters, find the closest pair, and merge it, until only one cluster remains (n−1 merges in total).
i. each merge takes O(n²) to calculate the distances and O(n²) to find the min
⇒ This algorithm is O(n³)
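A minimal Python sketch of this naïve approach, using a single-link distance between clusters of 1-D points (the helper names are mine, not from the lecture):

    def cluster_distance(c1, c2):
        """Single-link distance: the closest pair of points across the two clusters."""
        return min(abs(p - q) for p in c1 for q in c2)

    def naive_hag(points, k):
        """Naive agglomerative clustering: O(n^2) work per merge, n-1 merges -> O(n^3)."""
        clusters = [[p] for p in points]              # every point starts as its own cluster
        while len(clusters) > k:
            # examine all O(n^2) cluster pairs and pick the closest one
            i, j = min(((a, b) for a in range(len(clusters))
                               for b in range(a + 1, len(clusters))),
                       key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]))
            clusters[i].extend(clusters[j])           # merge the closest pair into one cluster
            del clusters[j]
        return clusters

    print(naive_hag([1.0, 1.2, 5.0, 5.1, 9.0], k=2))  # -> [[1.0, 1.2, 5.0, 5.1], [9.0]]

Stopping at k clusters is just for the sketch; running it down to k = 1 gives the full n−1 merges analyzed above.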
We can improve this algorithm using single link as the distance function, roughly as follows (a code sketch appears after the analysis below):
1. Compute the n×n matrix of pairwise distances.
2. For each row, record the minimum entry in that row (each cluster's nearest neighbor).
3. Scan the row minima to find the overall closest pair of clusters i and j.
4. Merge clusters i and j.
5. Update the distance from every other cluster k to the merged cluster as min(d_ik, d_jk) and update the row minima accordingly.
6. Repeat from step 3 until only one cluster remains.
Note: This algorithm looks awfully like Kruskal’s algorithm for finding a minimum spanning tree in a graph.
Comment in class: This algorithm is not cache-friendly and does not parallelize easily.
Steps 1 and 2 above are done in O(n²)
Steps 3-6 are all O(n) and they are repeated n−1 times
⇒ This algorithm has runtime performance of O(n²)
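A Python sketch of this O(n²) single-link procedure, working on a precomputed distance matrix and maintaining each row’s minimum (its nearest active cluster); the names are illustrative, not from the lecture:

    def single_link_hag(D):
        """Single-link clustering on a distance matrix D; O(n^2) after the O(n^2) setup."""
        n = len(D)
        D = [row[:] for row in D]                      # work on a copy
        active = [True] * n
        # nearest[i] = (distance, index) of the closest other active cluster to i
        nearest = [min((D[i][j], j) for j in range(n) if j != i) for i in range(n)]
        merges = []
        for _ in range(n - 1):
            # step 3: O(n) scan of the row minima for the globally closest pair
            i = min((c for c in range(n) if active[c]), key=lambda c: nearest[c][0])
            d, j = nearest[i]
            merges.append((i, j, d))
            active[j] = False                          # step 4: merge j into i
            for k in range(n):
                if active[k] and k != i:
                    # step 5: single-link update; entries can only shrink,
                    # so the stored row minima stay valid
                    D[i][k] = D[k][i] = min(D[i][k], D[j][k])
                    if nearest[k][1] == j:             # redirect minima that pointed to j
                        nearest[k] = (nearest[k][0], i)
            others = [k for k in range(n) if active[k] and k != i]
            if others:                                 # recompute the merged row's minimum
                nearest[i] = min((D[i][k], k) for k in others)
        return merges

The step numbers in the comments refer to the list above.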
Now let’s consider the same algorithm with complete link as the distance function. This means that to calculate the distance between two clusters, we look at their farthest pair of points; otherwise the algorithm remains the same. In other words, when rows i and j are merged, for single link each new entry is min(d_ik, d_jk), whereas for complete link it is max(d_ik, d_jk). This, however, introduces a complication. Under single link the merged entry can never exceed the old entries, so every stored row minimum remains valid; under complete link the merged entry can grow, so if a row’s minimum came from one of the columns that merged, we need to rescan that row to recalculate its minimum, which hinders our performance. The best algorithm Dr. Ruzzo has come across runs in O(n² log n), but an O(n²) algorithm may exist.
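To see the complication concretely, here is a tiny illustration (the helper is hypothetical) of how one row is updated under each linkage rule; only the complete-link case can invalidate the stored row minimum:

    def update_row_after_merge(row, i, j, linkage):
        """Distances from one cluster k to the others, after clusters i and j merge into i."""
        d_new = min(row[i], row[j]) if linkage == "single" else max(row[i], row[j])
        row = {c: d for c, d in row.items() if c != j}
        row[i] = d_new
        return row

    row_k = {0: 4.0, 1: 2.0, 2: 7.0}   # distances from cluster k to clusters 0, 1, 2
    print(update_row_after_merge(dict(row_k), 0, 1, "single"))    # {0: 2.0, 2: 7.0}: min is still 2.0
    print(update_row_after_merge(dict(row_k), 0, 1, "complete"))  # {0: 4.0, 2: 7.0}: old min 2.0 is gone,
                                                                  # so the row must be rescanned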
The algorithm uses a data structure called a heap, which supports insert and delete-min in O(log n) time and find-min in O(1) time.
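In Python, for example, the standard heapq module provides exactly these operations (a quick sketch):

    import heapq

    h = []
    for d in (5.0, 2.0, 9.0, 1.0):
        heapq.heappush(h, d)          # insert: O(log n)
    print(h[0])                       # find-min: O(1)       -> 1.0
    print(heapq.heappop(h))           # delete-min: O(log n) -> 1.0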
Using the heap structure, the algorithm is changed so that each row keeps its distances in a heap; a row’s minimum can then be found or repaired in O(log n) instead of with an O(n) rescan (see the sketch after the notes below).
⇒ This algorithm performs in O(n² log n)
Note 1: this algorithm has an O(n²) memory requirement, which is nasty. This is because for each of the n rows we keep a heap of size n that we use to insert and find minima quickly, together with an array to correlate the minimum back to its row.
Note 2: if we have d dimensions, the performance becomes O(n²d + n² log n), where the n²d term is the cost of computing the initial pairwise distances.
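Tying the pieces together, here is one way the heap idea can be implemented for complete link, as a Python sketch assuming one heap per row with lazy deletion of stale entries (the lecture did not spell out the implementation, so the structure and names here are illustrative):

    import heapq

    def complete_link_hag(D):
        """Complete-link clustering with one heap per row: O(n^2 log n) time, O(n^2) memory."""
        n = len(D)
        D = [row[:] for row in D]
        active = [True] * n
        # heap[i] holds (distance, j) candidates for row i; some entries may be stale
        heap = [[(D[i][j], j) for j in range(n) if j != i] for i in range(n)]
        for h in heap:
            heapq.heapify(h)

        def row_min(i):
            """Discard stale entries until the top of heap[i] matches the current matrix."""
            h = heap[i]
            while h and (not active[h[0][1]] or h[0][0] != D[i][h[0][1]]):
                heapq.heappop(h)
            return h[0] if h else (float("inf"), -1)

        merges = []
        for _ in range(n - 1):
            # O(n) rows to scan; heap maintenance costs O(log n) per entry, amortized
            i = min((c for c in range(n) if active[c]), key=lambda c: row_min(c)[0])
            d, j = row_min(i)
            merges.append((i, j, d))
            active[j] = False
            for k in range(n):
                if active[k] and k != i:
                    # complete link: the merged cluster keeps the larger distance
                    D[i][k] = D[k][i] = max(D[i][k], D[j][k])
                    heapq.heappush(heap[k], (D[k][i], i))   # fresh candidate for row k
                    heapq.heappush(heap[i], (D[i][k], k))   # and for the merged row
        return merges

The heaps hold Θ(n²) entries in total, which is where the n² memory in Note 1 comes from (on top of the distance matrix itself).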