Search Trees

Merge sort, binary search trees, and tries.

Merge Sort
Binary Search Trees
Tries
1. Ternary search trees
2. Memory hierarchy

Merge Sort

Analyze the runtime of a recursive algorithm using recurrences.
Trace the recursive execution of merge sort.
Identify a run of merge sort on an array.

Sequential search returns the index of an element in an array in worst case linear time by scanning across the array and comparing each element to the target. Although linear time algorithms are much more efficient than quadratic time algorithms, there are many situations where we need to make a large number of searches on a large amount of data.

A linear time algorithm like worst case sequential search might have the following real-world runtimes.

When N = 1 million, about 1 second.
When N = 10 million, about 10 seconds.
When N = 100 million, about 100 seconds.

Imagine how differently we would interact with technologies if search results took 100 or more seconds to process. Sorting elements is one way to enable more efficient searching.

Binary search returns the index of an element in a sorted array in worst case logarithmic time by using the sorted order to discard half the remaining elements after each comparison. Instead of checking each element one-by-one from left-to-right in the array, binary search instead starts at the middle of the current problem and compares to the middle element to decide whether the proceed left or right.

public static int binarySearch(int[] sorted, int target) {
    return binarySearch(sorted, target, 0, sorted.length);
}

private static int binarySearch(int[] sorted, int target, int low, int high) {
    if (low > high) {
        return -1;
    }
    int N = high - low;
    int mid = low + (high - low) / 2;
    if (sorted[mid] < target) {
        return binarySearch(sorted, target, mid + 1, high);
    } else if (sorted[mid] > target) {
        return binarySearch(sorted, target, low, mid - 1);
    } else {
        return mid;
    }
}

When does the best case runtime occur for binary search?

In the best case, the target is the exact middle element in the sorted array, which can be found in constant time.

How about the worst case? In the worst case, we don’t find the target immediately. In fact, we might not even find the target at all in the array, and instead return -1. In each recursive call, half the remaining elements under consideration can be discarded. In other words, the number of recursive calls is given by the answer to the question:

How many times do we need to divide N by 2 until only 1 element remains?

This is the definition of the base-2 logarithm, often written as either log₂ or the shorthand lg. Therefore, the worst case order of growth for the runtime for binarySearch is logarithmic with respect to N, the number of elements that still need to be examined. Initially, all the elements need to be examined. But as the recursion proceeds, fewer and fewer items remain until we find the target or report that it’s not found by returning -1.

How long would a logarithmic time algorithm take to process large amounts of data? Let’s compute the base-2 logarithm for big data.

When N = 1 million, log₂ 1 million is about 20.
When N = 10 million, log₂ 10 million is about 23.
When N = 100 million, log₂ 100 million is about 27.

Recurrences

The runtime of a recursive program like binary search can be more effectively modeled using recurrence relations. Recurrence relations (aka recurrences) are recursive equations that represent the order of growth for a function in two parts: (1) non-recursive work and (2) recursive work. A recurrence relation describing the worst-case asymptotic runtime for binary search is T(N) = T(N / 2) + 1.

T(N) =: On the left hand side of the equals sign, T(N) refers to the runtime for binary search in terms of the size of the current subproblem, N. In the code above, note that N = high - low.
T(N / 2) + 1: On the right hand side of the equals sign, there are two components. The recursive work is given by the expression T(N / 2) because binary search makes a single recursive call to itself on a subproblem of half the size. The non-recursive work is given by the expression 1 because, ignoring the recursive calls, binary search spends a constant amount of time comparing sorted[mid] < target.

One way to solve a recurrence is by unrolling the recurrence, an approach that works by plugging the recurrence back into itself.

T(N) = T(N / 2) + 1
T(N) = [T(N / 4) + 1] + 1
T(N) = [[T(N / 8) + 1] + 1] + 1
T(N) = [[[T(N / 16) + 1] + 1] + 1] + 1

We can start to see a pattern emerge. The recursive work will eventually go down to T(1) when it calls the base case. Along the way to the base case, a lot of 1’s are added together. How many times are we comparing sorted[mid] to the target? Put another way, we can ask: How many times do we need to divide N by 2 in order to reduce it to the number 1? This is the base-2 logarithm again!

Recursive merge sort

Both selection sort and insertion sort have worst case quadratic order of growth, and relied on iterative improvement where a sorted elements portion was maintained at the front of the sequence at all times. If recursion was able to speed-up searching, perhaps it can also speed up sorting too.

Merge sort represents a different approach to sorting based on the idea of recursion. Merge sort’s name derives from the merge operation, which takes two sorted arrays and returns a sorted result containing all the elements in both arrays.

Merge sort is a recursive sorting algorithm that can be described as follows.

If the array is of size 1, return.
Recursively merge sort the left half.
Recursively merge sort the right half.
Merge the two sorted halves.

The recurrence relation for the runtime of merge sort can be given as T(N) = T(N / 2) + T(N / 2) + N + 1. The first recursive call to T(N / 2) represents the time it takes to merge sort the left half while the second call represents the time it takes to merge sort the right half. But we could also state it more simply as just T(N) = 2T(N / 2) + N because the runtime is the same regardless of which half we’re discussing.

Recurrence diagrams

Unrolling this equation ends up producing a mess of parentheses and brackets, so let’s try drawing a recurrence diagram to explain what’s going on and use visuals to help organize our analysis. In a recurrence diagram, the recursive work is drawn as nodes (with the current subproblem depicted inside the node) while the non-recursive work is drawn beside the node. If we can add up all of the non-recursive work (numbers outside the node), then we’ve computed the total work done by the algorithm.

Here’s an example of a recurrence diagram for merge sort on a 64-element array.

The top layer takes about 64 units of time merging 2 sorted halves of 32 elements each.
The second layer also takes about 64 units of time merging 4 sorted halves of 16 elements each.
The third layer also takes about 64 units of time merging 8 sorted halves of 8 elements each.
And on and on until the algorithm reaches its base case.

Merge sort recurrence diagram

By identifying the pattern in the recurrence diagram, we can see that all the nodes that are on the same layer will take about 64 units of time. Since the entire runtime of merge sort is represented by this diagram, we can find the total time spent by multiplying the number of layers by the time spent on each layer. If we think about the problem more generally in terms of the size of the current subproblem N, then:

Each layer divides N in half until the base case of 1 element. Therefore, there are log₂ N layers.
Each layer takes about N total non-recursive time.
Therefore, the runtime of merge sort is in Θ(N log N).

We call this N log N order of growth linearithmic (a portmanteau of “linear” and “logarithmic”).

Binary Search Trees

BST.java

Identify how the properties of sorting algorithms inform BST invariants.
Identify a best/worst-case height BST insertion order for given elements.
Trace the search and insertion procedures in a BST.

A sorted array enables logarithmic-time algorithms like binary search. To determine if a certain element is in a sorted array containing 100,000,000 elements, binary search only needed log₂ 100,000,000, or about 27 comparisons! The combination of sorting algorithms and searching algorithms can enable efficient features like autocomplete that shape how people make sense of the world.

Let’s say we wanted to implement the autocomplete feature in Husky Maps. One approach is to implement a BinarySearchAutocomplete data type.

Representation: Potential search results are stored in a sorted array. In Husky Maps, this contains the names of all the places near the UW Seattle campus.
Functionality: The addAll method adds new elements to the data type. After the elements are added, the entire array is re-sorted to maintain the sorted representation.; The allMatches method returns all the elements that start with the given prefix. Since the terms are stored in a sorted array, we can use binary search to find the first term that starts with the prefix in the sorted array and then iterate to the right to collect the all the remaining prefix-matching terms.

BinarySearchAutocomplete provides a very efficient allMatches, but needs to spend a lot of time sorting all the elements in the data type after each call to addAll. Even if we add only one element, putting that element in the right place in the sorted array takes at least linear time. If addAll is called infrequently, this might not be a problem. But, in real-world mapping applications, we might need to respond to new information about the world. How might we design a more robust data type that can not only find elements efficiently, but also add or remove elements efficiently too?

When designing data structure representations, computer scientists often compare arrays and nodes.

Arrays: Enable efficient access to elements by index because each array access takes constant time no matter where the element is located inside the array. For the same reason, it’s also efficient to change a single element in an array if we have its index.
Nodes: Almost the opposite of arrays in runtime. Nodes are inefficient at accessing elements by index, but efficient at adding or removing elements without having to shift over all other elements so long as there’s a reference to the node that needs to change.

The binary search tree (BST) is a hierarchical node-based data structure that aims to combine the efficient binary search algorithm (a strength of arrays) with efficient operations for adding or removing elements (a strength of nodes). Each node in a binary search tree has a left and right child node, where all the elements to the left are less than the current element and all the elements to the right are greater than the current element. Binary search trees are commonly used to implement sets or maps.

Set abstract data type: A collection of unique elements or “keys”. Unlike lists and deques, sets do not maintain indices for each element, which enables more efficient data structure implementations.
Map abstract data type: A collection that associates each unique key with a value. Maps are like sets except each key can have a (not necessarily unique) value.

The letters in the tree are arranged in sorted order when flattened vertically: “A” is the leftmost letter in the tree while “G” is the rightmost letter in the tree. At any given node, we can decide whether to go left or right and ignore up to half the remaining elements. Just as binary search was able to find an element by recursively narrowing-down the search window, a binary search tree also shrinks its search window when considering each node. Elements can be added to a binary search tree as a new leaf in its sorted position within the tree.

The runtime of binary search tree methods depend on the height of the tree, or the number of references on the path from the overall root to the deepest leaf node. For example, the ABCDEFG tree depicted above has a height of 2 because there are only two edges from the overall root to any of the deepest leaf nodes. In this case, the height of the binary search tree (height 2) is logarithmic with respect to the total size of the tree (size 7). But BSTs are not necessarily balanced.

Open the VisuAlgo module to visualize binary search trees. Press Esc to exit the e-Lecture Mode and use Create from the bottom left menu. Try creating all of the different types of trees and comparing their properties. Try searching or inserting an element into some of the trees.
Binary Search Tree Visualization

But our asymptotic analysis of binary search trees ignores one potentially important cost: the time it takes to compare two strings by calling compareTo. We’ve assumed that the time it takes to compare any two elements is always constant and not a factor in the asymptotic runtime analysis. In a computer, characters are represented as numbers: the character ‘a’ has the numeric value 97, for example. Computers can compare these the numeric values for each character using the <, ==, and > operators in constant time.

If comparing characters takes constant time, why might string comparison take longer?

Strings can be composed of many characters. Given two strings, Java checks both strings character-by-character until the first difference or the end of the strings. Even though each character comparison may take constant time, these comparisons may need to be repeated through the length of the string. For example, DNA strings that store genetic information for organisms are composed of a large number of base pairs represented by the letters A, C, G, and T. In this situation, we might need to store a large number of very long and very similar-looking strings.

To enable efficient DNA search, we’ll need more specialized data structures designed for handling strings.

Tries

TrieST.java TST.java

Identify how the properties of sorting algorithms inform trie invariants.
Trace the search and insertion procedures in a TST and a trie.
Explain the TST and trie collection and traversal algorithms.

The trie (pronounced “try”) data structure is a specialized tree designed for storing string data by subdividing strings into individual characters. Whereas each node in a binary search tree represents an entire element, each node in a trie represents a single character within a string. To indicate that a node represents a complete word: in a trie map the value associated with the node is null; in a trie set the value associated with the node is true.

Open the Algorithm Visualizations module to visualize tries. Insert words and predict how the data structure will change.
Trie Visualization

The number of children per trie node can vary and depends on the size of the alphabet, which we often call R. In the case of English, there are 26 lowercase letters and 26 uppercase letters, so each node can maintain its own R = 52 length array with references to other nodes. By splitting words into individual characters, tries bound the time to search for a single string to the length of the string rather than the total number of strings.

Tries, however, may need to consume a lot of memory space in order to store such a large number of R-length arrays. The unicode alphabet that represents all possible char values has R = 65536. In other words, each node in this kind of trie would need to maintain its own 65536-length array. How might we reduce the memory space usage?

Ternary search trees

The ternary search tree (TST) is a specialized data structure designed for storing string data that combine ideas from tries and search trees. Just like in a trie, TSTs also subdivide each string into individual characters, giving each character its own node. Whereas a trie can have up to 65536 non-null children, TST nodes can only have up to 3 non-null children:

Left child: All strings not using the current character, and before the current string in the alphabet.
Middle child: All strings using the current character.
Right child: All strings not using the current character, and after the current string in the alphabet.

Open the Algorithm Visualizations module to visualize ternary search trees. Insert words and predict how the data structure will change.
Ternary Search Tree Visualization
Note that the visualization differs from the slides in how it marks complete words by going down one more time and creating an an extra node.

Memory hierarchy

Are tries actually more efficient than binary search trees in practice?

One limitation of asymptotic analysis is that it doesn’t help us understand how computer hardware actually runs a program. In asymptotic analysis, we assume that each individual operation takes about the same time. Although each operation in a real computer might take constant time, there can still be some dramatic differences when running a program on real computer hardware.

There are 3 key components to take away from the memory hierarchy.

Cache: Temporary space for data in a running program that may be needed soon. CPUs have a physical cache that is extremely fast, extremely small (just a couple megabytes of storage), and relatively expensive to produce. Cache is cleared when your computer shuts down.
RAM (main memory): Temporary space for your running programs that is much larger than caches. Many of today’s computers come with 8 or 16 gigabytes of RAM. However, RAM is typically 10 or 100 times slower than cache. RAM is cleared when your computer shuts down.
Disk (storage): Permanent storage space for your files and programs. It’s often much larger than main memory (usually 256GB or larger) but another 10, 100, 1000, or more times slower than RAM.

Programmers rarely have direct control over how data flows between these 3 components of a computer. Instead, these resources are typically managed by your computer’s operating system and hardware. Memory and caching is designed around two general principles that attempt to predict the program data that will be needed next.

Spatial locality: The tendency for programs to access locations nearby to recent locations. For example, iteration patterns often continue in the same direction so spatial locality suggests that adjacent or nearby memory addresses will be accessed next.
Temporal locality: The tendency for programs to access data that was recently accessed. For example, variables that are most recently declared such as loop variables are probably more likely to be accessed than variables used a long time ago.

Writing memory- and cache-efficient programs requires an understanding of how these systems interact according to general principles.