Search Trees
Binary, ternary, and trie.
Binary Search Trees
- Identify how the properties of sorting algorithms inform BST invariants.
- Identify a best/worst-case height BST insertion order for given elements.
- Trace the search and insertion procedures in a BST.
A sorted array enables logarithmic-time algorithms like binary search. To determine if a certain element is in a sorted array containing 100,000,000 elements, binary search only needed log2 100,000,000, or about 27 comparisons! The combination of sorting algorithms and searching algorithms can enable efficient features like autocomplete that shape how people make sense of the world.
Let’s say we wanted to implement the autocomplete feature in Husky Maps. One approach is to implement a BinarySearchAutocomplete
data type.
- Representation
- Potential search results are stored in a sorted array. In Husky Maps, this contains the names of all the places near the UW Seattle campus.
- Functionality
- The
addAll
method adds new elements to the data type. After the elements are added, the entire array is re-sorted to maintain the sorted representation.- The
allMatches
method returns all the elements that start with the given prefix. Since the terms are stored in a sorted array, we can use binary search to find the first term that starts with the prefix in the sorted array and then iterate to the right to collect the all the remaining prefix-matching terms. - The
BinarySearchAutocomplete
provides a very efficient allMatches
, but needs to spend a lot of time sorting all the elements in the data type after each call to addAll
. Even if we add only one element, putting that element in the right place in the sorted array takes at least linear time. If addAll
is called infrequently, this might not be a problem. But, in real-world mapping applications, we might need to respond to new information about the world. How might we design a more robust data type that can not only find elements efficiently, but also add or remove elements efficiently too?
When designing data structure representations, computer scientists often compare arrays and nodes.
- Arrays
- Enable efficient access to elements by index because each array access takes constant time no matter where the element is located inside the array. For the same reason, it’s also efficient to change a single element in an array if we have its index.
- Nodes
- Almost the opposite of arrays in runtime. Nodes are inefficient at accessing elements by index, but efficient at adding or removing elements without having to shift over all other elements so long as there’s a reference to the node that needs to change.
The binary search tree (BST) is a hierarchical node-based data structure that aims to combine the efficient binary search algorithm (a strength of arrays) with efficient operations for adding or removing elements (a strength of nodes). Each node in a binary search tree has a left and right child node, where all the elements to the left are less than the current element and all the elements to the right are greater than the current element. Binary search trees are commonly used to implement sets or maps.
- Set abstract data type
- A collection of unique elements or “keys”. Unlike lists and deques, sets do not maintain indices for each element, which enables more efficient data structure implementations.
- Map abstract data type
- A collection that associates each unique key with a value. Maps are like sets except each key can have a (not necessarily unique) value.
The letters in the tree are arranged in sorted order when flattened vertically: “A” is the leftmost letter in the tree while “G” is the rightmost letter in the tree. At any given node, we can decide whether to go left or right and ignore up to half the remaining elements. Just as binary search was able to find an element by recursively narrowing-down the search window, a binary search tree also shrinks its search window when considering each node. Elements can be added to a binary search tree as a new leaf in its sorted position within the tree.
The runtime of binary search tree methods depend on the height of the tree, or the number of references on the path from the overall root to the deepest leaf node. For example, the ABCDEFG tree depicted above has a height of 2 because there are only two edges from the overall root to any of the deepest leaf nodes. In this case, the height of the binary search tree (height 2) is logarithmic with respect to the total size of the tree (size 7). But BSTs are not necessarily balanced.
Open the VisuAlgo module to visualize binary search trees. Press
Esc
to exit the e-Lecture Mode and use Create from the bottom left menu. Try creating all of the different types of trees and comparing their properties. Try searching or inserting an element into some of the trees.
But our asymptotic analysis of binary search trees ignores one potentially important cost: the time it takes to compare two strings by calling compareTo
. We’ve assumed that the time it takes to compare any two elements is always constant and not a factor in the asymptotic runtime analysis. In a computer, characters are represented as numbers: the character ‘a’ has the numeric value 97, for example. Computers can compare these the numeric values for each character using the <
, ==
, and >
operators in constant time.
If comparing characters takes constant time, why might string comparison take longer?
Strings can be composed of many characters. Given two strings, Java checks both strings character-by-character until the first difference or the end of the strings. Even though each character comparison may take constant time, these comparisons may need to be repeated through the length of the string. For example, DNA strings that store genetic information for organisms are composed of a large number of base pairs represented by the letters A, C, G, and T. In this situation, we might need to store a large number of very long and very similar-looking strings.
To enable efficient DNA search, we’ll need more specialized data structures designed for handling strings.
Tries
- Identify how the properties of sorting algorithms inform trie invariants.
- Trace the search and insertion procedures in a TST and a trie.
- Explain the TST and trie collection and traversal algorithms.
The trie (pronounced “try”) data structure is a specialized tree designed for storing string data by subdividing strings into individual characters. Whereas each node in a binary search tree represents an entire element, each node in a trie represents a single character within a string. To indicate that a node represents a complete word: in a trie map the value associated with the node is null; in a trie set the value associated with the node is true.
Open the Algorithm Visualizations module to visualize tries. Insert words and predict how the data structure will change.
The number of children per trie node can vary and depends on the size of the alphabet, which we often call R. In the case of English, there are 26 lowercase letters and 26 uppercase letters, so each node can maintain its own R = 52 length array with references to other nodes. By splitting words into individual characters, tries bound the time to search for a single string to the length of the string rather than the total number of strings.
Tries, however, may need to consume a lot of memory space in order to store such a large number of R-length arrays. The unicode alphabet that represents all possible char
values has R = 65536. In other words, each node in this kind of trie would need to maintain its own 65536-length array. How might we reduce the memory space usage?
Ternary search trees
The ternary search tree (TST) is a specialized data structure designed for storing string data that combine ideas from tries and search trees. Just like in a trie, TSTs also subdivide each string into individual characters, giving each character its own node. Whereas a trie can have up to 65536 non-null children, TST nodes can only have up to 3 non-null children:
- Left child
- All strings not using the current character, and before the current string in the alphabet.
- Middle child
- All strings using the current character.
- Right child
- All strings not using the current character, and after the current string in the alphabet.
Open the Algorithm Visualizations module to visualize ternary search trees. Insert words and predict how the data structure will change.
Ternary Search Tree Visualization
Note that the visualization differs from the slides in how it marks complete words by going down one more time and creating an an extra node.
Memory hierarchy
Are tries actually more efficient than binary search trees in practice?
One limitation of asymptotic analysis is that it doesn’t help us understand how computer hardware actually runs a program. In asymptotic analysis, we assume that each individual operation takes about the same time. Although each operation in a real computer might take constant time, there can still be some dramatic differences when running a program on real computer hardware.
There are 3 key components to take away from the memory hierarchy.
- Cache
- Temporary space for data in a running program that may be needed soon. CPUs have a physical cache that is extremely fast, extremely small (just a couple megabytes of storage), and relatively expensive to produce. Cache is cleared when your computer shuts down.
- RAM (main memory)
- Temporary space for your running programs that is much larger than caches. Many of today’s computers come with 8 or 16 gigabytes of RAM. However, RAM is typically 10 or 100 times slower than cache. RAM is cleared when your computer shuts down.
- Disk (storage)
- Permanent storage space for your files and programs. It’s often much larger than main memory (usually 256GB or larger) but another 10, 100, 1000, or more times slower than RAM.
Programmers rarely have direct control over how data flows between these 3 components of a computer. Instead, these resources are typically managed by your computer’s operating system and hardware. Memory and caching is designed around two general principles that attempt to predict the program data that will be needed next.
- Spatial locality
- The tendency for programs to access locations nearby to recent locations. For example, iteration patterns often continue in the same direction so spatial locality suggests that adjacent or nearby memory addresses will be accessed next.
- Temporal locality
- The tendency for programs to access data that was recently accessed. For example, variables that are most recently declared such as loop variables are probably more likely to be accessed than variables used a long time ago.
Writing memory- and cache-efficient programs requires an understanding of how these systems interact according to general principles.