Search Trees
Table of contents
- Sets and maps
- Binary trees
- Binary search trees
- BST demos
- BST invariant
- Ternary search trees
- TST
- TST contains
- TST invariant
- TST construction demo
- BST insertion
- TST implementations
- BST vs TST
Sets and maps
Whereas the list abstract data type (ADT) represents an indexed collection of elements, the set ADT represents an unindexed collection of unique elements.
add
an element to the set.contains
to return whether or not a value is in the set.remove
and return an element from the set.size
to return the number of elements in the set.
The map ADT represents an unindexed collection of key–value entries. Entries are accessed by unique key. Each unique key is associated with a value that might not be unique.
containsKey
returns true if the givenkey
is in the map.get
returns the value associated with the givenkey
, or null if thekey
is not in the map.put
associates the givenkey
with the givenvalue
.remove
to remove the entry for the given key and return its associated value.
Since set and map clients cannot access elements by index (element positions), set and map implementers can organize data in creative ways. When elements are stored in a linear sequence one after the other as in an ArrayList
or LinkedList
, the algorithm for implementing contains
needs to perform a linear search across potentially all of the elements because the target element could be stored anywhere in the sequence. In this lesson, we’ll introduce the search tree data structure concept as a way to store and access elements more efficiently than by storing them sequentially.
- Goal
- Develop a faster data structure for implementing the set and map ADTs.
Binary trees
Binary trees are a linked node data structure where each node maintains references to a left
and a right
child. Each node in a binary tree has either 0, 1, or 2 children.
public class IntTree {
private IntTreeNode overallRoot;
private static class IntTreeNode {
public int data;
public IntTreeNode left;
public IntTreeNode right;
}
}
Like linked nodes, each IntTreeNode
has a structural-recursive definition. A binary tree is either empty or non-empty.
- Empty binary tree
- The root
IntTreeNode
is null. - Non-empty binary tree
- The root
IntTreeNode
contains somedata
, aleft
binary subtree, and aright
binary subtree.
Binary search trees
Binary trees alone do not provide a speed advantage over linear search because elements can still be stored anywhere in a binary tree. As a result, contains
on a binary tree might need to check the entire tree.
- Linear search
- A linear-time algorithm for finding an element in a data structure by comparing one element each time.
- Binary search
- A logarithmic-time algorithm for finding an element in a sorted range by comparing the middle element in the range and using the comparison result to strategically ignore half of the remaining elements in the range each time. (Much faster than linear search.)
Unfortunately, arrays are fixed in size, making it slow to add new elements to a sorted array. Thus, inserting or removing from a sorted array set is a linear-time operation.
- Goal
- Develop a set data structure that not only yields logarithmic-time
contains
, but also logarithmic-timeadd
andremove
.
Why is binary search in a sorted linked list so much slower than in a sorted array?
While arrays provide fast random access to the element at a given index, linked lists need to iteratively/recursively traverse to the given index.
To speed up binary search on a sorted linked list, first make the middle node the overall root. Then, to reach elements in the left half of the data structure, flip the direction of the edges. Finally, apply the edge-flipping rule recursively to model the way binary search would explore the possible spaces.
Binary search trees enhance the binary tree data structure by introducing the binary search tree invariant.
- Binary search tree invariant
- All elements in the left subtree are less than the node’s data.
- All elements in the right subtree are greater than the node’s data.
This definition is recursive: in a valid binary search tree, the binary search tree invariant must hold for every subtree. Binary search trees organize data such that they’re lexicographically-ordered (sorted) within the tree!
// pre : Binary search tree invariant.
// post: Returns true if the value is in this tree, false otherwise.
public boolean contains(int value) {
return contains(overallRoot, value);
}
private boolean contains(IntTreeNode root, int value) {
if (root == null) {
return false;
} else if (root.data == value) {
return true;
} else if (value < root.data) {
return contains(root.left, value);
} else {
return contains(root.right, value);
}
}
BST demos
BST invariant
Binary search trees represent items based on the lexicographic ordering of elements. The binary search tree invariants tells us that each decision to either go left or right bounds the possible values in each subtree. This is also how binary search on a sorted array allows the algorithm implementer to ignore half of the remaining elements on each iteration!
Question 1
Consider the following binary search tree storing emoji character data. Reorder the emoji into lexicographic (sorted) order, where the first emoji is the smallest and the last emoji is the largest. (Emoji icons may not appear exactly as shown in the diagram.)
Explanation
Since binary search trees are analogous to a sorted array (for the binary search algorithm), an in-order traversal of a binary search tree yields the underlying sorted structure of the elements. This is why iterating over a TreeSet
displays each element in sorted order.
- In-order traversal
- Recursively traverse the left subtree.
- Print the element stored at the current node.
- Recursively traverse the right subtree.
Question 2
Consider the following binary search tree with the emoji data for some nodes swapped out with placeholders A, B, C, D, and E. If we have an emoji character that is lexicographically between the heart emoji ❤️ and the lightbulb emoji 💡, which nodes could possibly contain it?
Explanation
If the heart and lightbulb serve as lower and upper bounds, then the in-order traversal states that C and D are the only elements between the two bounds. You can also imagine “squishing” the tree vertically so that all of the items are laid out flat.
Ternary search trees
What are the drawbacks of inserting new values as leaf nodes in a BST?
- Degenerate trees
- Different insertion orders can result in differently-shaped binary search trees. In the very worst-case, each node might only have either a left or a right child. We end up with the same runtime problems that we experienced with a sorted linked list!
- ADT assumptions
- Each element in a
Set
orMap
is a reference to a unique object in memory. But in a specialized application such as DNA storage, each DNA sequence may share repeated subsequences. Storing many long DNA sequences as distinct strings could lead to anOutOfMemoryError
.
In the following weeks, we’ll address degenerate trees by introducing invariants that ensure balanced trees regardless of the insertion order. Today, we’ll introduce the ternary search tree (TST), a specialized data structure for implementing string-only sets and maps. Instead of storing a reference to a distinct string, each node in a TST only stores a single character as data. The character-by-character path from the overall root to a node in the TST represents each string in the set or map.
- Ternary search tree invariant
- All elements in the left subtree are less than (not using) the node’s data.
- All elements in the middle subtree are equal to (using) the node’s data.
- All elements in the right subtree are greater than (not using) the node’s data.
- All elements in the middle subtree are equal to (using) the node’s data.
TST
TST contains
TST invariant
- Robert Sedgewick and Kevin Wayne. Tries. In COS 226 Fall 2020. https://www.cs.princeton.edu/courses/archive/fall20/cos226/lectures/52Tries.pdf
To implement contains(String value)
in a TST, compare the character at the current index in the given value
to the node data and follow links accordingly:
- If less, go left.
- If greater, go right.
- If equal: if it is the last character, return the value of the mark; otherwise, go middle and advance to the next index.
Since not all nodes in the TST represent completed strings, each node contains a boolean (depicted in the diagram with an asterisk) to mark whether or not the path to that node represents a term in the set or map. An in-order traversal of the above TST would yield the following strings.
- are
- by
- sea
- sells
- she
- shells
- shore
- surely
- the
Try inserting “she”, “shells”, “shore”, and “shores” into the TST algorithm visualizer. Note that the algorithm visualizer uses a different rule for the equal case and continues down one more node when on the last character.
Question 1
Which strings are stored in the subtree denoted by the curvy arrow?
- Strings that start with s
- Strings that start with se
- Strings that start with sh
- Strings that start with she
Explanation
- By going down at the overall root, we’re considering all strings starting with s.
- By going left at sh, we’re considering all strings starting with s but before sh.
- By going down at se, we’re considering all strings starting with se.
Question 2
Which node is associated with the string CAC?
Question 3
In which subtree would the string CCC be inserted in?
TST construction demo
BST insertion
In a binary search tree implementation of the set ADT, the add
method inserts new values as leaf nodes.
public void add(int value) {
overallRoot = add(overallRoot, value);
}
private IntTreeNode add(IntTreeNode root, int value) {
if (root == null) {
return new IntTreeNode(value);
} else if (value < root.data) {
root.left = add(root.left, value);
} else if (value > root.data) {
root.right = add(root.right, value);
}
return root;
}
In this problem, we’ll consider the implications of this add
implementation. Say we want to store the integers 1, 2, 3, 7, 8, 9, 5 in a binary search tree set. For each question, visualize the binary search tree and check your answer with the BST algorithm visualizer.
The height of a tree is the number of edges on the longest path from the overall root to any leaf node, so the height of a tree with just a single node is 0.
Question 1
What is the minimum possible height of a BST containing 7 unique elements?
Explanation
A perfectly balanced or “bushy” BST has height 2!
Question 2
Which of the following options are valid insertion orders for the minimum height BST containing the integers 1, 2, 3, 7, 8, 9, 5?
- 5, 2, 1, 3, 8, 7, 9
- 5, 8, 2, 7, 9, 3, 1
- 5, 1, 3, 2, 8, 7, 9
- 7, 3, 1, 2, 8, 9, 5
- 5, 2, 8, 9, 7, 1, 3
Explanation
Since values are added as new leaf nodes, we need to make sure to fill the tree in a top-ish to bottom-ish order. So all of the parents need to be inserted before their children!
Question 3
What is the maximum possible height of a BST containing 7 unique elements?
Explanation
A “spindly” or “degenerate” BST that looks like a linked list has height 6.
Question 4
Which of the following options are valid insertion orders for a maximum height BST containing the integers 1, 2, 3, 7, 8, 9, 5? (Note that there are many possible maximum-height BSTs!)
- 1, 2, 3, 5, 7, 8, 9
- 1, 9, 2, 8, 3, 7, 5
- 9, 8, 7, 5, 3, 2, 1
- 1, 2, 3, 9, 8, 7, 5
- 8, 9, 1, 2, 3, 5, 7
TST implementations
Consider the TST class designed by Robert Sedgewick and Kevin Wayne, which implements the map ADT with a ternary search tree data structure.
Question 1
Explain in plain English how the keysWithPrefix
method computes the result.
Explain why queue.add(prefix.toString() + x.c)
occurs after collect(x.left, ...)
and before collect(x.mid, ...)
.
- Note
StringBuilder
is like anArrayList<Character>
. Use the Java Visualizer to study a particular detail of the algorithm.
Explanation
First, get
the subtree with the given prefix
. If there is no subtree with the given prefix
, the algorithm can exit early since there are no keys-with-prefix.
Then, starting at the subtree with the given prefix
, perform an in-order traversal on the TST. Strings will be built-up using the choose-explore-unchoose pattern with a StringBuilder
object.
The in-order traversal for a TST is similar to a BST, but we want to make sure to add the current node value before processing its middle child. This way, “she” will appear before “shells” in the output.
BST vs TST
We know that binary search trees (BSTs) are sensitive to insertion order. Inserting elements in sorted order results in a degenerate tree with poor runtime performance. How do TSTs compare? We know that TSTs are not designed to address this issue directly, but ternary search trees are different enough that it warrants some analysis.
Question 1
Consider the worst-possible insertion order for any 10 strings of length 3 into a BST. What is the height of the tree after these insertions?
Question 2
Consider the worst-possible insertion order for any 10 strings of length 3 into a TST. What is the height of the tree after these insertions?
Explanation
Use the TST algorithm visualization. Make sure to avoid counting the edge to empty mark nodes in the longest path.
- abc
- bcd
- cde
- def
- efg
- fgh
- ghi
- hij
- ijk
- jkl