# Search Trees

## Sets and maps

Whereas the list abstract data type (ADT) represents an indexed collection of elements, the set ADT represents an unindexed collection of unique elements.

• `add` an element to the set.
• `contains` to return whether or not a value is in the set.
• `remove` and return an element from the set.
• `size` to return the number of elements in the set.

The map ADT represents an unindexed collection of key–value entries. Entries are accessed by unique key. Each unique key is associated with a value that might not be unique.

• `containsKey` returns true if the given `key` is in the map.
• `get` returns the value associated with the given `key`, or null if the `key` is not in the map.
• `put` associates the given `key` with the given `value`.
• `remove` to remove the entry for the given key and return its associated value.

Since set and map clients cannot access elements by index (element positions), set and map implementers can organize data in creative ways. When elements are stored in a linear sequence one after the other as in an `ArrayList` or `LinkedList`, the algorithm for implementing `contains` needs to perform a linear search across potentially all of the elements because the target element could be stored anywhere in the sequence. In this lesson, we’ll introduce the search tree data structure concept as a way to store and access elements more efficiently than by storing them sequentially.

Goal
Develop a faster data structure for implementing the set and map ADTs.

## Binary trees

Binary trees are a linked node data structure where each node maintains references to a `left` and a `right` child. Each node in a binary tree has either 0, 1, or 2 children. ``````public class IntTree {
private IntTreeNode overallRoot;

private static class IntTreeNode {
public int data;
public IntTreeNode left;
public IntTreeNode right;
}
}
``````

Like linked nodes, each `IntTreeNode` has a structural-recursive definition. A binary tree is either empty or non-empty.

Empty binary tree
The root `IntTreeNode` is null.
Non-empty binary tree
The root `IntTreeNode` contains some `data`, a `left` binary subtree, and a `right` binary subtree.

## Binary search trees

Binary trees alone do not provide a speed advantage over linear search because elements can still be stored anywhere in a binary tree. As a result, `contains` on a binary tree might need to check the entire tree.

Linear search
A linear-time algorithm for finding an element in a data structure by comparing one element each time.
Binary search
A logarithmic-time algorithm for finding an element in a sorted range by comparing the middle element in the range and using the comparison result to strategically ignore half of the remaining elements in the range each time. (Much faster than linear search.)

Unfortunately, arrays are fixed in size, making it slow to add new elements to a sorted array. Thus, inserting or removing from a sorted array set is a linear-time operation.

Goal
Develop a set data structure that not only yields logarithmic-time `contains`, but also logarithmic-time `add` and `remove`.
Why is binary search in a sorted linked list so much slower than in a sorted array?

While arrays provide fast random access to the element at a given index, linked lists need to iteratively/recursively traverse to the given index.

To speed up binary search on a sorted linked list, first make the middle node the overall root. Then, to reach elements in the left half of the data structure, flip the direction of the edges. Finally, apply the edge-flipping rule recursively to model the way binary search would explore the possible spaces.

Binary search trees enhance the binary tree data structure by introducing the binary search tree invariant.

Binary search tree invariant
All elements in the left subtree are less than the node’s data.
All elements in the right subtree are greater than the node’s data.

This definition is recursive: in a valid binary search tree, the binary search tree invariant must hold for every subtree. Binary search trees organize data such that they’re lexicographically-ordered (sorted) within the tree!

``````// pre : Binary search tree invariant.
// post: Returns true if the value is in this tree, false otherwise.
public boolean contains(int value) {
return contains(overallRoot, value);
}

private boolean contains(IntTreeNode root, int value) {
if (root == null) {
return false;
} else if (root.data == value) {
return true;
} else if (value < root.data) {
return contains(root.left, value);
} else {
return contains(root.right, value);
}
}
``````

## BST invariant Binary search trees represent items based on the lexicographic ordering of elements. The binary search tree invariants tells us that each decision to either go left or right bounds the possible values in each subtree. This is also how binary search on a sorted array allows the algorithm implementer to ignore half of the remaining elements on each iteration!

### Question 1

Consider the following binary search tree storing emoji character data. Reorder the emoji into lexicographic (sorted) order, where the first emoji is the smallest and the last emoji is the largest. (Emoji icons may not appear exactly as shown in the diagram.) Explanation

Since binary search trees are analogous to a sorted array (for the binary search algorithm), an in-order traversal of a binary search tree yields the underlying sorted structure of the elements. This is why iterating over a `TreeSet` displays each element in sorted order.

In-order traversal
1. Recursively traverse the left subtree.
2. Print the element stored at the current node.
3. Recursively traverse the right subtree.

### Question 2

Consider the following binary search tree with the emoji data for some nodes swapped out with placeholders A, B, C, D, and E. If we have an emoji character that is lexicographically between the heart emoji ❤️ and the lightbulb emoji 💡, which nodes could possibly contain it? Explanation

If the heart and lightbulb serve as lower and upper bounds, then the in-order traversal states that C and D are the only elements between the two bounds. You can also imagine “squishing” the tree vertically so that all of the items are laid out flat.

## Ternary search trees

What are the drawbacks of inserting new values as leaf nodes in a BST?

Degenerate trees
Different insertion orders can result in differently-shaped binary search trees. In the very worst-case, each node might only have either a left or a right child. We end up with the same runtime problems that we experienced with a sorted linked list!
Each element in a `Set` or `Map` is a reference to a unique object in memory. But in a specialized application such as DNA storage, each DNA sequence may share repeated subsequences. Storing many long DNA sequences as distinct strings could lead to an `OutOfMemoryError`.

In the following weeks, we’ll address degenerate trees by introducing invariants that ensure balanced trees regardless of the insertion order. Today, we’ll introduce the ternary search tree (TST), a specialized data structure for implementing string-only sets and maps. Instead of storing a reference to a distinct string, each node in a TST only stores a single character as data. The character-by-character path from the overall root to a node in the TST represents each string in the set or map.

Ternary search tree invariant
All elements in the left subtree are less than (not using) the node’s data.
All elements in the middle subtree are equal to (using) the node’s data.
All elements in the right subtree are greater than (not using) the node’s data.

## TST invariant 1. Robert Sedgewick and Kevin Wayne. Tries. In COS 226 Fall 2020. https://www.cs.princeton.edu/courses/archive/fall20/cos226/lectures/52Tries.pdf

To implement `contains(String value)` in a TST, compare the character at the current index in the given `value` to the node data and follow links accordingly:

• If less, go left.
• If greater, go right.
• If equal: if it is the last character, return the value of the mark; otherwise, go middle and advance to the next index.

Since not all nodes in the TST represent completed strings, each node contains a boolean (depicted in the diagram with an asterisk) to mark whether or not the path to that node represents a term in the set or map. An in-order traversal of the above TST would yield the following strings.

• are
• by
• sea
• sells
• she
• shells
• shore
• surely
• the

Try inserting “she”, “shells”, “shore”, and “shores” into the TST algorithm visualizer. Note that the algorithm visualizer uses a different rule for the equal case and continues down one more node when on the last character.

### Question 1

Which strings are stored in the subtree denoted by the curvy arrow? Explanation
1. By going down at the overall root, we’re considering all strings starting with s.
2. By going left at sh, we’re considering all strings starting with s but before sh.
3. By going down at se, we’re considering all strings starting with se.

### Question 2

Which node is associated with the string CAC? ### Question 3

In which subtree would the string CCC be inserted in?

## BST insertion

In a binary search tree implementation of the set ADT, the `add` method inserts new values as leaf nodes.

``````public void add(int value) {
overallRoot = add(overallRoot, value);
}

private IntTreeNode add(IntTreeNode root, int value) {
if (root == null) {
return new IntTreeNode(value);
} else if (value < root.data) {
root.left = add(root.left, value);
} else if (value > root.data) {
root.right = add(root.right, value);
}
return root;
}
``````

In this problem, we’ll consider the implications of this `add` implementation. Say we want to store the integers 1, 2, 3, 7, 8, 9, 5 in a binary search tree set. For each question, visualize the binary search tree and check your answer with the BST algorithm visualizer.

The height of a tree is the number of edges on the longest path from the overall root to any leaf node, so the height of a tree with just a single node is 0.

### Question 1

What is the minimum possible height of a BST containing 7 unique elements?

Explanation

A perfectly balanced or “bushy” BST has height 2!

### Question 2

Which of the following options are valid insertion orders for the minimum height BST containing the integers 1, 2, 3, 7, 8, 9, 5?

• 5, 2, 1, 3, 8, 7, 9
• 5, 8, 2, 7, 9, 3, 1
• 5, 1, 3, 2, 8, 7, 9
• 7, 3, 1, 2, 8, 9, 5
• 5, 2, 8, 9, 7, 1, 3
Explanation

Since values are added as new leaf nodes, we need to make sure to fill the tree in a top-ish to bottom-ish order. So all of the parents need to be inserted before their children!

### Question 3

What is the maximum possible height of a BST containing 7 unique elements?

Explanation

A “spindly” or “degenerate” BST that looks like a linked list has height 6.

### Question 4

Which of the following options are valid insertion orders for a maximum height BST containing the integers 1, 2, 3, 7, 8, 9, 5? (Note that there are many possible maximum-height BSTs!)

• 1, 2, 3, 5, 7, 8, 9
• 1, 9, 2, 8, 3, 7, 5
• 9, 8, 7, 5, 3, 2, 1
• 1, 2, 3, 9, 8, 7, 5
• 8, 9, 1, 2, 3, 5, 7

## TST implementations

Consider the TST class designed by Robert Sedgewick and Kevin Wayne, which implements the map ADT with a ternary search tree data structure.

### Question 1

Explain in plain English how the `keysWithPrefix` method computes the result.

Explain why `queue.add(prefix.toString() + x.c)` occurs after `collect(x.left, ...)` and before `collect(x.mid, ...)`.

Note
`StringBuilder` is like an `ArrayList<Character>`. Use the Java Visualizer to study a particular detail of the algorithm.
Explanation

First, `get` the subtree with the given `prefix`. If there is no subtree with the given `prefix`, the algorithm can exit early since there are no keys-with-prefix.

Then, starting at the subtree with the given `prefix`, perform an in-order traversal on the TST. Strings will be built-up using the choose-explore-unchoose pattern with a `StringBuilder` object.

The in-order traversal for a TST is similar to a BST, but we want to make sure to add the current node value before processing its middle child. This way, “she” will appear before “shells” in the output.

## BST vs TST

We know that binary search trees (BSTs) are sensitive to insertion order. Inserting elements in sorted order results in a degenerate tree with poor runtime performance. How do TSTs compare? We know that TSTs are not designed to address this issue directly, but ternary search trees are different enough that it warrants some analysis.

### Question 1

Consider the worst-possible insertion order for any 10 strings of length 3 into a BST. What is the height of the tree after these insertions?

### Question 2

Consider the worst-possible insertion order for any 10 strings of length 3 into a TST. What is the height of the tree after these insertions?

Explanation

Use the TST algorithm visualization. Make sure to avoid counting the edge to empty mark nodes in the longest path.

1. abc
2. bcd
3. cde
4. def
5. efg
6. fgh
7. ghi
8. hij
9. ijk
10. jkl