Search Trees

Sets and maps
Binary trees
Binary search trees
BST demos
BST invariant
Ternary search trees
TST
TST contains
TST invariant
TST construction demo
BST insertion
TST implementations
BST vs TST

Sets and maps

Whereas the list abstract data type (ADT) represents an indexed collection of elements, the set ADT represents an unindexed collection of unique elements.

add an element to the set.
contains to return whether or not a value is in the set.
remove and return an element from the set.
size to return the number of elements in the set.

The map ADT represents an unindexed collection of key–value entries. Entries are accessed by unique key. Each unique key is associated with a value that might not be unique.

containsKey returns true if the given key is in the map.
get returns the value associated with the given key, or null if the key is not in the map.
put associates the given key with the given value.
remove to remove the entry for the given key and return its associated value.

Since set and map clients cannot access elements by index (element positions), set and map implementers can organize data in creative ways. When elements are stored in a linear sequence one after the other as in an ArrayList or LinkedList, the algorithm for implementing contains needs to perform a linear search across potentially all of the elements because the target element could be stored anywhere in the sequence. In this lesson, we’ll introduce the search tree data structure concept as a way to store and access elements more efficiently than by storing them sequentially.

Goal: Develop a faster data structure for implementing the set and map ADTs.

Binary trees

Binary trees are a linked node data structure where each node maintains references to a left and a right child. Each node in a binary tree has either 0, 1, or 2 children.

Tree Terminology

public class IntTree {
    private IntTreeNode overallRoot;

    private static class IntTreeNode {
        public int data;
        public IntTreeNode left;
        public IntTreeNode right;
    }
}

Like linked nodes, each IntTreeNode has a structural-recursive definition. A binary tree is either empty or non-empty.

Empty binary tree: The root IntTreeNode is null.
Non-empty binary tree: The root IntTreeNode contains some data, a left binary subtree, and a right binary subtree.

Binary search trees

Binary trees alone do not provide a speed advantage over linear search because elements can still be stored anywhere in a binary tree. As a result, contains on a binary tree might need to check the entire tree.

Linear search: A linear-time algorithm for finding an element in a data structure by comparing one element each time.
Binary search: A logarithmic-time algorithm for finding an element in a sorted range by comparing the middle element in the range and using the comparison result to strategically ignore half of the remaining elements in the range each time. (Much faster than linear search.)

Unfortunately, arrays are fixed in size, making it slow to add new elements to a sorted array. Thus, inserting or removing from a sorted array set is a linear-time operation.

Goal: Develop a set data structure that not only yields logarithmic-time contains, but also logarithmic-time add and remove.

Why is binary search in a sorted linked list so much slower than in a sorted array?

While arrays provide fast random access to the element at a given index, linked lists need to iteratively/recursively traverse to the given index.

To speed up binary search on a sorted linked list, first make the middle node the overall root. Then, to reach elements in the left half of the data structure, flip the direction of the edges. Finally, apply the edge-flipping rule recursively to model the way binary search would explore the possible spaces.

Binary search trees enhance the binary tree data structure by introducing the binary search tree invariant.

Binary search tree invariant: All elements in the left subtree are less than the node’s data.; All elements in the right subtree are greater than the node’s data.

This definition is recursive: in a valid binary search tree, the binary search tree invariant must hold for every subtree. Binary search trees organize data such that they’re lexicographically-ordered (sorted) within the tree!

// pre : Binary search tree invariant.
// post: Returns true if the value is in this tree, false otherwise.
public boolean contains(int value) {
    return contains(overallRoot, value);
}

private boolean contains(IntTreeNode root, int value) {
    if (root == null) {
        return false;
    } else if (root.data == value) {
        return true;
    } else if (value < root.data) {
        return contains(root.left, value);
    } else {
        return contains(root.right, value);
    }
}

BST demos

BST invariant

BST Invariant

Binary search trees represent items based on the lexicographic ordering of elements. The binary search tree invariants tells us that each decision to either go left or right bounds the possible values in each subtree. This is also how binary search on a sorted array allows the algorithm implementer to ignore half of the remaining elements on each iteration!

Question 1

Consider the following binary search tree storing emoji character data. Reorder the emoji into lexicographic (sorted) order, where the first emoji is the smallest and the last emoji is the largest. (Emoji icons may not appear exactly as shown in the diagram.)

BST Emoji Ordering

Explanation

Since binary search trees are analogous to a sorted array (for the binary search algorithm), an in-order traversal of a binary search tree yields the underlying sorted structure of the elements. This is why iterating over a TreeSet displays each element in sorted order.

In-order traversal

Recursively traverse the left subtree.
Print the element stored at the current node.
Recursively traverse the right subtree.

Question 2

Consider the following binary search tree with the emoji data for some nodes swapped out with placeholders A, B, C, D, and E. If we have an emoji character that is lexicographically between the heart emoji ❤️ and the lightbulb emoji 💡, which nodes could possibly contain it?

BST Emoji Options

Explanation

If the heart and lightbulb serve as lower and upper bounds, then the in-order traversal states that C and D are the only elements between the two bounds. You can also imagine “squishing” the tree vertically so that all of the items are laid out flat.

Ternary search trees

What are the drawbacks of inserting new values as leaf nodes in a BST?

Degenerate trees: Different insertion orders can result in differently-shaped binary search trees. In the very worst-case, each node might only have either a left or a right child. We end up with the same runtime problems that we experienced with a sorted linked list!
ADT assumptions: Each element in a Set or Map is a reference to a unique object in memory. But in a specialized application such as DNA storage, each DNA sequence may share repeated subsequences. Storing many long DNA sequences as distinct strings could lead to an OutOfMemoryError.

In the following weeks, we’ll address degenerate trees by introducing invariants that ensure balanced trees regardless of the insertion order. Today, we’ll introduce the ternary search tree (TST), a specialized data structure for implementing string-only sets and maps. Instead of storing a reference to a distinct string, each node in a TST only stores a single character as data. The character-by-character path from the overall root to a node in the TST represents each string in the set or map.

Ternary search tree invariant: All elements in the left subtree are less than (not using) the node’s data.; All elements in the middle subtree are equal to (using) the node’s data.; All elements in the right subtree are greater than (not using) the node’s data.

TST

TST contains

TST invariant

TST

Robert Sedgewick and Kevin Wayne. Tries. In COS 226 Fall 2020. https://www.cs.princeton.edu/courses/archive/fall20/cos226/lectures/52Tries.pdf

To implement contains(String value) in a TST, compare the character at the current index in the given value to the node data and follow links accordingly:

If less, go left.
If greater, go right.
If equal: if it is the last character, return the value of the mark; otherwise, go middle and advance to the next index.

Since not all nodes in the TST represent completed strings, each node contains a boolean (depicted in the diagram with an asterisk) to mark whether or not the path to that node represents a term in the set or map. An in-order traversal of the above TST would yield the following strings.

are
by
sea
sells
she
shells
shore
surely
the

Try inserting “she”, “shells”, “shore”, and “shores” into the TST algorithm visualizer. Note that the algorithm visualizer uses a different rule for the equal case and continues down one more node when on the last character.

Question 1

Which strings are stored in the subtree denoted by the curvy arrow?

TST se Prefix

Strings that start with s
Strings that start with se
Strings that start with sh
Strings that start with she

Explanation

By going down at the overall root, we’re considering all strings starting with s.
By going left at sh, we’re considering all strings starting with s but before sh.
By going down at se, we’re considering all strings starting with se.

Question 2

Which node is associated with the string CAC?

TST Example

Question 3

In which subtree would the string CCC be inserted in?

TST construction demo

BST insertion

In a binary search tree implementation of the set ADT, the add method inserts new values as leaf nodes.

public void add(int value) {
    overallRoot = add(overallRoot, value);
}

private IntTreeNode add(IntTreeNode root, int value) {
    if (root == null) {
        return new IntTreeNode(value);
    } else if (value < root.data) {
        root.left = add(root.left, value);
    } else if (value > root.data) {
        root.right = add(root.right, value);
    }
    return root;
}

In this problem, we’ll consider the implications of this add implementation. Say we want to store the integers 1, 2, 3, 7, 8, 9, 5 in a binary search tree set. For each question, visualize the binary search tree and check your answer with the BST algorithm visualizer.

The height of a tree is the number of edges on the longest path from the overall root to any leaf node, so the height of a tree with just a single node is 0.

Question 1

What is the minimum possible height of a BST containing 7 unique elements?

Explanation

A perfectly balanced or “bushy” BST has height 2!

Question 2

Which of the following options are valid insertion orders for the minimum height BST containing the integers 1, 2, 3, 7, 8, 9, 5?

Explanation

Since values are added as new leaf nodes, we need to make sure to fill the tree in a top-ish to bottom-ish order. So all of the parents need to be inserted before their children!

Question 3

What is the maximum possible height of a BST containing 7 unique elements?

Explanation

A “spindly” or “degenerate” BST that looks like a linked list has height 6.

Question 4

Which of the following options are valid insertion orders for a maximum height BST containing the integers 1, 2, 3, 7, 8, 9, 5? (Note that there are many possible maximum-height BSTs!)

TST implementations

Consider the TST class designed by Robert Sedgewick and Kevin Wayne, which implements the map ADT with a ternary search tree data structure.

Question 1

Explain in plain English how the keysWithPrefix method computes the result.

Explain why queue.add(prefix.toString() + x.c) occurs after collect(x.left, ...) and before collect(x.mid, ...).

Note: StringBuilder is like an ArrayList<Character>. Use the Java Visualizer to study a particular detail of the algorithm.

Explanation

First, get the subtree with the given prefix. If there is no subtree with the given prefix, the algorithm can exit early since there are no keys-with-prefix.

Then, starting at the subtree with the given prefix, perform an in-order traversal on the TST. Strings will be built-up using the choose-explore-unchoose pattern with a StringBuilder object.

The in-order traversal for a TST is similar to a BST, but we want to make sure to add the current node value before processing its middle child. This way, “she” will appear before “shells” in the output.

BST vs TST

We know that binary search trees (BSTs) are sensitive to insertion order. Inserting elements in sorted order results in a degenerate tree with poor runtime performance. How do TSTs compare? We know that TSTs are not designed to address this issue directly, but ternary search trees are different enough that it warrants some analysis.

Question 1

Consider the worst-possible insertion order for any 10 strings of length 3 into a BST. What is the height of the tree after these insertions?

Question 2

Consider the worst-possible insertion order for any 10 strings of length 3 into a TST. What is the height of the tree after these insertions?

Explanation

Use the TST algorithm visualization. Make sure to avoid counting the edge to empty mark nodes in the longest path.

Search Trees

Table of contents

Sets and maps

Binary trees

Binary search trees

BST demos

BST invariant

Question 1

Question 2

Ternary search trees

TST

TST contains

TST invariant

Question 1

Question 2

Question 3

TST construction demo

BST insertion

Question 1

Question 2

Question 3

Question 4

TST implementations

Question 1

BST vs TST

Question 1

Question 2