Heaps and Hashing

Binary heaps, hash tables, and counting sorts.

  1. Hash Tables
    1. Data-indexed integer set case study
    2. Data-indexed string set case study
    3. Separate chaining hash tables
  2. Counting Sorts
    1. Sorting decision tree
    2. Counting sorts and enumeration
    3. Radix sorts
    4. 3-way radix quicksort
  3. Binary Heaps
    1. Priority queue abstract data type
    2. Heap invariant
    3. Array representation

Hash Tables

SeparateChainingHashST.java

  1. Explain and trace hash table algorithms such as insertion and search.
  2. Evaluate how a hash table will behave in response to a given data type.
  3. Analyze the runtime of a hash table with a given bucket data structure.

Java’s TreeSet and TreeMap classes are implemented using red-black trees that guarantee logarithmic time for most individual element operations. This is a massive improvement over sequential search: a balanced search tree containing over 1 million elements has a height on the order of about 20!

But it turns out that we can do even better by studying Java’s HashSet and HashMap classes, which are implemented using hash tables. Under certain conditions, hash tables can have constant time for most individual element operations—regardless of whether your hash table contains 1 million, 1 billion, 1 trillion, or even more elements.

Data-indexed integer set case study

Arrays store data contiguously in memory, with each element adjacent to the next. Unlike linked lists, arrays do not need to follow references throughout memory, so indexing into an array is a constant time operation.

DataIndexedIntegerSet uses this idea of an array to implement a set of integers. It maintains a boolean[] present of length 2 billion, one index for almost every non-negative integer. The boolean value stored at each index present[x] represents whether or not the integer x is in the set.

void add(int x)
Adds x to the set by assigning present[x] = true.
boolean contains(int x)
Returns the value of present[x].
void remove(int x)
Removes x from the set by assigning present[x] = false.
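
Here is a minimal sketch of this design in Java (the class body is an assumption based on the description above; a real implementation would also need bounds checking):

public class DataIndexedIntegerSet {
    // One slot for almost every non-negative integer: about 2 GB of memory!
    private boolean[] present = new boolean[2_000_000_000];

    public void add(int x) {
        present[x] = true;
    }

    public boolean contains(int x) {
        return present[x];
    }

    public void remove(int x) {
        present[x] = false;
    }
}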

While this implementation is simple and fast, it takes a lot of memory to store the 2-billion-element present array, and it doesn’t even work with negative integers or other data types. How might we generalize this concept to support strings?

Data-indexed string set case study

Let’s design a DataIndexedStringSet using a 27-element array: 1 array index for each lowercase letter in the English alphabet starting at index 1. A word starting with the letter ‘a’ goes to index 1, ‘b’ goes to index 2, ‘c’ goes to index 3, etc. To study the behavior of this program, let’s consider these three test cases.

Test case 1:

DataIndexedStringSet set = new DataIndexedStringSet();
set.add("cat");
System.out.println(set.contains("car"));

Test case 2:

DataIndexedStringSet set = new DataIndexedStringSet();
set.add("cat");
set.add("car");
set.remove("cat");
System.out.println(set.contains("car"));

Test case 3:

DataIndexedStringSet set = new DataIndexedStringSet();
set.add("cat");
set.add("dog");
System.out.println(set.contains("dog"));

Some of the test cases behave as we would expect a set to behave, but others don’t.

Which of these test cases will behave correctly?

Only the third test case works as expected. “cat” and “car” collide because they share the same index, so changes to “cat” also affect “car” and vice versa. This is unexpected: the set should treat the two strings as distinct, but changing the state of one also changes the state of the other.

A collision occurs when two or more elements share the same index. One way we can avoid collisions in a DataIndexedStringSet is to be more careful about how we assign each English word. If we make sure every English word has its own unique index, then we can avoid collisions!

Suppose the string “a” goes to 1, “b” goes to 2, “c” goes to 3, …, and “z” goes to 26. Then the string “aa” should go to 27, “ab” to 28, “ac” to 29, and so forth. Since there are 26 letters in the English alphabet, we call this enumeration scheme base 26. Each string gets its own number, so as strings get longer and longer, the number assigned to each string grows larger and larger as well.

Using base 26 numbering, we can compute the index for “cat” as (3 ∙ 26²) + (1 ∙ 26¹) + (20 ∙ 26⁰) = 2074. Breaking down this equation, we see that “cat” is a 3-letter word so there are 3 values added together:

(3 ∙ 26²)
3 represents the letter ‘c’, which is the 3rd letter in the alphabet.
26 is raised to the 2nd power for the two subsequent letters.
(1 ∙ 26¹)
1 represents the letter ‘a’, which is the 1st letter in the alphabet.
26 is raised to the 1st power for the one subsequent letter.
(20 ∙ 26⁰)
20 represents the letter ‘t’, which is the 20th letter in the alphabet.
26 is raised to the 0th power for the zero subsequent letters.

So long as we pick a base that’s at least 26, this hash function is guaranteed to assign each lowercase English word a unique integer. But it doesn’t work with uppercase characters or punctuation. Furthermore, if we want to support languages other than English, we’ll need an even larger base. For example, there are 40,959 characters in the Chinese language alone. If we wanted to support all the possible characters in the Java built-in char type (which includes emojis, punctuation, and characters across many languages), we would need base 65536. Hash values generated this way will grow exponentially with respect to the length of the string!
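
As a concrete sketch, the base-26 computation can be written as a short loop (the method name base26 and the use of long are illustrative assumptions):

static long base26(String word) {
    long hash = 0;
    for (char c : word.toCharArray()) {
        hash = hash * 26 + (c - 'a' + 1);
    }
    // base26("cat") == 2074, matching the computation above. The value
    // overflows long after about 13 letters, reflecting the exponential growth.
    return hash;
}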

Separate chaining hash tables

In practice, collisions are unavoidable. Instead of trying to avoid collisions by assigning a unique number to each element, separate chaining hash tables handle collisions by replacing the boolean[] present with an array of buckets, where each bucket can store zero or more elements.

Since separate chaining addresses collisions, we can now use smaller arrays. Instead of using the hash code directly as the index into the array, we compute the index by taking the hash code modulo the array length. Separate chaining turns the process from a single step into multiple steps:

  1. Call the hash function on the given element to get a hash code, aka the number for that element.
  2. Compute the bucket index by taking the modulus of the hash code by the array length.
  3. Search the bucket for the given element.
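
Here is a sketch of how these steps might look for a contains operation, assuming buckets is an array of lists (the names are illustrative, not the actual internals of Java’s HashSet):

boolean contains(E element) {
    int hashCode = element.hashCode();                    // 1. hash code for the element
    // Math.floorMod keeps the index non-negative, even for negative hash codes.
    int index = Math.floorMod(hashCode, buckets.length);  // 2. bucket index
    return buckets[index].contains(element);              // 3. search the bucket
}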

Separate chaining comes with a cost. The worst case runtime for adding, removing, or finding an item is now in Θ(Q) where Q is the size of the largest bucket.

In general, the runtime of a separate chaining hash table is determined by a number of factors. We’ll often need to ask a couple of questions about the hash function and the hash table before we can properly analyze or predict the data structure performance.

  • What is the hash function’s likelihood of collisions?
  • What is the strategy for resizing the hash table?
  • What is the runtime for searching the bucket data structure?

To make good on our promise of a worst case constant time data structure for sets and maps, here are the assumptions we need to make.

  • Assume the hash function distributes items uniformly.
  • Assume the hash table resizes by doubling, tripling, etc. upon exceeding a constant load factor.
  • Assume we ignore the time it takes to occasionally resize the hash table.

Counting Sorts

MSD.java LSD.java

  1. Explain the worst-case lower bound for comparison sorting.
  2. Describe counting sort, MSD radix sort, and LSD radix sort.
  3. Explain how the subsort is used in MSD and LSD radix sort.

In practice, Java’s system sorts like Timsort and dual-pivot quicksort have a linearithmic order of growth. Is it possible to sort faster than worst case Θ(N log N) time? In the best case, we know that sorting algorithms like insertion sort can tell that an already-sorted array is indeed sorted in linear time. But can we design an algorithm that sorts an array of N elements in linear time in the worst case?

Sorting decision tree

We can start by asking, “How many comparisons do we need to make in order to know how to sort an array?” We’ve actually already asked this question before: in merge sort, we relied on knowing that a single element subarray is already sorted with respect to itself. In other words:

  • Sorting an array with 1 element requires 0 comparisons.
  • Sorting an array with 2 elements requires 1 comparison.
  • Sorting an array with 3 elements requires either 2 or 3 comparisons, depending on the answers to the earlier comparisons.

We can draw a sorting decision tree to see exactly what questions we need to ask to determine the sorted order for 3 elements. Each leaf in the decision tree represents a possible sorted order for the elements, and each branch represents a choice of answering either “Yes” or “No” to each comparison question.

Sorting decision tree for 3 elements

This decision tree is not only a conceptual visualization, but it can also be implemented as a program consisting of a hierarchy of nested if conditional statements. This decision tree represents the optimal comparison sorting algorithm: a comparison sort that requires the absolute least number of comparisons to sort a given input array.

if (a < b) {
    if      (b < c) return new int[]{a, b, c};
    else if (a < c) return new int[]{a, c, b};
    else            return new int[]{c, a, b};
} else {
    if      (a < c) return new int[]{b, a, c};
    else if (b < c) return new int[]{b, c, a};
    else            return new int[]{c, b, a};
}

If 3 elements have 6 permutations, how many permutations are there for 4 elements?

For each of the 6 permutations of a, b, c, we can insert the fourth element d before, in-between, or after each element. For example, if we consider the permutation {a, b, c}, we can insert d in 4 different places: {d, a, b, c}, {a, d, b, c}, {a, b, d, c}, and {a, b, c, d}. Ultimately, we take the 6 permutations we had for 3 elements and multiply by 4 to get 24 total permutations for 4 elements. More generally, the number of permutations can be described using the factorial function: 4! = 4 ∙ 3 ∙ 2 ∙ 1 = 24.

If N elements have N! potential permutations, and each potential permutation is a leaf in a balanced sorting decision tree, then the optimal comparison sorting algorithm in the worst case needs about log₂ N! comparisons to determine the correct sorting of the elements. Stirling’s approximation can be used to show that log₂ N! ∈ Θ(N log N).

In other words, the optimal comparison sorting algorithm requires Θ(N log N) comparisons in the worst case. It’s not possible to design a comparison sorting algorithm that takes linear time in the worst case.

Counting sorts and enumeration

The worst case lower bound on comparison sorting algorithms only applies to comparison sorts that use operations like <, >, or compareTo to determine the order of elements. A counting sort sorts an array of elements by relying on enumeration instead of comparison. Elements are considered comparable if they can be compared with one another. Elements are considered enumerable if they can be listed out in a sequence from first to last.

Counting sort
  1. Create a count array that will be used to store the number of times each element occurs in the input array.
  2. Iterate over the input array, updating the count array to reflect the occurrence of each element.
  3. Iterate over the count array, unraveling the number of times each element occurs back into the input array.
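
As a sketch, here is counting sort for integers in the range 0 to R, assuming the maximum value R is known in advance (the method name and parameters are illustrative):

static void countingSort(int[] a, int R) {
    int[] count = new int[R + 1];        // 1. create the count array
    for (int x : a) {                    // 2. tally each element's occurrences
        count[x] += 1;
    }
    int i = 0;                           // 3. unravel the counts back into the array
    for (int v = 0; v <= R; v += 1) {
        for (int c = 0; c < count[v]; c += 1) {
            a[i++] = v;
        }
    }
}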

Open the VisuAlgo module to visualize sorting algorithms. Press Esc to exit the e-Lecture Mode, and choose COU from the top navigation bar to switch to counting sort. Run the sorting algorithm using Sort from the bottom left menu.

Sorting Visualization

Radix sorts

Counting sorts work great with small integer values when we know the range between the smallest integer and the largest integer. But many other data types, like strings, are more difficult to enumerate because we can always make a more complicated string. For example, suppose we have a string “a” and we decide to put it at index 0 in the count array. Where would the string “b” belong in the count array? We know it comes after “a”, but how far after “a”? We might run into strings like “aa”, “aaa”, “aaaa”, etc. and not know how many spaces to reserve for these elements.

To address this issue, we can take inspiration from tries. Just as tries divided a string into its constituent letters and processed each letter individually, we can do the same and apply counting sort on each letter. Radix sorts represent a category of counting sorts that divide strings (or string-like objects) into individual subunits that can be separately counting-sorted.

Most-Significant Digit (MSD) radix sort
Starts from the leftmost (in English, the most significant) character and proceeds to the right.
Recursively counting sorts the characters separately, proceeding to the next index into the strings.
Least-Significant Digit (LSD) radix sort
Starts from the rightmost (in English, the least significant) character and proceeds to the left.
For each index into the strings, iteratively counting sorts all the elements again on the current index.
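
Here is a sketch of LSD radix sort for strings that all share the same length W, applying a stable counting sort on each character index from right to left (a simplified take on the idea behind LSD.java; the alphabet size R = 256 is an assumption):

static void lsdSort(String[] a, int W) {
    int R = 256;                           // assumed alphabet size (extended ASCII)
    String[] aux = new String[a.length];
    for (int d = W - 1; d >= 0; d -= 1) {  // rightmost character index first
        int[] count = new int[R + 1];
        for (String s : a) {               // count the frequency of each character
            count[s.charAt(d) + 1] += 1;
        }
        for (int r = 0; r < R; r += 1) {   // convert counts to starting indices
            count[r + 1] += count[r];
        }
        for (String s : a) {               // distribute into aux, keeping stability
            aux[count[s.charAt(d)]++] = s;
        }
        System.arraycopy(aux, 0, a, 0, a.length);
    }
}

The counting sort on each index must be stable: ties on the current character preserve the order established by earlier passes, which is what makes the final result sorted.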

Open the VisuAlgo module to visualize sorting algorithms. Press Esc to exit the e-Lecture Mode, and choose RAD from the top navigation bar to switch to an LSD radix sort. Run the sorting algorithm using Sort from the bottom left menu.

Sorting Visualization

3-way radix quicksort

Is there a sorting algorithm analogue for ternary search trees? There is, and it combines the ideas of radix sort with quicksort, just as ternary search trees represent a midpoint between tries and binary search trees.

3-way radix quicksort
Select a pivot element for the current index into the strings.
Partition the array into elements less than, equal to, and greater than the pivot element.
Recursively sort each of the less than, equal to, and greater than subarrays.
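
These three steps can be sketched as follows, assuming a helper charAt that returns -1 once the index runs past the end of a string (all names here are illustrative):

static void sort(String[] a, int lo, int hi, int d) {
    if (hi <= lo) return;
    int lt = lo, gt = hi;
    int pivot = charAt(a[lo], d);           // pivot character at string index d
    int i = lo + 1;
    while (i <= gt) {
        int t = charAt(a[i], d);
        if      (t < pivot) swap(a, lt++, i++);
        else if (t > pivot) swap(a, i, gt--);
        else                i += 1;
    }
    sort(a, lo, lt - 1, d);                 // less-than subarray, same index
    if (pivot >= 0) sort(a, lt, gt, d + 1); // equal subarray, next index
    sort(a, gt + 1, hi, d);                 // greater-than subarray, same index
}

static int charAt(String s, int d) {
    return d < s.length() ? s.charAt(d) : -1;
}

static void swap(String[] a, int i, int j) {
    String tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
}

The initial call would be sort(a, 0, a.length - 1, 0).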

Binary Heaps

MinPQ.java Heap.java

  1. Apply sink/swim operations to trace heap element insertion and removal.
  2. Identify possible binary heap indices for the n-th smallest value.
  3. Given an array index, find the parent, left child, and right child indices.

Compared to binary search trees, 2-3 trees and left-leaning red-black trees provide two solutions for avoiding worst case height. But neither a 2-3 tree nor a left-leaning red-black tree maintains a perfectly-balanced binary search tree. A 2-3 tree maintains perfect balance, but needs 3-child nodes. A left-leaning red-black tree is a binary search tree, but it doesn’t maintain perfect balance: in the worst case, the left side can be up to double the height of the right side.

How do we even define perfect balance? One definition is called completeness.

Complete tree
A tree where every level, except possibly the last level, is completely filled. If the last level is not completely filled, all nodes must be as far left as possible.

It’s not easy maintaining a complete binary search tree: a tree that simultaneously satisfies all three of the definitions for a complete tree, a binary tree, and a search tree. In the worst case, adding a new element might require moving every element in the tree to a new place.

Adding to a complete binary search tree

Of the tree data structures that we’ve studied, our best approaches only satisfy two out of three properties:

2-3 tree
A tree data structure that satisfies the definitions for a complete tree and a search tree, but it is not a binary tree.
LLRB tree
A tree data structure that satisfies the definitions for a binary tree and a search tree, but it is not a complete tree.

A binary heap is the final option in the 3-choose-2 data structure design.

Binary heap
A tree data structure that satisfies the definitions for a complete tree and a binary tree, but it is not a search tree.

What can we do with a binary tree without the search tree property? 2-3 trees and LLRB trees provided efficient implementations for sets and maps because of the combination of their search tree property and height invariants. Binary heaps instead implement a different abstract data type called the priority queue.

Priority queue abstract data type

The priority queue is an abstract data type where elements are organized according to their associated priority values. Priority queues have direct real world applications. For example, they can be used to triage patients in a hospital emergency room. Rather than serving patients first-come, first-served (as in a regular queue), a priority queue can be used to ensure patients with the most severe or time-sensitive conditions are treated first even if someone else arrived earlier.

The min-oriented priority queue MinPQ interface supports 3 important operations:

void add(E element, double priority)
Adds the given element with the given priority value.
E peekMin()
Returns the element with the minimum priority value.
E removeMin()
Returns and removes the element with the minimum priority value.
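
In Java, this interface might be declared as follows (a sketch based on the descriptions above; the real MinPQ.java may differ in details):

public interface MinPQ<E> {
    void add(E element, double priority); // add the element with a priority value
    E peekMin();                          // return the min-priority element
    E removeMin();                        // return and remove the min-priority element
}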

Likewise, the max-oriented priority queue MaxPQ could be defined with methods that allow access to the element with the maximum priority value. Priority queues differ from sets in two ways.

  1. Multiple elements can share the same priority value. For example, two patients can be equally in need of care. Ties between priority values can go either way, because either one is an element with the minimum (or maximum) priority value.
  2. In some implementations, duplicate elements are allowed. This doesn’t make sense for the emergency room application since you can’t have two copies of a person, but we’ll later see some algorithms that rely on storing duplicates.

Heap invariant

To implement a priority queue, binary heaps maintain a heap invariant that depends on whether the heap implements a MinPQ or a MaxPQ.

Min-heap invariant
The priority value of each node must be less than or equal to the priority values of all its children.
Max-heap invariant
The priority value of each node must be greater than or equal to the priority values of all its children.

For simplicity, our visualizations will only show the priority value. Implementations of the priority queue abstract data type typically require not just the priority value, but also the element associated with the priority value.

Implementing peekMin just requires returning the overall root element because the min-heap invariant ensures that the element with the minimum priority value will be stored at the very top of the tree.

Implementing removeMin, however, requires more work to maintain the completeness property.

  1. Swap the root with the last leaf.
  2. Remove the last leaf.
  3. Sink the new root to its proper place, promoting the lower-priority child.

Heap implementations define two private helper methods called sink (percolate down) and swim (percolate up) that restore the heap invariant after the removal or addition of an element (respectively). The sink method repeatedly swaps the current node with its lower-priority child (in a min-heap, the child with the smaller priority value) until the heap invariant is restored. The swim method repeatedly swaps the current node with its parent until the heap invariant is restored.

Finally, the add method can be implemented by adding the element to the next open position that maintains completeness before swimming the element to restore heap invariants.
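
Jumping ahead to the array representation described in the next section, here is a sketch of swim and sink for a min-heap with the root stored at index 1 (less, swap, and the size field n are assumed helpers that compare and exchange entries by index):

// Swim: repeatedly swap with the parent while the node is out of order.
private void swim(int k) {
    while (k > 1 && less(k, k / 2)) {  // the parent of index k is at index k / 2
        swap(k, k / 2);
        k = k / 2;
    }
}

// Sink: repeatedly swap with the smaller-value child until the invariant holds.
private void sink(int k) {
    while (2 * k <= n) {               // while index k has at least a left child
        int child = 2 * k;
        if (child < n && less(child + 1, child)) child += 1; // pick the smaller child
        if (!less(child, k)) break;    // min-heap invariant restored
        swap(k, child);
        k = child;
    }
}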

Array representation

Despite all of this work, it turns out that binary heaps are not any more asymptotically efficient than a balanced search tree like a 2-3 tree or a left-leaning red-black tree. In practice, the main advantage of using a binary heap to implement a priority queue comes from the way we can represent the tree using an array.

Array representation is the default and assumed representation for a binary heap.

Node representation
Explicitly maintains tree structure through a hierarchy of references.
Only maintains parent-to-child references, which makes swim challenging to implement efficiently.
Array representation
Implicitly maintains tree structure through a mapping between array indices and tree location.
Both parent-to-child and child-to-parent indices can be computed using arithmetic.
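
For example, with the root stored at index 1 (a common convention; storing the root at index 0 shifts the formulas slightly), the index arithmetic is:

int parent(int k) { return k / 2; }      // child-to-parent
int left(int k)   { return 2 * k; }      // parent-to-left-child
int right(int k)  { return 2 * k + 1; }  // parent-to-right-child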

The following slides and visualizations show a binary max-heap where the heap is organized around access to the maximum element at the top of the heap.

Open the VisuAlgo module to visualize binary max-heap operations. Press Esc to exit the e-Lecture Mode. Choose ExtractMax() from the bottom left menu and select 1x (Once) to see the result of removing the element associated with the maximum priority value. The red number under each node represents the index in the array representation of the tree.

Binary Heap Visualization