Hashing

Complete the Reading Quiz by noon before lecture.

We’ve explored several ways to implement the Set and Map abstract data types. We first examined OrderedLinkedSet, reorganized the data into a Binary Search Tree, and then introduced self-balancing invariants to yield B-Trees and LLRB Trees.

However, there are some fundamental limitations that result from this approach.

Items in a search tree need to be comparable. How do we decide where a new item goes in a BST? We have to answer the question, “Is the item less-than or greater-than the root?” For some objects, this question may make no sense.
The runtime of these data structures’ methods is typically logarithmic, which is a great result and significantly faster than linear time, but it could be improved.

DataIndexedIntegerSet

We’ve actually already seen a data structure that enables faster-than-logarithmic time operations. Recall that indexing into an array is a constant-time operation, no matter the length of the array.

We can use this idea to implement a Set of integers. Instantiate a boolean array, present, of size 2 billion. Each element of the array defaults to false, representing an empty set that contains no integers.

add(int x): To add an integer to the set, assign present[x] = true in constant time.
contains(int x): Return the value of present[x] in constant time.

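public class DataIndexedIntegerSet {
    // present[x] is true exactly when x has been added to the set.
    private boolean[] present;

    public DataIndexedIntegerSet() {
        // Every element defaults to false, so the set starts out empty.
        present = new boolean[2000000000];
    }

    // Valid only for 0 <= x < 2000000000; any other value throws
    // ArrayIndexOutOfBoundsException.
    public void add(int x) {
        present[x] = true;
    }

    public boolean contains(int x) {
        return present[x];
    }
}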

While this implementation is simple and fast, it’s not suitable for real-world use for a few reasons. For one, instantiating an array of size 2 billion requires a large amount of memory, and it still doesn’t cover negative numbers. For another, this solution only works for integers because arrays are indexed by integers. Storing, say, a set of strings will require modifications to this approach.
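To experiment with the data-indexing idea without reserving a 2-billion-element array, here is a minimal sketch that takes the array size as a constructor argument. The class name BoundedDataIndexedIntegerSet and the capacity parameter are assumptions made for illustration, not part of the design above, and rejecting out-of-range values in add is just one possible choice.

```java
/**
 * Sketch only: a data-indexed set restricted to integers in [0, capacity).
 * The class name and the capacity parameter are illustrative assumptions.
 */
class BoundedDataIndexedIntegerSet {
    private final boolean[] present;

    public BoundedDataIndexedIntegerSet(int capacity) {
        // Memory use grows with capacity, not with the number of items stored.
        present = new boolean[capacity];
    }

    /** Adds x in constant time; x must be in [0, capacity). */
    public void add(int x) {
        if (x < 0 || x >= present.length) {
            throw new IllegalArgumentException("out of range: " + x);
        }
        present[x] = true;
    }

    /** Returns whether x is in the set; out-of-range values are never present. */
    public boolean contains(int x) {
        return x >= 0 && x < present.length && present[x];
    }

    public static void main(String[] args) {
        BoundedDataIndexedIntegerSet set = new BoundedDataIndexedIntegerSet(100);
        set.add(42);
        System.out.println(set.contains(42)); // true
        System.out.println(set.contains(7));  // false
        System.out.println(set.contains(-1)); // false: negatives cannot be stored
    }
}
```

Both operations still run in constant time, but the sketch makes the limitations concrete: the array must be as large as the largest value we might store, and negative numbers have no index at all.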

DataIndexedEnglishWordSet

Let’s apply the data-indexing idea to English words represented in Java with the String data type. Suppose we want to add("cat").

What is the present array index for “cat”? One idea is to use the first character of the word, so ‘a’ maps to 1, ‘b’ maps to 2, ‘c’ maps to 3, and so forth.

What's problematic about this approach?

There are many words that start with the character ‘c’. After adding “cat” to the set, contains("chupacabra") will return true because present[3] is true. A collision occurs when two or more keys map to the same index.

We also don’t have a method for storing non-alphabetic or non-English strings.
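The collision described above can be reproduced with a short sketch. The class name FirstLetterWordSet and the helper index are assumptions made for illustration; the sketch also assumes non-empty, lowercase English words and rejects anything else.

```java
/**
 * Sketch only: indexes words by their first letter alone
 * ('a' -> 1, 'b' -> 2, ..., 'z' -> 26).
 * The class name FirstLetterWordSet is an illustrative assumption.
 */
class FirstLetterWordSet {
    // Index 0 is unused so that 'a' maps to 1, matching the text.
    private final boolean[] present = new boolean[27];

    private static int index(String word) {
        if (word.isEmpty() || word.charAt(0) < 'a' || word.charAt(0) > 'z') {
            throw new IllegalArgumentException("not a lowercase English word: " + word);
        }
        return word.charAt(0) - 'a' + 1;
    }

    public void add(String word) {
        present[index(word)] = true;
    }

    public boolean contains(String word) {
        return present[index(word)];
    }

    public static void main(String[] args) {
        FirstLetterWordSet set = new FirstLetterWordSet();
        set.add("cat");
        System.out.println(set.contains("cat"));        // true
        System.out.println(set.contains("chupacabra")); // true, although it was never added
        System.out.println(set.contains("dog"));        // false
    }
}
```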

Avoiding Collisions

One way we can avoid collisions is to assign each English word a unique integer representation based on the entire word, not just the first character. Suppose the English string “a” uniquely maps to 1, “b” uniquely maps to 2, “c” uniquely maps to 3, …, and “z” uniquely maps to 26. Then the string “aa” should map to 27, “ab” to 28, “ac” to 29, and so forth. If we generalize this mapping into a mathematical formula, we can compute the index for “cat” as (3 * 26²) + (1 * 26¹) + (20 * 26⁰) = 2074.
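The formula above can be sketched as a loop that processes one letter at a time. The names EnglishWordIndex and englishToInt are assumptions made for illustration, and the sketch only accepts lowercase letters 'a' through 'z'.

```java
/**
 * Sketch only: converts a lowercase English word to an integer by treating
 * each letter as a digit from 1 ('a') to 26 ('z') in base 26.
 * The names EnglishWordIndex and englishToInt are illustrative assumptions.
 */
class EnglishWordIndex {
    static final int BASE = 26;

    public static int englishToInt(String word) {
        int index = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("not a lowercase letter: " + c);
            }
            // Multiplying by BASE shifts the earlier letters one place left;
            // the new letter then fills the lowest place.
            // Note: for words longer than six letters the result may overflow int.
            index = index * BASE + (c - 'a' + 1);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(englishToInt("a"));   // 1
        System.out.println(englishToInt("aa"));  // 27
        System.out.println(englishToInt("cat")); // 2074
    }
}
```

Accumulating the result this way avoids computing powers of 26 explicitly while producing the same value as the expanded formula.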

Convert the word "bee" into an integer index using the formula above.

Since the character ‘b’ is the second letter in the alphabet and ‘e’ is the fifth, we get (2 * 26²) + (5 * 26¹) + (5 * 26⁰) = 1487.

So long as each letter maps to a value from 1 to 26 and we pick a base that’s at least 26, this algorithm is guaranteed to assign each lowercase English word a unique integer.
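To see why the base matters, the following sketch counts how many distinct indices result from all one- and two-letter lowercase words under a given base. The class name BaseUniquenessCheck and its methods are assumptions made for illustration, and checking only short words is a spot check rather than a proof.

```java
/**
 * Sketch only: compares base 26 against a smaller base on every lowercase
 * word of length 1 or 2. All names here are illustrative assumptions.
 */
class BaseUniquenessCheck {
    /** Same letter-to-digit scheme as the text: 'a' -> 1, ..., 'z' -> 26. */
    public static int toIndex(String word, int base) {
        int index = 0;
        for (int i = 0; i < word.length(); i++) {
            index = index * base + (word.charAt(i) - 'a' + 1);
        }
        return index;
    }

    /** Counts distinct indices over all 26 + 26 * 26 = 702 short words. */
    public static int countDistinctIndices(int base) {
        // The largest possible index belongs to "zz": 26 * base + 26.
        boolean[] seen = new boolean[26 * base + 26 + 1];
        for (char first = 'a'; first <= 'z'; first++) {
            seen[toIndex(String.valueOf(first), base)] = true;
            for (char second = 'a'; second <= 'z'; second++) {
                seen[toIndex("" + first + second, base)] = true;
            }
        }
        int count = 0;
        for (boolean used : seen) {
            if (used) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countDistinctIndices(26)); // 702: every word is distinct
        // With base 10, "ak" (1 * 10 + 11) and "ba" (2 * 10 + 1) both map to 21.
        System.out.println(toIndex("ak", 10) == toIndex("ba", 10)); // true
    }
}
```

With base 26, all 702 words receive different indices, while a base smaller than 26 lets two different words share an index.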

Reading Quiz