Hashing - Hash Tables

Complete the Reading Quiz by 3:00pm before lecture.

In the last lecture’s reading, we introduced the idea of a hash function. A hash function gives us a way to convert some object or blob of data into a number that “represents” that data. We saw how there are important properties that make a hash function “good” – namely that it is deterministic, efficient, and uniform.

In this reading, we will see how hash functions can help us create highly efficient data structures by building toward the basic idea of a hash table. In lecture, we will talk about hash tables in depth and cover their implementation.

DataIndexedIntegerSet

Let’s say we’re trying to implement a Set of integers. We could use some of the tree data structures we’ve already seen, and get logarithmic runtimes for most of the operations we care about. But is it possible to be even faster?

We’ve actually already seen a data structure that enables faster-than-logarithmic time operations. Recall that indexing into an array is a constant-time operation, no matter the length of the array. We can use this idea for our Set. Instantiate a boolean array present of size 2 billion, where each index corresponds to one possible integer that could be stored in the set. Every element of the array defaults to false, so the set starts out empty and contains no integers.

add(int x): To add an integer to the set, assign present[x] = true in constant time.
contains(int x): Return the value of present[x] in constant time.

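public class DataIndexedIntegerSet {
    // present[x] is true exactly when x has been added to the set.
    private boolean[] present;

    public DataIndexedIntegerSet() {
        // Java initializes every element of a new boolean array to false.
        present = new boolean[2000000000];
    }

    // Assumes 0 <= x < 2000000000; other values throw ArrayIndexOutOfBoundsException.
    public void add(int x) {
        present[x] = true;
    }

    // Same range assumption as add.
    public boolean contains(int x) {
        return present[x];
    }
}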

While this implementation is simple and fast, it’s not suitable for real-world use due to a few problems:

This solution happens to work for integers because arrays are indexed by integers. Storing, say, a set of strings will require modifications to this approach.
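Instantiating an array of size 2 billion requires a massive amount of memory (a Java boolean array typically uses one byte per element, so roughly 2 GB) even when the set holds only a handful of items. And what if we want to store an integer outside that range, such as -1?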
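
To make the range problem concrete, here is a minimal sketch. BoundedIntegerSet and its capacity parameter are hypothetical names introduced only for this illustration; the class is the same data-indexed idea with a small array, so it can be run without allocating 2 billion elements.

```java
// Hypothetical small-capacity variant of DataIndexedIntegerSet (illustration only).
class BoundedIntegerSet {
    private final boolean[] present;

    BoundedIntegerSet(int capacity) {
        // Every element starts out false, so the set starts out empty.
        present = new boolean[capacity];
    }

    void add(int x) {
        present[x] = true;
    }

    boolean contains(int x) {
        return present[x];
    }
}

class BoundedIntegerSetDemo {
    public static void main(String[] args) {
        BoundedIntegerSet set = new BoundedIntegerSet(10);
        set.add(3);
        System.out.println(set.contains(3)); // true
        System.out.println(set.contains(4)); // false

        try {
            set.add(-1); // -1 is not a valid array index
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("cannot store -1");
        }
    }
}
```

The same exception would occur for any value at or above the capacity, which is why a fixed-size array indexed directly by the data cannot represent an arbitrary set of integers.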

In the next section, we explore ways to address these problems.

DataIndexedEnglishWordSet

Let’s apply the data-indexing idea to English words represented in Java with the String data type. Suppose we want to add("cat"). From our previous lecture, we know that we can use a hash function to convert the string cat, or any string, to an integer.
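
As a rough illustration of the conversion step, the sketch below uses Java's built-in String.hashCode() as a stand-in. This is an assumption: the hash function from the previous lecture may differ, and toInt is a hypothetical helper name used only here.

```java
class StringToIntDemo {
    // Assumption: String.hashCode() stands in for the lecture's hash function.
    // It computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] using int arithmetic.
    static int toInt(String word) {
        return word.hashCode();
    }

    public static void main(String[] args) {
        // 'c' = 99, 'a' = 97, 't' = 116, so 99*31*31 + 97*31 + 116 = 98262.
        System.out.println(toInt("cat")); // 98262

        // Deterministic: the same string always produces the same integer.
        System.out.println(toInt("cat") == toInt("cat")); // true
    }
}
```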

Picking a Size

Having a way to convert non-numeric elements to integers is useful, but we still run into the problem of storing those integers. We can’t have an infinitely large array; whatever size we choose for present, some hash values will be too large to use as an index. We’ll have to map multiple elements onto our limited set of indices. This problem is analogous to the one we discussed in the hashing lecture: when our hash function generated a value outside of its range, we used the modulo operator to bring it back inside the range. In other words, we can’t avoid collisions altogether.
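
The modulo idea can be sketched as follows. This is not the implementation covered in lecture; the class name, the indexOf helper, and the array size of 10 are assumptions chosen for illustration, and String.hashCode() again stands in for the hash function.

```java
// Hypothetical sketch: a data-indexed word set that uses modulo to fit a small array.
class DataIndexedEnglishWordSetSketch {
    private final boolean[] present;

    DataIndexedEnglishWordSetSketch(int size) {
        present = new boolean[size];
    }

    // Math.floorMod keeps the result in [0, size) even when hashCode() is negative.
    int indexOf(String word) {
        return Math.floorMod(word.hashCode(), present.length);
    }

    void add(String word) {
        present[indexOf(word)] = true;
    }

    boolean contains(String word) {
        return present[indexOf(word)];
    }
}

class CollisionDemo {
    public static void main(String[] args) {
        DataIndexedEnglishWordSetSketch set = new DataIndexedEnglishWordSetSketch(10);
        set.add("cat");

        System.out.println(set.indexOf("cat")); // 2
        System.out.println(set.indexOf("mat")); // 2, the same index as "cat"

        // "mat" was never added, but it collides with "cat", so this reports true.
        System.out.println(set.contains("mat")); // true
        System.out.println(set.contains("dog")); // false ("dog" maps to index 4)
    }
}
```

With only a boolean per index, the sketch cannot tell "cat" and "mat" apart once they land on the same index, which is exactly the collision problem described above.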

We need to find some way to reasonably deal with the inevitable collisions, though we can adjust the size of our present array to make them as rare as possible. We’ll explore ways to deal with collisions and how to choose an array size in lecture.

Reading Quiz