Hashing
Balanced search tree data structures have two fundamental limitations.
- Items in a search tree need to be comparable. In order to decide where a new item goes in a search tree, we have to answer the question, “Is the item less-than or greater-than the root?” For some objects, this question may make no sense.
- The runtime of these data structures' methods is typically logarithmic. This is significantly faster than linear time, but could be improved.
Data-indexed arrays
Idea
- Indexing into an array is a constant-time operation no matter the length of the array.
We can use this idea to implement a Set of integers. Instantiate a boolean array present of size 2 billion. Each index of the array defaults to the value false, representing that the set is empty and contains no integers.
add(int x)
- To add an integer to the set, assign present[x] = true in constant time.
contains(int x)
- Return the value of present[x] in constant time.
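public class DataIndexedIntegerSet {
    // present[x] is true exactly when x has been added to the set.
    private boolean[] present;

    public DataIndexedIntegerSet() {
        // Every entry defaults to false, so the set starts out empty.
        present = new boolean[2000000000];
    }

    // Only valid for 0 <= x < 2,000,000,000; other values throw
    // ArrayIndexOutOfBoundsException.
    public void add(int x) {
        present[x] = true;
    }

    public boolean contains(int x) {
        return present[x];
    }
}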
While this implementation is simple and fast, it’s not suitable for real-world use. For one, instantiating an array of size 2 billion requires a significant amount of memory and doesn’t even cover negative numbers. Furthermore, this solution happens to work for integers because arrays are indexed by integers. Storing a set of strings will require modifications to this approach.
Suppose we want to add “cat”. One way to determine where to store “cat” is to use the first character of the word where ‘a’ maps to 1, ‘b’ maps to 2, ‘c’ maps to 3, and so forth.
What's problematic about this approach?
There are many words that start with the character ‘c’. After adding “cat” to the set, contains("chupacabra") will return true because present[3] is true. We also don’t have a method for storing non-alphabetic or non-English strings.
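The false positive described above can be reproduced with a minimal sketch. The class name FirstLetterSet and its helper index are hypothetical, and the code assumes every string is non-empty and starts with a lowercase English letter.

```java
// Hypothetical class illustrating the flawed first-letter scheme.
class FirstLetterSet {
    // Index 0 is unused; 'a' maps to 1, 'b' to 2, ..., 'z' to 26.
    private final boolean[] present = new boolean[27];

    // Assumes s is non-empty and starts with a lowercase English letter.
    private static int index(String s) {
        return s.charAt(0) - 'a' + 1;
    }

    void add(String s) {
        present[index(s)] = true;
    }

    boolean contains(String s) {
        return present[index(s)];
    }

    public static void main(String[] args) {
        FirstLetterSet set = new FirstLetterSet();
        set.add("cat");
        // "chupacabra" was never added, but it shares index 3 with "cat".
        System.out.println(set.contains("chupacabra")); // true
        System.out.println(set.contains("dog"));        // false
    }
}
```

Any string outside the assumed alphabet, such as "Cat" or "42", would compute an index outside the array and throw an exception, which is the second problem noted above.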
Collisions
A collision occurs when two or more keys map to the same index. One way we can avoid collisions is to assign to each English word a unique integer representing all of the characters in the entire word, not just the first character.
Suppose the string “a” maps to 1, “b” maps to 2, “c” maps to 3, …, and “z” maps to 26. Then the string “aa” should map to (1 * 26^1) + (1 * 26^0) = 27, “ab” to 28, “ac” to 29, and so forth. Since there are 26 lowercase English characters, our base is 26.
Generalizing this pattern, we can compute the index for “cat” as (3 * 26^2) + (1 * 26^1) + (20 * 26^0) = 2074. This mathematical formula is known as a hash function. The result of hashing “cat” is the hash code 2074.
Convert the word "bee" into an integer index using the formula above.
Since the character ‘b’ is the second letter in the alphabet and ‘e’ is the fifth, we get (2 * 26^2) + (5 * 26^1) + (5 * 26^0) = 1487.
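The base-26 formula can be sketched in a few lines of Java. The class name Base26Hash and the method name hash are hypothetical, and the code assumes the word contains only the lowercase letters ‘a’ through ‘z’. Instead of computing each power of 26 separately, the loop multiplies the running total by 26 before adding the next letter, which produces the same sum.

```java
// Hypothetical helper illustrating the base-26 hash function.
class Base26Hash {
    // Assumes word contains only 'a'..'z'. Long words overflow int.
    static int hash(String word) {
        int code = 0;
        for (int i = 0; i < word.length(); i++) {
            int digit = word.charAt(i) - 'a' + 1; // 'a' -> 1, ..., 'z' -> 26
            code = code * 26 + digit;             // shift left one base-26 place
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(hash("aa"));  // 27
        System.out.println(hash("cat")); // 2074
        System.out.println(hash("bee")); // 1487
    }
}
```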
So long as we pick a base that’s at least 26, this algorithm is guaranteed to assign each lowercase English word a unique integer. This doesn’t include uppercase characters or punctuation. If we want to support languages other than English, we’ll need an even larger base. For example, there are 40,959 characters in the Chinese language alone. A char in Java supports all of these characters and more: the range of possible characters is defined by a standard known as Unicode. Each Java char is 16 bits wide, so our choice of base will need to be 2^16.
To make matters worse, not only does choosing such a large base result in impractical memory usage, but collisions are also unavoidable with Java arrays. The maximum size of an array is limited to the largest Java int, which is 2,147,483,647. There are more unique strings than unique integers, so collisions are inevitable!
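A concrete collision is easy to find. The sketch below swaps the base-26 formula for Java's built-in String.hashCode, which uses base 31 and int arithmetic, so it is an illustration of the same limitation rather than of the formula above. The class name InevitableCollision and the method sameHashCode are hypothetical.

```java
// Hypothetical demo: two different strings with the same int hash code.
class InevitableCollision {
    // True when a and b are different strings that share a hash code.
    static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // String.hashCode for a two-character string is s[0] * 31 + s[1].
        System.out.println("Aa".hashCode());     // 65 * 31 + 97 = 2112
        System.out.println("BB".hashCode());     // 66 * 31 + 66 = 2112
        System.out.println(collide("Aa", "BB")); // true
    }
}
```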
Hash table
Hash tables improve upon data-indexed arrays by handling collisions in one of several possible ways. In this course, we will focus on separate chaining as a means of handling collisions.
Separate chaining replaces the boolean[] present with an array of buckets containing zero or more items. Each bucket in the array is initially empty. When an item x is added to index h, add x to the bucket if it is not already present. In these examples, buckets are shown as linked lists.
Since separate chaining addresses collisions, we can now use smaller arrays. Instead of using the hash code directly as the index, take the hash code mod the length of the array to compute the index.
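As a small sketch of the index computation: Java hash codes can be negative, and the % operator would then produce a negative result, so this example assumes Math.floorMod is used instead. The class name BucketIndex and the method indexFor are hypothetical.

```java
// Hypothetical helper: reduce a hash code to a valid bucket index.
class BucketIndex {
    // Returns an index in the range [0, numBuckets) for any int hash code.
    static int indexFor(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(indexFor(2074, 10)); // 4 (hash code of "cat" from the formula above)
        System.out.println(indexFor(1487, 10)); // 7 (hash code of "bee")
        System.out.println(indexFor(-7, 5));    // 3, whereas -7 % 5 is -2
    }
}
```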
However, introducing separate chaining comes with a cost. The runtime for adding, removing, or finding an item is now in O(Q) where Q is the size of the largest bucket. Depending on the value of Q, the runtime for our hash table can be potentially worse than constant or even worse than log N, where N is the total number of items stored in the hash table.
contains(Object o)
- Compute the hash code of o mod the length of the array to get the bucket index. Then, search the bucket by calling equals on each item.
Given a hash table with 5 buckets, give the order of growth of Q.
Q is in Theta(N). In the best case, all items are evenly distributed across all 5 buckets, so Q ~ N / 5. In the worst case, all items collide in a single bucket, so Q = N.
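The two cases can be checked with a short simulation. This sketch assumes distinct integer keys that serve as their own hash codes; the class name BucketSizes and the method largestBucket are hypothetical.

```java
// Hypothetical simulation of bucket sizes for a fixed number of buckets.
class BucketSizes {
    // Returns Q, the size of the largest bucket, after placing the
    // (assumed distinct) keys into m buckets by key mod m.
    static int largestBucket(int[] keys, int m) {
        int[] sizes = new int[m];
        int q = 0;
        for (int key : keys) {
            int index = Math.floorMod(key, m);
            sizes[index]++;
            q = Math.max(q, sizes[index]);
        }
        return q;
    }

    public static void main(String[] args) {
        int n = 100;
        int[] spread = new int[n];   // 0, 1, 2, ..., 99
        int[] clustered = new int[n]; // 0, 5, 10, ..., 495
        for (int i = 0; i < n; i++) {
            spread[i] = i;
            clustered[i] = 5 * i;
        }
        System.out.println(largestBucket(spread, 5));    // 20, which is N / 5
        System.out.println(largestBucket(clustered, 5)); // 100, which is N
    }
}
```

In both cases Q grows in proportion to N, because the number of buckets stays fixed at 5.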
add(E element)
- Resize if N / M exceeds the load factor, where M is the number of buckets in the array. Add the element to its bucket if it’s not already present.
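Putting contains and add together, a separate chaining set might look like the minimal sketch below. Several details are assumptions not given above: the class name ChainedHashSet, the initial array length of 4, the load factor of 1.5, doubling the array on resize, and the lack of support for null elements.

```java
import java.util.LinkedList;

// Hypothetical sketch of a set backed by separate chaining.
class ChainedHashSet<E> {
    private static final double LOAD_FACTOR = 1.5; // assumed threshold for N / M
    private LinkedList<E>[] buckets = newBuckets(4); // assumed initial M
    private int size; // N, the number of items stored

    @SuppressWarnings("unchecked")
    private static <T> LinkedList<T>[] newBuckets(int m) {
        LinkedList<T>[] result = (LinkedList<T>[]) new LinkedList[m];
        for (int i = 0; i < m; i++) {
            result[i] = new LinkedList<>(); // every bucket starts empty
        }
        return result;
    }

    // Hash code mod the array length; floorMod handles negative hash codes.
    private static int indexFor(Object o, int m) {
        return Math.floorMod(o.hashCode(), m);
    }

    boolean contains(Object o) {
        for (E item : buckets[indexFor(o, buckets.length)]) {
            if (item.equals(o)) {
                return true;
            }
        }
        return false;
    }

    void add(E element) {
        if ((double) size / buckets.length > LOAD_FACTOR) {
            resize(2 * buckets.length);
        }
        if (!contains(element)) {
            buckets[indexFor(element, buckets.length)].add(element);
            size++;
        }
    }

    // Items must be re-indexed because the index depends on the array length.
    private void resize(int newLength) {
        LinkedList<E>[] old = buckets;
        buckets = newBuckets(newLength);
        for (LinkedList<E> bucket : old) {
            for (E item : bucket) {
                buckets[indexFor(item, newLength)].add(item);
            }
        }
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        ChainedHashSet<String> set = new ChainedHashSet<>();
        set.add("cat");
        set.add("bee");
        set.add("cat"); // duplicate, ignored
        System.out.println(set.size());          // 2
        System.out.println(set.contains("bee")); // true
        System.out.println(set.contains("dog")); // false
    }
}
```

Resizing keeps N / M bounded, which is what prevents the largest bucket from growing with N when the hash codes are well spread out.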