This page is from a previous offering of the course. It has been left up for archival purposes.
Hash tables are the second data structure we’ll discuss for the dictionary ADT. With AVL trees we required that the keys are comparable, meaning we are able to perform > and < comparisons. We typically refer to a dictionary requiring this assumption as an ordered dictionary. Our hash tables will not require such an assumption, and therefore we call this type of dictionary an unordered dictionary.
1 Big Array Dictionary Data Structure
The overall idea of a hash table is to build a dictionary data structure that behaves like one really big array, but in reality is a relatively small array. If we have no regard for the amount of memory we can use for our data structures, the following would be a dictionary data structure that could perform all dictionary operations in constant time.
We will store all of our key-value pairs in a an array. Because all computer data can be represented using 1s and 0s, and sequences of 1s and 0s can represent numbers, we can represent each key as a number. Our array will therefore have enough indices such that every possible key is an index in this array. Our dictionary operations can then be defined as follows:
- insert: Considering the given key as an index in the array, set the value at that index to be the one given. In code, this would be something like
arr[(int) key] = (key,value). - find: Considering the given key as an index in the array, return the value found at that index. In code,
return arr[(int) key].value. - delete: Considering the given key as an index in the array, set the value at that index to be null. In code,
arr[(int) key] = null.
Now let’s see exactly how big our array would need to be in order to implement this data structure. To do this, we’ll assume that our keys and values both can be represented using one 64-bit word (the word size for most modern processors). This means that the length of our array will need to be 2^{64} indices of 64 bits each, or 64\cdot 2^{64}= 2^6\cdot 2^{64}=2^{70} bits, which 2^{67} bytes or 128 exabytes of storage. To put this into perseptive, 1 exabyte in approximately the total amount of data that has been collected cumulatively by the Large Hadron Collider for its entire existance (as of 2025, so 17 years). Clearly, this is just too much to be reasonable.
The goal of a hash table is to mimic the behavior of this gigantic array, but have it use substantially less data. Hash tables will be similar to the big array data structure in the following ways:
- We will store key-value pairs in an array
- We will somehow convert keys into numbers to be used to select an index of the array
But importantly, here’s how it will be different:
- The size of the big array depended on the number of possible keys, the size of the array in a hash table will depend on the number of keys actually used.
- We will map keys to integers using something called a hash function, which should behave as if we’re using the key to select an integer at random.
- The integer we map to may not necessarily be a valid index of the array, so we will mod the selected integer by the array’s length to obtain an index.
- The number of keys will be greater than the number of indices in the array, and therefore we may have multiple key-value pairs mapping to the same index (due to the pigeonhole principle), so we’ll need to have a strategy to account for this.
2 Hash Table Ideal
At this point, we’ll discuss the overall idea for how a hash table will work. There are still plenty of details that we’ll need to fill in, but it will be helpful to see the most basic intention for their behavior as a starting point. For this basic overview, we will ignore the possibility of collisions (multiple keys using the same index) in order to show the ideal
behavior. All of these operations will need to change once we being accounting for collisions.
A hash table will store key-value pairs in an array. It assumes that we have a hash function, which serves to map keys to an integer. The dictionary operations will behave as follows:
- insert: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index, set the value at that index to be the one given. In code, this would be something like
arr[hash(key)$arr.length] = (key,value). If the size of the hash table reaches a fixed ratio with the length of the array, resize the array, and re-insert all key-value pairs. - find: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index, return the value found at that index. In code,
return arr[hash(key)%arr.length]. - delete: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index, set the value at that index to be null. In code,
arr[hash(key)%arr.length] = null.
This should hopefully appear to you to be quite similar to the opertions for the big array data structure, with the main changes being:
- We use the hash function to pick a number and then modulus to pick an index for our hash tables (as opposed to somehow directly using the key itself as an index).
- We will occasionally need to resize the array as we add more key-value pairs. This is because the changes of a collision will go up as more indices of the array are occupied, and so we will resize to keep that probability low.
These small differences are incredibly important. In the big array we always need one index per potential key, regardless of how many keys we actually used. For hash tables we can accommodate any length of the array. The result is that we can use the number of keys actually present in the hash table to select whatever array length gives the best tradeoff for running time vs. memory usage.
3 Designing Hash Functions
Whenever our hash table’s behavior matches the ideal
above, our performance should match that of the big array. That is, if we never have collisions then all operations run in constant time. For this reason, we must take care when designing our hash functions so that they minimize how frequently we have collisions. Remember, it is impossible for us to prevent collisions entirely (because the whole goal is to have fewer indices than keys), but we can take action to make collisions as unlikely as possible.
Because we’re not going to know exactly what keys we might see when designing our hash function, the safest way to design our hash function to minimize the number of collisions is to make it look like hash(key)%arr.length selects an index of our array completely at random. We cannot literally select a random index, though, because we need to make sure we always select the same index for every key. Wwhen we insert a key-value pair at some index, we need to make sure look at the same index to find it later. If we simply use Math.Random to select an index, then the next time we need an index for the same key we will get a different random value.
With all of this in mind, we want a hash function to have all of the following properties:
- Consistent: Equal keys should hash to the same integer
- To be consistent, it must also be deterministic
- Uniform: Should use every index of a fixed-sized array, and use each index at a roughly equal rate
- Effective: It should behave as if the selection of an integer was random. This can be achieved by satisfyig all of:
- It is hard to find two unequal keys which hash to the same integer
- Given a key, it is hard to find another unequal key which hashes to the same integer
- The
avalanche effect
: making even a small change to a key results in an arbitrary change to the integer it hashes to
- Efficient: The hash function itself should not take much time to run
3.1 Good and Bad Examples
Let’s look at some examples of potential hash functions for strings, and discuss which of these properties they do/do not have. We will assume that each string s has n characters, where s_i is the ascii encoding of the ith character in the string.
To be a good
hash function, the function must satisfy all of the properties above.
Option 0: Use a random character
For this hash function, we will use h_0(s)=s_{random.nextInt(n)}.
- Consistent?: No! If we run the hash function twice we may get a different result each time. For example, the first time we run h(\text{"silent"}) we might randomly select the index 0, and so the output would be 115 (the ascii encoding of the character
s). The next time we run it we may randomly select the index 3, and so the output would be 101 (the ascii encoding of the charactere). Because running the hash function twice on the same key gave different results, it is not consistent.- Uniform?: No! Some characters are more common than others, so we’re more likely to hash to the ascii values of more common characters.
- Effective?: No! All strings that are just the same character repeated multiple times will be different, but always hash to the same value. For example, the strings
a,aa, andaaaaaaalways hash to the same value, even though they are different strings. It’s easy to find more examples (e.g.bwithbbbbb, etc.).- Efficient?: Yes! this will be a very fast operation.
Option 1: Use the first character
For this hash function, we will use h_1(s)=s_i.
- Consistent?: Yes! If we apply the hash function to the same string over and over again, we will always get the same result
- Uniform?: No! Some characters are more common than others, so we’re more likely to hash to the ascii values of more common characters.
- Effective?: No! Any two strings which start with the same character will hash to the same value. For example, the string
bananahashes to the same value as the stringbagel.- Efficient?: Yes! this will be a very fast operation.
Option 2: Sum the characters
For this hash function, we will use h_2(s)=\sum_{i=0}^{n-1} s_i.
- Consistent?: Yes! If we apply the hash function to the same string over and over again, we will always get the same result
- Uniform?: Probably not. This one is harder to see, but words formed by different combinations of common letter are more likely to appear.
- Effective?: No! Any two strings which contain the same characters (i.e. anagrams) will hash to the same value. For example, the string
silenthashes to the same value as the stringlisten.- Efficient?: Yes! this will be a very fast operation.
Option 3: Sum the characters multiplied by a power of a prime number
For this hash function, we will use h_3(s)=\sum_{i=0}^{n-1} s_i \cdot 31^i.
- Consistent?: Yes! If we apply the hash function to the same string over and over again, we will always get the same result
- Uniform?: Yes! As long as the array’s length is not exactly 31, we will use all indices, and in equal proportion.
- Effective?: Yes! Changing any character at any index will change the value we hash to in an unpredictable way.
- Efficient?: Yes! This is actually pretty efficient. It may not seem so at first due to the exponentiation of 31, but this could be computed efficiently by repeatedly doing
sum = (sum+s[i])*31for each character in the string.
Option 4: Sum the characters multiplied by a power of a prime number, double the result
For this hash function, we will use h_4(s)=2\cdot h_3(s).
- Consistent?: Yes! For the same reason option 3 is.
- Uniform?: No! we never use odd indices.
- Effective?: Yes! For the same reason option 3 is.
- Efficient?: Yes! For the same reason option 3 is.
Option 5: First alphabetize the characters, then apply option 3.
For this hash function, we will sort the string by character, for example, the string
bananawould becomeaaabnn, then we apply h(3) to that.
- Consistent?: Yes! For the same reason option 3 is.
- Uniform?: Yes! For the same reason option 3 is.
- Effective?: No! Any two strings which contain the same characters (i.e. anagrams) will hash to the same value. For example, the string
silenthashes to the same value as the stringlisten.- Efficient?: Yes! For the same reason option 3 is.
3.2 Design Considerations
When designing your hash functions, here are some things to keep in mind.
- To be consistent, you should never use aspects of your keys which are not relevant for determining equivalence. Because we need equal keys to give the same result, any aspect (e.g. field) of your key object which is not used in a
.equalsmethod should not be used in the hash function. Otherwise, this means we can make a change to the object which does not make it different according to.equals, but causes it to hash to a different value - To be effective, you should use all aspects of the keys which are relevant for determining equivalence. If you exclude some aspect of your key which is used in a
.equalsmethod, then we can make a change to that aspect to change the key without changing the value that is hashes to. - Use prime numbers. Because of number theoretic properties that we won’t get into, prime numbers result in better uniformity of the values that we hash to.
4 Collision Resolution
While a good has function will minimize the odds of a collision (i.e. two different keys mapping to the same index after applying the hash function), we will never be able to avoid collisions completely (unless we’re using exabytes of data, I suppose). For this reason, we must have some strategy for resolving collisions. In general, there are two major approaches to this: Separate Chaining and Open Addressing.
4.1 Separate Chaining
A separate chaining hash table accounts for collisions by using an array of linked lists rather than an array of key-value pairs. The idea is that instead of storing just one key-value pair at each index, it will store a list of key-value pairs at each index. This way if multiple keys map to the same index we can just store all of those keys at this index.
The operations for a separate chaining hash table would be as follows:
- insert: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index. If the linked list at that index contains the key already, update the value. Otherwise, add the key-value pair to the linked list. If the size of the hash table reaches a fixed ratio with the length of the array, resize the array, and re-insert all key-value pairs.
- find: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index. Check for the key in the linked list at that index, and return the corresponding value if present.
- delete: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index, remove the key-value pair from the linked list at that index.
4.1.1 Why Linked Lists?
We use a linked list data structure at each index of the array. Each of these linked list data structures is actually being used as if it were a dictionary data structure because the hash table insert involves an insert operation on the linked list, hash table find involves a find operation on the linked list, and hash table delete involves a delete operation on the linked list. In general, we could use any dictionary data structure at each index of the separate-chaining hash table (array list, binary search tree, AVL tree, another hash table etc.), so why a linked list? Recall that The goal of using a good hash function and occasionally resizing the array is to make collisions rare. This means that a linked list makes the most sense because:
- Very often our per-index data structures will be empty, so we want a data structure that does not use much excess space (this excludes array-based data structures like array lists or hash tables)
- When not empty, our per-index data structure should never contain a large number of elements, so we do not care about the asymptotic running time (so AVL trees or other complex but asymptotically efficient data structures are not helpful)
In summary, linked lists have the best combination of being memory efficient and time efficient for empty or small dictionaries, making them the best choice.
4.2 Open Addressing
An open addressing hash table (also known as a probing hash table) stores key-value pairs directly in an array of key-value pairs, and accounts for collisions by using a strategy called probing
to find an alternative index to store a key-value pair whenever the mapped-to index is occupied. To probe simply means to check sequence of alternative indices
The operations for an open-addressing hash table would work as follows:
- insert: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index. If the hashed-to index is empty, put the key-value pair at that index. If the hashed-to index contains that key, update the value. If the index contains a key-value pair with a different key then begin probing to check alternative indices. For each index while probing:
- If the index contains the given key, update the value and end the probe
- If the index is empty, add the given key-value pair at that index and end the probe
- Otherwise continue probing
- find: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index. If the hashed-to index contains the given key, return the value. If the index is empty, the key was not present. If the index contains a key-value pair with a different key then begin probing to check alternative indices. For each index while probing:
- If the index contains the given key, return the value and end the probe
- If the index is empty, the key was not present, end the probe
- Otherwise continue probing
- delete: This one is a bit complicated, we’ll discuss it below.
4.2.1 Probing Patterns
A probing pattern refers to which sequence of indices to pick as the alternative. We get different performance behaviors for different choices of probing pattern.
4.2.1.1 Linear Probing
Linear probing is the simplest probing pattern. Suppose that we want to do a find/insert/delete operation for a given key k, and so we compute h(k) % arr.length to be index i. As long as we see collisions, we will follow the pattern: i, i+1, i+2, etc. This means that the jth index we check in the probe sequence is index i+j. In other words, we just check the very next index in the arr as we continue probing.
The primary downside of linear probing is that the running time scales poorly as the hash table fills with more key-value pairs due to a phenomenon called primary clustering. Consider that we have a value at index i of the hash table. If we have a collision at index i when inserting, that new item will be added at index i+1. Now if we have an insert that hashes to index i or i+1 then that item will go at index i+2. At this point there are 3 different places that cause a new key-value pair to be placed at index i+3 (those being i, i+1, or i+2).
Here’s why this is important. Observe that when we have a continiguous segment of our array that is filled with key-value pairs, an insert which hashes to any index in that segment guarantees that this new insert will occur at the index immediately following this segment, thereby growing the size of the filled segment. As a consequence of the segment growing in size, future inserts have a higher likelihood of hashing to an index within this segment. This produces a feedback loop where as a segment grows, we more likely to hash to an index with the segment, which grows the segment, which increases the odds that we hash to within the segment, which means we’re more likely to hash to within that segment again. Overall this means that as our table gets more full, our running time may increase dramatically due to this clustering of keys.
To help address this primary clustering concern, we could use an alternative probing pattern where we take large steps away when probing so that we hopefully space apart our key-value pairs when adding them, so we have better running times as our hash table fills. Quadratic probing is one such way to do this.
4.2.1.2 Quadratic Probing
Suppose that we want to do a find/insert/delete operation for a given key k, and so we compute h(k) % arr.length to be index i. As long as we see collisions, we will follow the pattern: i, i+1^2, i+2^2, i+3^2, i+4^2, etc. This means that the jth index we check in the probe sequence is index i+j^2.
Notice that each time we need to check another index while probing we looking increasingly farther away. This behavior addresses the primary clustering issue we saw with linear probing. By taking large steps away in our probe path we are unlikely to extend the size of a cluster.
Quadratic probing, while not susceptible to primary clustering, is still susceptible to secondary clustering. Even if two differen keys hash to different values, they may map to the same index after we mod by the lenght of the array. For quadratic probing (as with linear probing), two keys which map to the same index will follow the same probe path. We may get better performance if different keys were more likely to follow different probing paths. To achieve this, we could consider double hashing.
4.2.1.3 Double Hashing
For this option we will need two hash functions. We will call the primary hash function h and the secondary hash function g (it may be that g is derived from h). Suppose that we want to do a find/insert/delete operation for a given key k, and so we compute h(k) % arr.length to be index i. As long as we see collisions, we will follow the pattern: i, i+g(k), i+2g(k), i+3g(k), i+4g(k), etc. This means that the jth index we check in the probe sequence is index i+j\cdot g(k).
Because g and h are both good hash functions, it is very unlikely that both hash different keys to the same values. This resolves secondary clustering because even when we have two keys map to the same index using h, they are very unlikely to follow the same probe path, since that depends on g.
One issue we must consider when implementing double hashing, though, is that if g(k) is a multiple the length of the array, the probe path will not contain many distinct indices. For example, if g(k) == arr.length then i+j\cdot g(k)=i for all choices of j (the same goes for if g(k) == x*arr.length). For this reason we want to ensure that g(k) is not a multiple of arr.length. One way to do this is to say if g(k)==x*arr.length then use g(k)=1 instead.
5 Running Time Analysis
For our running time analysis we will consider a separate-chaining hash table for simplicity. Running time analysis for open addressing hash tables yields a similar result, but requires more probabilistic analysis to look at things like the expected length of a probing pattern, so we will omit this analysis.
Let’s begin by considering the time required for an unsuccessful find operation. We say a find operation is unsuccessful if the key given does not appear in the dictionary (and therefore the operation will return null or some other default value). We use an unsuccessful find for our analysis because this is the worst-case for a find operation (the running time of a successful find is less than or equal to that of an unsuccessful find), and insert and delete each require only a constant amount of work in addition to a find operation.
For a separate-chaining hash table, the find operation would behave as follows (copied from above):
find: Use the hash function to map the key to an integer, mod that integer by the length of the array to get an index. Check for the key in the linked list at that index, and return the corresponding value if present.
For an unsuccessful find, we must look at all of the key-value pairs in the linked list found at hashed-to index before returning the default value. This means the running time of find is linear in terms of the length of that linked list. In the worst case, it’s certainly possible that, for a hash table containing n key-value pairs, the length of the linked list is n. This means that the worst case running time for find is \Theta(n).
Provided we have a good hash function, however, this situation is exceptionally unlikely. For this reason, we would be better off looking at the expected running time to get an idea of the time find requires in practice. To find the expected running time we will begin by assuming we have a good hash function, meaning it will behave as though each key is assigned to a random index in the hash table. Now we can look at the expected (i.e. average probabilistically) length of a linked list in this separate chaining hash table.
Suppose that our hash table has n key-value pairs in it. Each of these key-value pairs is then randomly assigned to an index in the array of linked lists. Suppose that the lenght of this array is \ell. The expected size of each linked list will then be \frac{n}{\ell} (because the keys should be uniformly distributed for a good hash function). The value \frac{n}{\ell} represents both the expected size of each linked list, and also the average number of key-value pairs per index. We call this value the load factor of the hash table. The convention is to represent the load factor using the Greek letter lambda (\lambda).
Because the expected size of the linked list is \lambda, the expected running time of an unsuccessful find is also \lambda. Since our goal is to have our expected running time be O(1), we need \lambda = \frac{n}{\ell} to be constant, which requires \ell \leq c\cdot n for a constant c. In other words, the size of the array holding our linked lists must be linear in the size of the hash table.
As a consequence of this analysis, hash tables will behave in a way similar to array lists. In order to keep \lambda \leq c for a constant c, we will need to resize our array whenever \lambda > c. This hash table resizing operation is called rehashing.
5.1 Rehashing
When resizing an array list we created a new array whose length was twice that of the original, then copied all items from the old array into the same index of the new array. Because we performed a modulus operation on the hash of our keys to select the index to store each key-value pair at, when we change the size of the array the index for each key-value pair may need to change as well. Therefore we cannot simply move the key-value pair at index i of the old array to index i of the new array. Instead we will need to re compute the index by modding the hash by the new table’s length. The rehashing procedure is therefore:
- If `table.size / arr.lengh > c’ then do the following.
- create a new array whose length is at least
2*arr.length(see advice below for more on the length to select), call thisnewArr - for each key-value pair in the hash table re-insert that key-value pair at the index
h(k) % newArr.length - set
arr = newArr
6 Advice
Here are some general tips for implementing hash tables:
- Use prime numbers: whenever values used for the hash function share common factors with the length of the array we will lose some uniformity of our hash function. For this reason, it’s helpful to use prime numbers inside the hash function as well as for the lengths of the array to minimize the odds of this ocurring
- Because we want to use prime numbers as the length of our array, and computing prime numbers is expensive, use a pre-computed and hard-coded list of primes as a field in your data structure to use for rehashing. Similar to with array lists, we want the resized array’s length to be roughly double that of the original array.
- If you need the hash table’s size to be larger than the largest pre-computed prime, use values of the form 2^k-1 as your array length, as these values on average tend to have few prime factors (and in fact have a high chance of being prime).
- When rehashing, you should bound \lambda by the following constants:
- For a separate chaining hash table, keep \lambda < 2.
- For an open addressing hash table, keep \lambda < \frac{1}{2} (observe that we cannot mathematically have a load factor greater than 1 in an open addressing hash table because this would imply that there is more than 1 item per index on average, and so the pigeon hole principle requires at least one index would need to contain at least 2 key-value pairs, which is not possible).