Hashing

Generalizing DataIndexedEnglishWordSet: handling collisions, improving performance, and defining hash functions.
Kevin Lin, with thanks to many others.
Ask questions anonymously on Piazza. Look for the pinned Lecture Questions thread.

Unlike the theme of previous lectures where we questioned existing invariants and assumptions, this lecture will start off more on-rails to develop a new approach to implementing the Set and Map ADTs.

Feedback from the Reading Quiz
The reading introduced us to a data organization strategy based on constant-time array indexing. This idea led to two classes: DataIndexedIntegerSet and DataIndexedEnglishWordSet.

DataIndexedStringSet
Using only lowercase English words is too restrictive. We’d like a more general solution that can at least store strings containing punctuation and digits.

char Representation
The most basic character set used by most computers is called ASCII.

Each possible character is assigned a value between 0 and 127. “Printable” characters are between 33 and 126.
char c = 'D' is like char c = 68
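
A minimal sketch of the char-to-number correspondence. The class name CharCodes and the helper codeOf are made up for illustration; the only language feature relied on is that a Java char converts to an int without a cast.

```java
class CharCodes {
    /** Returns the numeric code that Java stores for the given char. */
    static int codeOf(char c) {
        return c; // a char widens to an int implicitly
    }

    public static void main(String[] args) {
        char fromLiteral = 'D';
        char fromNumber = 68; // the same character, written as its code
        System.out.println(fromLiteral == fromNumber); // true
        System.out.println(codeOf('b'));               // 98
        System.out.println((char) 101);                // e
    }
}
```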

DataIndexedStringSet
Since the maximum value for printable ASCII characters is 126, set the base to 126.
Let’s try to represent the ASCII strings “bee”, “2pac”, and “eGg!”.

bee (base 126)  = (98 * 126^2) + (101 * 126^1) + (101 * 126^0) = 1,568,675
2pac (base 126) = (50 * 126^3) + (112 * 126^2) + (97 * 126^1) + (99 * 126^0) = 101,809,233
eGg! (base 126) = (101 * 126^3) + (71 * 126^2) + (103 * 126^1) + (33 * 126^0) = 203,178,183

The conversion from a string to an integer is the hash function. The resulting integer is the hash code.
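
The base-126 conversion can be sketched in a few lines of Java. The class name Base126 and the method toBase126 are hypothetical, and the sketch uses a long (an assumption, not something the slides prescribe) so that these short examples do not overflow.

```java
class Base126 {
    /** Interprets s as a base-126 number whose digits are its char codes. */
    static long toBase126(String s) {
        long total = 0;
        for (int i = 0; i < s.length(); i += 1) {
            // Shift the running total one base-126 digit left, then add the next char.
            total = total * 126 + s.charAt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(toBase126("bee"));  // 1568675
        System.out.println(toBase126("2pac")); // 101809233
    }
}
```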
?: What’s problematic about this approach?

Beyond ASCII
Java chars support other languages and symbols, such as Chinese.
Chinese characters have char values up to 40,959, so set the base to 40,959.

守门员 (base 40959) = (23432 * 40959^2) + (38376 * 40959^1) + (21592 * 40959^0) = 39,312,024,869,368

[Figure: boolean array with neighboring indices 39,312,024,869,367 (守门呗), 39,312,024,869,368 (守门员), and 39,312,024,869,369 (守门呙); each entry is marked T or F.]
?: What’s a potential problem with this hash function?

Integer Overflow Collisions
In Java, the largest int is 2,147,483,647.
Going over this limit results in overflow, starting back over at the smallest int.
If there are more unique mappings than unique ints, then collisions will still occur!
int x = 2147483647;
System.out.println(x);
// 2147483647
System.out.println(x + 1);
// -2147483648

DataIndexedStringSet disi = new DataIndexedStringSet();
disi.add("melt banana");
disi.contains("subterrestrial anticosmetic");
// true: both strings hash to 839099497
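
To see the wraparound on a hash computation, here is a sketch that evaluates the base-40959 value of a three-character Chinese word twice: once in a long, where it fits exactly, and once in an int, where it overflows. The class and method names are hypothetical.

```java
class OverflowSketch {
    static final int BASE = 40959;

    /** Exact value using 64-bit arithmetic (large enough for short strings). */
    static long exactValue(String s) {
        long total = 0;
        for (int i = 0; i < s.length(); i += 1) {
            total = total * BASE + s.charAt(i);
        }
        return total;
    }

    /** Same computation in 32-bit int arithmetic, which silently wraps around. */
    static int wrappedValue(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); i += 1) {
            total = total * BASE + s.charAt(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // The three-character word from the base-40959 example, written with
        // Unicode escapes so the file compiles under any source encoding.
        String word = "\u5B88\u95E8\u5458";
        System.out.println(exactValue(word));                             // 39312024869368
        System.out.println(exactValue(word) > Integer.MAX_VALUE);         // true
        System.out.println(wrappedValue(word) == (int) exactValue(word)); // true
    }
}
```

The int version keeps only the low 32 bits of the exact value, which is why two different strings can end up with the same int hash code.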

Collisions are Inevitable
There are 4,294,967,296 unique Java ints.


The pigeonhole principle says that if there are more than 4,294,967,296 items, at least two items must share the same hash code.
Too Many Pigeons (BenFrantzDale/Wikimedia)
We can’t give each of the 10 pigeons a unique pigeonhole!

Separate Chaining
Instead of storing a boolean, store a bucket of items at the given index.

Each bucket in our array is initially empty. When an item x gets added at index h…
If bucket h is empty, create a new list containing x and store it at index h.
If bucket h is already a list, add x to this list if it is not already present.
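
The two rules above can be sketched as follows. Everything named here is hypothetical, and the index h is a deliberate placeholder (the string's length), chosen only so the array stays tiny and "abomamora" and "adevilish" land in the same bucket. A real table would derive h from a hash code.

```java
import java.util.ArrayList;
import java.util.List;

class SeparateChainingSketch {
    private final List<String>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    SeparateChainingSketch(int capacity) {
        // Every bucket starts out empty (null).
        buckets = (List<String>[]) new List[capacity];
    }

    /** Placeholder for the index h. Assumes x.length() < capacity. */
    private int index(String x) {
        return x.length();
    }

    void add(String x) {
        int h = index(x);
        if (buckets[h] == null) {
            buckets[h] = new ArrayList<>(); // bucket h was empty: create its list
        }
        if (!buckets[h].contains(x)) {      // only add x if it is not already present
            buckets[h].add(x);
            size += 1;
        }
    }

    boolean contains(String x) {
        List<String> bucket = buckets[index(x)];
        return bucket != null && bucket.contains(x);
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        SeparateChainingSketch set = new SeparateChainingSketch(16);
        set.add("a");
        set.add("abomamora");
        set.add("adevilish");
        set.add("abomamora"); // duplicate, ignored
        System.out.println(set.size());                // 3
        System.out.println(set.contains("adevilish")); // true
    }
}
```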

[Figure: array with indices 0, 1, ..., 111239443, 111239444, 111239445, ... whose buckets hold a, abomamora, and adevilish after the calls add("a"), add("abomamora"), add("adevilish"), add("abomamora"), followed by contains("adevilish").]
?: Why is it necessary to check if x is not already present in the bucket before adding x?




?: When would it not be necessary to check if x is already present in the bucket?

Separate Chaining Runtime
The worst-case runtime is proportional to Q, the length of the longest list.
Worst case time                     contains(x)    add(x)
Bushy BSTs                          Θ(log N)       Θ(log N)
DataIndexedSet                      Θ(1)           Θ(1)
Separate Chaining DataIndexedSet    Θ(Q)           Θ(Q)
[Figure: sparse array showing indices 0, 1487, 2074, 3097, 111239442, and 111239443, with buckets holding cat, doc, bee, abomamora, and adevilish.]
?: Why is the runtime for separate chaining in terms of Q, the length of the longest list?

Saving memory with Separate Chaining and modulus
Instead of using the raw hash code, take the modulus of the hash code to compute index.
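
One detail matters when doing this in Java: hash codes can be negative after overflow, and the % operator keeps the sign of its left operand. The sketch below uses Math.floorMod, which always returns a value from 0 to numBuckets - 1 for a positive numBuckets. The class and method names are hypothetical.

```java
class BucketIndex {
    /** Reduces any int hash code to a valid index in [0, numBuckets). */
    static int bucketIndex(int hashCode, int numBuckets) {
        return Math.floorMod(hashCode, numBuckets);
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(111239443, 10)); // 3
        System.out.println(bucketIndex(-7, 10));        // 3
        System.out.println(-7 % 10);                    // -7, not a valid array index
    }
}
```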
[Figure: the sparse array (indices 0, 1487, 2074, 3097, 111239442, 111239443) is reduced mod 10 into an array with indices 0 to 9 holding cat, bee, doc, abomamora, and adevilish.]
?: Do items with the same hash code (collision) still collide after applying mod 10? What about items with different hash codes?




?: How does this change affect runtime? The length of the longest list, Q?

Hash Table
Data is converted by a hash function into an integer representation called a hash code.
The hash code is reduced to a bucket index with the modulo operator.
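
As a sketch of the two steps end to end, assume the hash function is the base-40959 scheme from earlier, computed in int arithmetic, and assume a table of 10 buckets. The names HashPipeline, hashCodeOf, and indexOf are hypothetical.

```java
class HashPipeline {
    /** Step 1: object to hash code (base-40959 digits, int arithmetic may wrap). */
    static int hashCodeOf(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i += 1) {
            h = h * 40959 + s.charAt(i);
        }
        return h;
    }

    /** Step 2: hash code to bucket index, via modulo the number of buckets. */
    static int indexOf(String s, int numBuckets) {
        return Math.floorMod(hashCodeOf(s), numBuckets);
    }

    public static void main(String[] args) {
        // A two-character word (U+62B1 twice), written with Unicode escapes
        // so the file compiles under any source encoding.
        String word = "\u62B1\u62B1";
        System.out.println(hashCodeOf(word));  // 1034854400
        System.out.println(indexOf(word, 10)); // 0
    }
}
```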
[Figure: Object (抱抱) → hash function → hash code (1034854400) → modulo length → index (0), shown beside a hash table with indices 0 to 9 holding bee, 抱抱, están, doc, الطبيعة, शानदार, and 포옹.]

Hash Table Runtime
Good news. We use way less memory and support any String.
Bad news. Worst case runtime is now Θ(Q), where Q is the length of the longest list.
Worst case time                 contains(x)    add(x)
Bushy BSTs                      Θ(log N)       Θ(log N)
DataIndexedSet                  Θ(1)           Θ(1)
Separate Chaining Hash Table    Θ(Q)           Θ(Q)

[Figure: hash table with 5 buckets, indices 0 to 4.]
?: What’s a potential problem with saving memory by using the modulus idea?

For a hash table with 5 buckets, give the order of growth of Q, the length of the longest list, with respect to N, the total size.

Hash Table Runtime
Best case. All items are distributed evenly across 5 buckets, so Q ~ N / 5.
Worst case. All items collide in a single bucket, so Q = N.
Overall. Q ∈ Θ(N)
?: How can we improve hash table runtime?

Improving Hash Table Runtime
Even if items are distributed evenly, lists are of length Q = N / M. For M = 5, Q ∈ Θ(N).
How can we improve our design to guarantee that Q ∈ Θ(1)?

Hash Table Resizing
Borrow an idea from ArrayList: resize the number of buckets, M, at the same rate as N.
For example, when N / M ≥ 1.5, double the number of buckets, M.
N / M is the load factor. Resizing whenever it reaches 1.5 ensures the average list length is never more than 1.5 items.
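
A minimal sketch of a resizing hash set under these rules. The class name, the use of nested ArrayLists for buckets, and the choice to check the load factor right after each insertion are all assumptions made for illustration; other orderings are equally reasonable.

```java
import java.util.ArrayList;
import java.util.List;

class ResizingHashSetSketch<T> {
    private static final double MAX_LOAD = 1.5;
    private List<List<T>> buckets;
    private int size = 0;

    ResizingHashSetSketch(int initialBuckets) {
        buckets = emptyBuckets(initialBuckets);
    }

    private static <E> List<List<E>> emptyBuckets(int m) {
        List<List<E>> result = new ArrayList<>(m);
        for (int i = 0; i < m; i += 1) {
            result.add(new ArrayList<>());
        }
        return result;
    }

    private int index(T x, int m) {
        return Math.floorMod(x.hashCode(), m);
    }

    boolean contains(T x) {
        return buckets.get(index(x, buckets.size())).contains(x);
    }

    void add(T x) {
        if (contains(x)) {
            return;
        }
        buckets.get(index(x, buckets.size())).add(x);
        size += 1;
        // N / M >= 1.5: double the number of buckets.
        if ((double) size / buckets.size() >= MAX_LOAD) {
            resize(2 * buckets.size());
        }
    }

    /** Every item must be re-indexed, because its index depends on M. */
    private void resize(int newM) {
        List<List<T>> bigger = emptyBuckets(newM);
        for (List<T> bucket : buckets) {
            for (T item : bucket) {
                bigger.get(index(item, newM)).add(item);
            }
        }
        buckets = bigger;
    }

    int numBuckets() {
        return buckets.size();
    }

    int size() {
        return size;
    }
}
```

Starting from 4 buckets, the sixth distinct item brings N / M to 1.5 and triggers a resize to 8 buckets.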

Hash Table Resizing
When N / M ≥ 1.5, double the number of buckets, M.
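[Figure: a hash table with M = 4 buckets (indices 0 to 3) as six items (16, 20, 13, 7, 3, 11) are added, followed by a resize to M = 8 buckets (indices 0 to 7). One bucket is highlighted.]

N = 0   M = 4   N / M = 0
N = 1   M = 4   N / M = 0.25
N = 2   M = 4   N / M = 0.5
N = 3   M = 4   N / M = 0.75
N = 4   M = 4   N / M = 1
N = 5   M = 4   N / M = 1.25
N = 6   M = 4   N / M = 1.5   (resize to M = 8)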
?: After resizing, where will the bucket go?




?: Fill in the resulting hash table after resizing.

[Figure: answer. Before resizing, the items 16, 20, 13, 7, 3, and 11 sit in M = 4 buckets (N = 6, N / M = 1.5). After resizing, the same six items are spread over M = 8 buckets, indices 0 to 7 (N = 6, N / M = 0.75).]
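
To check where each item lands before and after the resize, here is a small sketch. It assumes each integer's hash code is its own value, which is how Java's Integer.hashCode behaves. The class and method names are hypothetical.

```java
import java.util.Arrays;

class RehashExample {
    /** Returns the bucket index of each item for a table with numBuckets buckets. */
    static int[] indices(int[] items, int numBuckets) {
        int[] result = new int[items.length];
        for (int i = 0; i < items.length; i += 1) {
            result[i] = Math.floorMod(Integer.hashCode(items[i]), numBuckets);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] items = {16, 20, 13, 7, 3, 11};
        System.out.println(Arrays.toString(indices(items, 4))); // [0, 0, 1, 3, 3, 3]
        System.out.println(Arrays.toString(indices(items, 8))); // [0, 4, 5, 7, 3, 3]
    }
}
```

Items that shared a bucket when M = 4 (such as 16 and 20) can separate when M = 8, while others (3 and 11) still collide.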
?: What is the best case order of growth of Q with respect to N?




?: What is the worst case order of growth of Q with respect to N?

Resizing Hash Table Runtime
Best case. All items are distributed evenly across M ~ N buckets, so Q ∈ Θ(1).
Worst case. All items collide in a single bucket, so Q ∈ Θ(N).

contains(x): Compute the hash code of x, take the modulus, and search the list of items in that bucket.
add(x): Resize if the load factor N / M reaches the resize threshold (1.5 in the earlier example). Add x if the table does not already contain x.
Most add operations will be Θ(Q), but those that trigger a resize will be Θ(N).
If we resize by a multiplicative factor (doubling, tripling, etc.), the runtime “on average” will be Θ(Q).
More detail on resizing in the future.

?: As the hash table designer, what can we do to avoid the worst case scenario? What can we do as a hash table user?

Regarding Even Distribution
Even distribution of items is critical for good hash table performance.
Both tables have a load factor of N / M = 1, but the left table is much worse!
?: What’s the order of growth of Q with respect to N for each table?

Algorithms (Sedgewick, Wayne/Pearson)
public int hashCode() {
    return 17;
}
Is this a valid hash function?

Defining Hash Functions
IntelliJ code generation demonstration.

?: IntelliJ’s code generation feature defines the equals() and hashCode() methods together. Why?

hashCode Contract
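[Figure: lookup in a hash table with 4 buckets (indices 0 to 3) holding 16, 20, 13, 7, 3, and 11. Steps: (1) hash function, (2) modulo length, (3) search list, comparing with equals against each item in the bucket.]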
We know that unequal items can return the same hash code.

?: Do equal items need to return the same hash code?
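
Yes: Java's hashCode contract requires that two objects that are equal according to equals return the same hash code. Otherwise a lookup would search the wrong bucket and never reach the equals check. Here is a sketch with a hypothetical GridPoint class that derives both methods from the same fields.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof GridPoint)) {
            return false;
        }
        GridPoint other = (GridPoint) o;
        return x == other.x && y == other.y;
    }

    /** Uses exactly the fields that equals compares, so equal points hash alike. */
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        Set<GridPoint> points = new HashSet<>();
        points.add(new GridPoint(1, 2));
        System.out.println(points.contains(new GridPoint(1, 2))); // true
        System.out.println(points.contains(new GridPoint(2, 1))); // false
    }
}
```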