Hashing Study Guide

Brute force approach. All data is just a sequence of binary digits (bits). Can treat key as a gigantic number and use it as an array index. Requires exponentially large amounts of memory.

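Hashing. Instead of using the entire key, represent the entire key by a smaller value. In Java, we hash objects with a hashCode method (a hash function) that returns a 32-bit integer representation of the object (a hash code). Hash tables reduce the hash code modulo M to get a hash table index between 0 and M - 1. Since hash codes can be negative and Java's % operator can return a negative result, the reduction must be done so that the index is never negative (e.g. with Math.floorMod).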
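
As a minimal sketch of the reduction from hash code to bin index: the class and method names below (HashIndexSketch, indexFor) are made up for illustration and are not part of any library. Only hashCode and Math.floorMod are real Java APIs.

```java
final class HashIndexSketch {
    /** Maps a key's 32-bit hash code to a bin index in [0, m - 1]. */
    static int indexFor(Object key, int m) {
        // hashCode() may be negative, and Java's % keeps the sign of the
        // dividend, so Math.floorMod is used to get a non-negative index.
        return Math.floorMod(key.hashCode(), m);
    }

    public static void main(String[] args) {
        int m = 4;
        System.out.println(indexFor("a", m));      // "a".hashCode() is 97, so this prints 1
        System.out.println(-7 % m);                // prints -3, not a legal array index
        System.out.println(Math.floorMod(-7, m));  // prints 1
    }
}
```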

Designing good hash functions. Requires a blending of sophisticated mathematics and clever engineering; beyond the scope of this course. If hashCode is known and easy to invert, an adversary can design a sequence of inputs that results in everything being placed in one bin. Or if hashCode is just plain bad, the same thing can happen. In this class (and in most real-world settings), IntelliJ can generate a reasonably good hashCode for us.
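
For a rough idea of what a reasonable hashCode looks like, here is a sketch for a hypothetical PointSketch class (not from the course). It is similar in spirit to IDE-generated code, though the exact code an IDE produces may differ. It combines every field that equals compares, so equal objects always get equal hash codes.

```java
import java.util.Objects;

final class PointSketch {
    private final int x;
    private final int y;

    PointSketch(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PointSketch)) return false;
        PointSketch other = (PointSketch) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Uses exactly the fields that equals compares.
        return Objects.hash(x, y);
    }
}
```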

Uniform hashing assumption. For our analyses below, we assume that our hash function distributes all input data evenly across bins. This is a strong assumption and is never exactly satisfied in practice.

Collision resolution. Two philosophies for resolving collisions discussed in class: separate chaining (aka external chaining) and open addressing. We’ll mainly focus on separate chaining.

Separate-chaining hash table. Key-value pairs are stored in one of M bins, each of which is a linked list of nodes. The key's hash code determines which bin to use. Searching for an item or adding a new one may require scanning through the entire bin.
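
The following is a minimal sketch of a separate-chaining table with a fixed number of bins, assuming non-null keys. ChainedMapSketch and its members are illustrative names, not the class implementation, and resizing is deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedMapSketch<K, V> {
    /** One key-value pair stored in a bin. */
    private static final class Node<K, V> {
        final K key;
        V value;

        Node(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final List<LinkedList<Node<K, V>>> bins = new ArrayList<>();

    ChainedMapSketch(int m) {
        for (int i = 0; i < m; i++) {
            bins.add(new LinkedList<>());
        }
    }

    /** Picks the bin for a key from its hash code. */
    private LinkedList<Node<K, V>> binFor(K key) {
        return bins.get(Math.floorMod(key.hashCode(), bins.size()));
    }

    /** Scans the key's bin; returns null if the key is absent. */
    V get(K key) {
        for (Node<K, V> node : binFor(key)) {
            if (node.key.equals(key)) {
                return node.value;
            }
        }
        return null;
    }

    /** Scans the key's bin; overwrites an existing value or appends a new node. */
    void put(K key, V value) {
        LinkedList<Node<K, V>> bin = binFor(key);
        for (Node<K, V> node : bin) {
            if (node.key.equals(key)) {
                node.value = value;
                return;
            }
        }
        bin.add(new Node<>(key, value));
    }
}
```

With M = 4, the strings "a" and "e" (hash codes 97 and 101) both land in bin 1, so looking up either one scans the same list.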

Resizing separate-chaining hash tables. Understand how resizing may lead to objects moving from one bin to another. The primary goal is to keep M proportional to N, i.e. to maintain a load factor N / M that is bounded above by some constant.
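
To see items move between bins, here is a small sketch that rehashes Integer keys into a larger table. ResizeSketch, emptyBins, and resize are hypothetical helper names, and bins hold bare keys rather than key-value pairs to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

final class ResizeSketch {
    /** Builds m empty bins. */
    static List<List<Integer>> emptyBins(int m) {
        List<List<Integer>> bins = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            bins.add(new ArrayList<>());
        }
        return bins;
    }

    /** Rehashes every key from oldBins into a fresh table with newM bins. */
    static List<List<Integer>> resize(List<List<Integer>> oldBins, int newM) {
        List<List<Integer>> newBins = emptyBins(newM);
        for (List<Integer> bin : oldBins) {
            for (Integer key : bin) {
                // The index depends on M, so it must be recomputed for the new table.
                newBins.get(Math.floorMod(key.hashCode(), newM)).add(key);
            }
        }
        return newBins;
    }

    public static void main(String[] args) {
        List<List<Integer>> bins = emptyBins(4);
        for (int key : new int[] {5, 9, 13}) {
            bins.get(Math.floorMod(Integer.hashCode(key), 4)).add(key);
        }
        System.out.println(bins);            // [[], [5, 9, 13], [], []]
        System.out.println(resize(bins, 8)); // [[], [9], [], [], [], [5, 13], [], []]
    }
}
```

With M = 4 all three keys share bin 1. After doubling to M = 8, the key 9 stays in bin 1 while 5 and 13 move to bin 5.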

Performance of separate-chaining hash tables. Cost of an operation is given by the size of the bin that must be examined. With the uniform hashing assumption, we can say that “on average” the runtime for operations is proportional to N / M, which is no larger than some constant due to multiplicative resizing.
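
As a quick check on the claim that N / M stays bounded, the sketch below tracks only the counts N and M while items are added, doubling M whenever the load factor would exceed a chosen limit. LoadFactorSketch, maxLoadFactor, and the limit of 1.5 used in main are assumptions for illustration, not values from the course.

```java
final class LoadFactorSketch {
    /** Returns the largest load factor N / M seen while adding n items one at a time. */
    static double maxLoadFactor(int n, int initialM, double maxLoad) {
        int m = initialM;
        double worst = 0.0;
        for (int size = 1; size <= n; size++) {
            // Multiplicative resizing: double M when the limit would be exceeded.
            if ((double) size / m > maxLoad) {
                m *= 2;
            }
            worst = Math.max(worst, (double) size / m);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Stays at or below the chosen limit no matter how large n gets.
        System.out.println(maxLoadFactor(1_000_000, 4, 1.5));
    }
}
```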

Recommended Problems

  1. [Adapted from Textbook 3.4.5] Is the following implementation of hashCode valid for any equals implementation?
    public int hashCode() {
        return 17;
    }
    
  2. In class, we gave the runtime for hash tables assuming the data structure for each bin is a linked list. It turns out that Java’s implementation of HashSet and HashMap sometimes converts bins from linked lists to balanced binary search trees. Why not always use balanced binary search trees? How does this affect the runtime analysis in the best case? In the worst case?

  3. Q1c from CS 61B 15sp MT2
  4. Q1b from CS 61B 16sp MT2
  5. Q1d from CS 61B 16sp MT2
  6. Q2b from CS 61B 17sp MT2
  7. Q2 from CS 61B 18sp MT2
  8. Q7 from CS 61B 15sp MT2
  9. Q5a, Q5b from CS 61BL 18su MT2