I first asked people to consider a series of structures we have looked at and to think about how expensive it would be to perform three specific operations:
| Structure | add | find | remove | notes |
|---|---|---|---|---|
| unsorted array | O(1) | O(n) | O(n) | remove is expensive only because you have to first find the value |
| sorted array | O(n) | O(log n) | O(n) | add and remove are expensive because you have to shift values |
| unsorted linked list | O(1) | O(n) | O(n) | remove is expensive only because you have to first find the value |
| sorted linked list | O(n) | O(n) | O(n) | add and remove are expensive because you have to find the right spot |
| binary search tree | O(log n) | O(log n) | O(log n) | assuming the tree is balanced |
Obviously the well-behaved structure here is the binary search tree because everything becomes O(log n). That's why the Java class libraries include tree implementations of sets and maps. But it turns out you can do even better. With a structure known as a hash table, you can get O(1) for each of these operations.
First you need some way to turn your data into an int. The function that does this is known as your hash function:
hash function: data --> int

In Java, every object has a method called hashCode() that does this. It is built into Java whether you define it or not. Some classes have specialized hash functions. The String class, for example, overrides the hashCode method with a hash function that is particularly effective for Strings.
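To make the idea of "data --> int" concrete, here is a hand-written hash function for Strings. It follows the polynomial formula that String.hashCode() is documented to use, so the two produce the same value (the class and method names here are just for illustration):

```java
public class StringHashDemo {
    // A hash function: turn a String into an int using the polynomial
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], which is the formula
    // the String class documents for its hashCode() method.
    public static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("Reges"));       // 78839842
        System.out.println("Reges".hashCode());  // same value
    }
}
```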
One of the other basic ideas in hashing is to have a table that is somewhat "roomy" relative to the data you are going to include. There is a special value known as the load factor (sometimes referred to as lambda) that indicates how full the table is. A typical value for the load factor would be 0.5, indicating that the table is half full. So if you wanted to have 5,000 values in the table, you'd make the table 10,000 elements long.
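The arithmetic above can be sketched directly (the variable names are just illustrative):

```java
public class LoadFactorDemo {
    public static void main(String[] args) {
        double loadFactor = 0.5;  // lambda: fraction of the table in use
        int valueCount = 5000;    // how many values we plan to store
        // Size the table so that valueCount / tableSize equals the load factor.
        int tableSize = (int) (valueCount / loadFactor);
        System.out.println(tableSize);  // 10000
    }
}
```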
A hash table is typically allocated as an array. So imagine allocating an array with 10,000 locations for storing our 5,000 values:
```
       +---------+
   [0] |         |
       +---------+
   [1] |         |
       +---------+
   [2] |         |
       +---------+
   [3] |         |
       +---------+
 [...] |   ...   |
       +---------+
[9999] |         |
       +---------+
```

Suppose that we are including Strings and we want to put the String "Reges" into the table. We use the hash function to turn the String into an int. You can ask Java to tell you the value of:
"Reges".hashCode()

I did that before lecture and found out that it is 78839842. We take this and mod it by the size of our table to find a location (78839842 % 10000, which equals 9842). So we put the value in that location:
```
       +---------+
   [0] |         |
       +---------+
   [1] |         |
       +---------+
   [2] |         |
       +---------+
 [...] |   ...   |
       +---------+
[9842] | "Reges" |
       +---------+
 [...] |   ...   |
       +---------+
[9999] |         |
       +---------+
```

Later, if someone asks me to find whether "Reges" is in the list, I again use the hash function and figure out that if it's in the list, it would have to be at location 9842. This allows me to quickly go to the exact spot in the list where it should be.
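The index computation can be checked directly in a few lines (the table size of 10000 is the one from the running example):

```java
public class IndexDemo {
    public static void main(String[] args) {
        int tableSize = 10000;
        int hash = "Reges".hashCode();  // 78839842
        int index = hash % tableSize;   // where the value belongs
        // Note: a real table must also handle negative hash codes,
        // e.g. with Math.abs, since Java's % can return a negative result.
        System.out.println(index);      // 9842
    }
}
```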
Of course, it's not quite that simple. Why? Because two values might end up going into the same spot. Some other String might also end up belonging in array position 9842. That is known as a collision and a lot of work has been done to figure out how to resolve collisions. One way to resolve the collision is to keep a linked list for each array value that has a list of all the values that went to that particular spot in the array. In other words, our array becomes an array of linked lists. This technique is called separate chaining and it is the technique used by Java's HashMap and HashSet classes.
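Separate chaining can be sketched in a short class. This is a minimal illustration of the idea, not Java's actual HashSet implementation, and the names are made up:

```java
import java.util.LinkedList;

// A minimal sketch of a hash set that resolves collisions with
// separate chaining: each array slot holds a linked list of the
// values that hashed to that slot.
public class ChainedHashSet {
    private LinkedList<String>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) {
        buckets = new LinkedList[capacity];
    }

    // Map a hash code to a legal array index (hash codes can be negative).
    private int indexOf(String value) {
        return Math.abs(value.hashCode() % buckets.length);
    }

    public void add(String value) {
        int i = indexOf(value);
        if (buckets[i] == null) {
            buckets[i] = new LinkedList<>();
        }
        if (!buckets[i].contains(value)) {  // scan only this one chain
            buckets[i].add(value);
            size++;
        }
    }

    public boolean contains(String value) {
        int i = indexOf(value);  // jump straight to the right chain
        return buckets[i] != null && buckets[i].contains(value);
    }

    public int size() {
        return size;
    }

    public static void main(String[] args) {
        ChainedHashSet set = new ChainedHashSet(10000);
        set.add("Reges");
        System.out.println(set.contains("Reges"));  // true
        System.out.println(set.contains("hash"));   // false
    }
}
```

With a good hash function the chains stay short, so add and contains touch only a handful of values no matter how large the set grows.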
So is hashing guaranteed to perform well? Not if you have a bad hash function. For example, suppose that your hash function turns all of your data into the int 42. That is technically a hash function because it converts data into an int, but because everything goes to 42, the hash table falls apart. With separate chaining, we end up with a really long linked list at array index 42, which means we degenerate to the poor performance of an unsorted linked list.
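The degenerate case is easy to demonstrate with a hypothetical wrapper class whose hashCode() ignores the data entirely (BadKey is a made-up name for this sketch):

```java
// Hypothetical key type with the worst possible hash function:
// every object hashes to 42 regardless of its data.
class BadKey {
    final String data;

    BadKey(String data) {
        this.data = data;
    }

    @Override
    public int hashCode() {
        return 42;  // technically a hash function, but useless
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).data.equals(this.data);
    }
}

public class BadHashDemo {
    public static void main(String[] args) {
        int tableSize = 10000;
        String[] words = {"Reges", "hash", "table"};
        for (String w : words) {
            // Every key lands in bucket 42, so one chain holds everything.
            System.out.println(w + " -> bucket "
                    + (new BadKey(w).hashCode() % tableSize));
        }
    }
}
```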
But in practice if you have a hash function that spreads things out and you have a good collision resolution strategy, you will end up examining only a few items on average (on the order of 2 or 3 values).
We then spent a few minutes looking at a "quick and dirty" implementation of HashSet that competes favorably with Java's built-in HashSet. We saw that both HashSet and TreeSet are considerably faster than what you could get with an ArrayList and that HashSet is somewhat faster than TreeSet.