In the last couple of assignments, we've used two different kinds of maps, TreeMap and HashMap. Both of these have essentially the same interface - we can store <key,value> pairs, and, given a key, we can retrieve the associated value if one is stored in the map. The main difference between the two is that when we asked the map to return a list of all keys, the TreeMap returned a sorted list, while the HashMap returned the keys in some apparently random order. Today we'll take a short look at how these are implemented, why they behave slightly differently, and what the tradeoffs are.
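To see this difference concretely, here is a small example (the class and variable names are just for illustration) that stores the same keys in a TreeMap and a HashMap and prints the key set of each:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public class MapOrderDemo {
        public static void main(String[] args) {
            String[] words = {"pear", "apple", "mango", "banana"};

            Map<String, Integer> treeMap = new TreeMap<>();
            Map<String, Integer> hashMap = new HashMap<>();
            for (String w : words) {
                // Count each word in both kinds of map.
                treeMap.merge(w, 1, Integer::sum);
                hashMap.merge(w, 1, Integer::sum);
            }

            // TreeMap hands back its keys in sorted (natural) order...
            System.out.println("TreeMap keys: " + treeMap.keySet());
            // ...while HashMap's order depends on the hash codes and the
            // table size, so it looks arbitrary.
            System.out.println("HashMap keys: " + hashMap.keySet());
        }
    }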
In both cases, we have a set of keys with no duplicates that we wish to be able to store and retrieve quickly. In the case of a map, we store a value along with the key, but the same issues come up if we just try to store a collection of items in a set.
Both TreeMap and the related library class TreeSet use balanced binary search trees to store the keys. This provides guaranteed O(log n) add, remove, and search operations. A restriction is that the objects used as keys have to be comparable (using the compareTo method) so decisions can be made about where to store the objects in the tree and how to proceed while searching for them. (We'll revisit compareTo next week when we look in more detail at how this should be implemented.) A side effect of using a BST is that we can retrieve a list of the keys in sorted order with an inorder traversal of the tree.
TreeMaps give us good behavior compared to lists, which at best can be searched in O(log n) time (if kept sorted) but can require O(n) time for add and remove. A natural question is whether we can do better, and the answer is yes - provided we are willing to give up the ability to retrieve the keys in sorted order. The method of choice here is hashing, which not only gives us O(1) operations (if we do it right), but also works with arbitrary objects, not just comparable ones.
The basic idea is fairly simple. Instead of using a tree to store the data, we'll use an array whose elements are often called "buckets". That gives us O(1) access to individual elements, provided that we can quickly figure out which array element, or bucket, holds the item we're interested in. To do that, we use a hash function to compute a bucket number from the item itself.
The hash function works from the following observation: ultimately every object is a collection of bits. While those bits might represent strings, colors, rectangles, or players in a game, they are just bits, and we can always interpret the bits as an integer or collection of integers. So what a hash function does, informally, is take the bits that make up an object and use them to compute an integer hash code.
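For example, Java's String class computes its hash code by combining the bits of every character into a single int. A simplified sketch in the same spirit (not the actual library code) might look like this:

    // Combine every character of a string into one int, multiplying by 31
    // at each step so that position matters ("ab" and "ba" hash differently).
    static int simpleHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }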
If we can do this quickly, O(1), we can use the code to select a bucket in our hash table. The hash function can compute a bucket number directly, or it can just return an arbitrary integer, in which case we can use the remainder operation (%) to calculate the bucket number: hashCode % nBuckets.
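One detail to watch in Java: hashCode can return a negative number, and % would then produce a negative remainder. A safer sketch of the bucket calculation (nBuckets here stands for the length of the table) is:

    // Map an arbitrary hash code to a bucket index in the range [0, nBuckets).
    // Math.floorMod always returns a non-negative result, unlike %.
    static int bucketFor(Object key, int nBuckets) {
        return Math.floorMod(key.hashCode(), nBuckets);
    }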
(A side effect of all of this is that the objects will be scattered in the hash table, which means that if we go sequentially through the table to access them, they are unlikely to be sorted.)
There is at least one potential problem here: what if the hash computation gives us the same bucket number for two different items? In that case we have a collision, and we need a strategy for handling it when it happens. One that's easy to visualize is to have each bucket store a list of all of the items that hash to that bucket. Then to find an item, we compute the bucket number and look for it in the list of items in that bucket. There are other strategies for handling collisions - see any good data structures book. Alas, we don't have time to cover them here.
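To make the list-per-bucket ("separate chaining") idea concrete, here is a bare-bones sketch of a hash set built this way; the class and method names are invented for illustration, and it leaves out resizing and removal:

    import java.util.ArrayList;
    import java.util.List;

    class ChainedHashSet<E> {
        // One list of items per bucket.
        private final List<List<E>> buckets;

        ChainedHashSet(int nBuckets) {
            buckets = new ArrayList<>(nBuckets);
            for (int i = 0; i < nBuckets; i++) {
                buckets.add(new ArrayList<>());
            }
        }

        // Use the item's hash code to pick its bucket.
        private List<E> bucketFor(E item) {
            return buckets.get(Math.floorMod(item.hashCode(), buckets.size()));
        }

        // Search only the one bucket the item could be in.
        public boolean contains(E item) {
            return bucketFor(item).contains(item);
        }

        // Add the item to its bucket's list, skipping duplicates.
        public void add(E item) {
            List<E> bucket = bucketFor(item);
            if (!bucket.contains(item)) {
                bucket.add(item);
            }
        }
    }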
The next issue is performance. The hash function itself, if implemented properly, gives us a hash code in O(1) time, and access to a bucket (an element of an array) is also O(1). So potentially, add, search, and remove in a hash table can be done in O(1) time. But this only works if each item is stored in its own bucket or, at worst, if each bucket holds only a small number of items. To get good performance, then, we need to be sure that the items in the set or map are scattered around so they don't all wind up in the same bucket, which would require a long search for an item even after we know its hash code.
There are two things that need to be done right to be sure the items are spread out and that each bucket holds only a few of them: we need a good hash function, one that scatters the keys uniformly across the buckets instead of clumping them together, and we need to keep the load factor - the ratio of items stored to the number of buckets - low, typically by growing the table and rehashing everything when it gets too full.
If we have both a good hash function and a low load factor, then we do indeed get O(1) typical performance for the core operations of a hash table. This is the algorithm underlying Java's HashMap and HashSet data structures, which makes these a good choice if we want fast access based on key values and don't care whether the keys are stored in any particular order.
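In fact, HashMap lets you set both of these knobs when you construct it: an initial number of buckets and a load factor that controls when the table grows. A small usage example (the numbers and names are just illustrative; 0.75 is also HashMap's default load factor):

    import java.util.HashMap;
    import java.util.Map;

    public class LoadFactorDemo {
        public static void main(String[] args) {
            // Start with 64 buckets; the table grows (and every entry is
            // rehashed) once the number of entries exceeds 75% of the buckets.
            Map<String, Integer> scores = new HashMap<>(64, 0.75f);
            scores.put("alice", 10);
            scores.put("bob", 7);
            System.out.println(scores.get("alice"));   // prints 10
        }
    }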