CSE143 Notes 5/19/06

Maps, Sets, and Hashing

In the last couple of assignments, we've used two different kinds of maps, TreeMap and HashMap. Both of these have essentially the same interface - we can store <key,value> pairs, and, given a key, we can retrieve the associated value if one is stored in the map. The main difference between the two is that when we asked the map to return a list of all keys, the TreeMap returned a sorted list, while the HashMap returned the keys in some apparently random order. Today we'll take a short look at how these are implemented, why they behave slightly differently, and what the tradeoffs are.
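As a quick illustration (the words and variable names here are made up just for the example), both maps are used the same way, but their key sets come back in different orders:

    import java.util.*;

    public class MapOrderDemo {
        public static void main(String[] args) {
            Map<String, Integer> tree = new TreeMap<String, Integer>();
            Map<String, Integer> hash = new HashMap<String, Integer>();
            String[] words = {"pear", "apple", "mango", "banana"};
            for (int i = 0; i < words.length; i++) {
                tree.put(words[i], i);   // same puts for both maps
                hash.put(words[i], i);
            }
            System.out.println("TreeMap keys: " + tree.keySet());  // sorted order
            System.out.println("HashMap keys: " + hash.keySet());  // some other order
        }
    }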

In both cases, we have a set of keys with no duplicates that we wish to be able to store and retrieve quickly. In the case of a map, we store a value along with the key, but the same issues come up if we just try to store a collection of items in a set.

TreeMap (and the related library class TreeSet) both use balanced binary search trees to store the keys. This provides guaranteed O(log n) add, remove, and search operations. A restriction is that the objects used as keys have to be comparable (using the compareTo method) so decisions can be made about where to store the objects in the tree and how to proceed while searching for them. (We'll revisit compareTo next week when we look in more detail at how it should be implemented.) A side effect of using a BST is that we can retrieve a list of the keys in sorted order with an inorder traversal of the tree.
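To make the restriction concrete, here is a minimal sketch of a class that could serve as a TreeMap key (the class itself is hypothetical, not something from the assignments):

    // hypothetical key class; compareTo orders by id, so a TreeMap with
    // StudentID keys would keep them in increasing id order
    public class StudentID implements Comparable<StudentID> {
        private int id;

        public StudentID(int id) {
            this.id = id;
        }

        public int compareTo(StudentID other) {
            // negative, zero, or positive, as compareTo requires
            // (safe here as long as ids are small nonnegative numbers)
            return this.id - other.id;
        }
    }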

TreeMaps give us good behavior compared to lists, which at best can be searched in O(log n) time (binary search on a sorted list) but can require O(n) time for add and remove. A natural question is whether we can do better, and the answer is yes - provided we are willing to give up the ability to retrieve the keys in sorted order. The method of choice here is hashing, which not only gives us O(1) operations (if we do it right), but also works with arbitrary objects, not just comparable ones.

The basic idea is fairly simple. Instead of using a tree to store the data, we'll use an array, often called a collection of "buckets". That gives us O(1) access to individual elements, provided that we can quickly figure out which array element, or bucket, holds the item we're interested in. To do that we use a hash function to compute a bucket number from the item itself.

The hash function works from the following observation: ultimately every object is a collection of bits. While those bits might represent strings, colors, rectangles, or players in a game, they are just bits, and we can always interpret the bits as an integer or collection of integers. So what a hash function does, informally, is take the bits that make up an object and use them to compute an integer hash code. If we can do this quickly, in O(1) time, we can use the code to select a bucket in our hash table. The hash function can compute a bucket number directly, or it can just return an arbitrary integer, in which case we can use the remainder operation (%) to calculate the bucket number: hashCode % nBuckets. (A side effect of all of this is that the objects will be scattered in the hash table, which means if we go sequentially through the table to access them, they are unlikely to be sorted.)
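In code, that last step might look something like the following (key and nBuckets are placeholders for an actual key object and table size):

    // turn an arbitrary object's hash code into a bucket index;
    // Math.abs is applied because hashCode() may return a negative number
    private int bucketFor(Object key, int nBuckets) {
        return Math.abs(key.hashCode() % nBuckets);
    }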

There is one obvious potential problem here: what if the hash computation gives us the same bucket number for two different items? In that case we have a collision, and we need a strategy for handling it when it happens. One that's easy to visualize is to use each bucket to store a list of all of the items that hash to that bucket. Then to find an item, we compute its bucket number and look for it in the list of items in that bucket. There are other strategies for handling collisions - see any good data structures book. Alas, we don't have time to cover them here.
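Here is a minimal sketch of that list-per-bucket idea for a set of strings (the class name and structure are made up for illustration; this is not how Java's HashSet is actually written):

    import java.util.*;

    // a tiny chained hash set of strings, for illustration only
    public class SimpleHashSet {
        private List<String>[] buckets;   // one list of items per bucket

        @SuppressWarnings("unchecked")
        public SimpleHashSet(int nBuckets) {
            buckets = new List[nBuckets];
            for (int i = 0; i < nBuckets; i++) {
                buckets[i] = new LinkedList<String>();
            }
        }

        // hash the item to pick its bucket
        private int bucketFor(String item) {
            return Math.abs(item.hashCode() % buckets.length);
        }

        public void add(String item) {
            List<String> bucket = buckets[bucketFor(item)];
            if (!bucket.contains(item)) {   // sets store no duplicates
                bucket.add(item);
            }
        }

        public boolean contains(String item) {
            // only the one bucket's list needs to be searched
            return buckets[bucketFor(item)].contains(item);
        }
    }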

The next issue is performance. The hash function itself, if implemented properly, gives us a hash code in O(1) time, and access to a bucket (an element of an array) is also O(1). So potentially, add, search, and remove in a hash table can be done in O(1) time. But this only works if each item is stored in a separate bucket or, at worst, each bucket holds only a small number of items. To get good performance, then, we need to be sure that the items in the set or map are scattered around so they don't all wind up in the same bucket, which would require a long search for an item even after we know its hash code.

There are two things that need to be done right to be sure the items are spread out and that each bucket has a small number of items in it.

  1. The hash function itself needs to do a good job of calculating different (i.e., unique if possible) hash codes for different items. If all items have the same hash code, or use only a small number of codes, there will be lots of collisions, and performance will suffer. Picking a good hash function is surprisingly subtle, but there are various rules of thumb about how to do it. In particular, we want to do a good job scrambling the bits in objects made of many components, and use "good" prime numbers for the calculations and for the number of buckets in the table. (There is a sketch of one common pattern just after this list.)
  2. The other issue is that there need to be "enough" buckets to store the items. If we want to store 10,000 items and there are only 10 buckets, then no matter how well the hash function spreads things out, some buckets must end up with at least 1,000 items - which means long, slow searches. Instead we want the load factor, the number of items divided by the number of buckets, to be quite low.
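As an example of point 1, here is one common pattern for combining several fields into a single hash code (the class is hypothetical, and a real class would also override equals to match):

    // hypothetical class with several components to combine into one hash code
    public class Point3D {
        private int x, y, z;

        public Point3D(int x, int y, int z) {
            this.x = x;
            this.y = y;
            this.z = z;
        }

        public int hashCode() {
            // the usual "multiply by a small prime and add the next field" pattern;
            // it mixes all three fields into the result, so similar points
            // tend to land in different buckets
            int result = 17;
            result = 31 * result + x;
            result = 31 * result + y;
            result = 31 * result + z;
            return result;
        }
    }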

If we have both a good hash function, and a low load factor, then we do indeed get O(1) typical performance for the core operations of a hash table. This is the algorithm underlying Java's HashMap and HashSet data structures, which makes these a good choice if we want fast access based on key values and don't care whether the keys are stored in any particular order.