
Hashing - Hash Function

Complete the Reading Quiz by 3:00pm before lecture.

We’ve explored several ways to implement the Set and Map abstract data types. We organized the data into a Binary Search Tree, and then introduced self-balancing invariants to yield B-Trees and LLRB Trees.

What are the limitations we place on the types of the objects stored in a search tree?

Items in a search tree need to be comparable. When we decide where a new item goes in a BST, we have to answer the question, “Is the item less-than or greater-than the root?” For some objects, this question may make no sense.

What if we want to store in a BinarySearchTree some of the Java Objects we’ve seen in homeworks, such as DefaultTerm or PriorityNode<T>? We do not have a good way to compare them directly. One way to make them comparable is to define a Comparator with certain rules (e.g., compare by priority or weight). However, it is not always possible to define a total ordering of the items in our tree; for example, if two PriorityNode<T>s have the same priority, should we use the value to break the tie? What if the values aren’t Comparable?
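The tie problem can be seen in a minimal sketch. TiedNode below is a hypothetical stand-in for a class like PriorityNode<T>; its field names are assumptions, not the homework's actual definition.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical stand-in for a class like PriorityNode<T>; the fields are assumptions.
class TiedNode {
    final Object value;  // deliberately not Comparable
    final int priority;

    TiedNode(Object value, int priority) {
        this.value = value;
        this.priority = priority;
    }
}

class ComparatorTieDemo {
    // Orders nodes by priority only; ties compare as equal.
    static final Comparator<TiedNode> BY_PRIORITY = Comparator.comparingInt(n -> n.priority);

    public static void main(String[] args) {
        TiedNode a = new TiedNode(new Object(), 5);
        TiedNode b = new TiedNode(new Object(), 5);

        System.out.println(BY_PRIORITY.compare(a, b)); // 0: the comparator cannot tell them apart

        // TreeSet is a search tree that relies entirely on the comparator.
        TreeSet<TiedNode> tree = new TreeSet<>(BY_PRIORITY);
        tree.add(a);
        System.out.println(tree.add(b)); // false: b is treated as a duplicate of a
    }
}
```

Because the comparator reports a tie, the tree treats two distinct items as the same item and silently drops the second one.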

A completely different approach would be to convert Objects of any size and number of fields to a single, fixed-size integer using some hash function. We know how to compare integers!

Hash Function

A hash function is any function that maps data of arbitrary size to fixed-size integers. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. It is possible for a hash function to map two different items to the same hash value; this is known as a collision.
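As an illustration, here is a minimal sketch of a polynomial hash for strings. SimpleStringHash is a hypothetical helper written for this example; it uses the same formula that Java documents for String.hashCode(), so the two should agree.

```java
// A minimal sketch of a string hash function (hypothetical helper for illustration).
class SimpleStringHash {
    // Maps a string of any length to a fixed-size, 32-bit int.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            // int arithmetic wraps around on overflow, so the result always fits in 32 bits.
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("Aa")); // 2112
        System.out.println(hash("BB")); // 2112: two different strings, same hash (a collision)
        System.out.println(hash("Aa") == "Aa".hashCode()); // true
    }
}
```

"Aa" and "BB" collide because 65 * 31 + 97 and 66 * 31 + 66 both equal 2112.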

Properties of a hash function:

  1. Determinism: For a given input value, it must always generate the same hash value.
    Why is determinism important?

    If a hash function is not deterministic, then the same data could produce different hash values at different times. If the programmer uses the hash value to represent the original item, the differing values would falsely imply that the items are different when in fact they are the same.

  2. Efficiency: A hash function should take a reasonable amount of time to run.

  3. Uniformity: A good hash function should map the expected inputs as evenly as possible over its output range.
    Why is uniformity important?

    Uniformity reduces the probability that different items map to the same number (i.e., it reduces the likelihood of a collision). This lets us tell apart as many different Objects as possible by their hash values.
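To make the properties above concrete, the following sketch compares two hypothetical hash functions on a small, hand-picked word list. Both are deterministic and fast, but only one spreads these inputs out. The class and method names are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

class HashQualityDemo {
    // Poor uniformity: every word that starts with the same letter collides.
    static int firstCharHash(String s) {
        return s.isEmpty() ? 0 : s.charAt(0);
    }

    // Better uniformity: every character and its position affects the result.
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Counts how many distinct hash values the given function produces for the words.
    static int distinctHashes(List<String> words, ToIntFunction<String> hashFn) {
        Set<Integer> seen = new HashSet<>();
        for (String w : words) {
            seen.add(hashFn.applyAsInt(w));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "car", "cab", "can", "cap", "cow");
        System.out.println(distinctHashes(words, HashQualityDemo::firstCharHash));  // 1: all six words collide
        System.out.println(distinctHashes(words, HashQualityDemo::polynomialHash)); // 6: no collisions on this input
        // Deterministic: hashing the same input twice gives the same value.
        System.out.println(polynomialHash("cat") == polynomialHash("cat")); // true
    }
}
```

The word list is chosen to make the difference obvious; on other inputs the polynomial hash can still collide, only less often.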

Given a good hash function, we could efficiently hash complex Objects such as PriorityNode<T> to their hash values and use these hash values to decide where each item goes when inserting into a BST.
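A rough sketch of this idea follows. DemoNode is a hypothetical stand-in for PriorityNode<T> (its fields are assumptions), and java.util.TreeSet plays the role of the search tree. The sketch ignores collisions: two distinct nodes with the same hash would be treated as duplicates.

```java
import java.util.Comparator;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical stand-in for a class like PriorityNode<T>; the field names are assumptions.
class DemoNode<T> {
    final T value;
    final double priority;

    DemoNode(T value, double priority) {
        this.value = value;
        this.priority = priority;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DemoNode)) return false;
        DemoNode<?> other = (DemoNode<?>) o;
        return Double.compare(priority, other.priority) == 0
                && Objects.equals(value, other.value);
    }

    // Combines both fields into one int; equal nodes always get equal hashes.
    @Override
    public int hashCode() {
        return Objects.hash(value, priority);
    }
}

class HashOrderedTreeDemo {
    public static void main(String[] args) {
        // Order nodes by hash value only, since the nodes themselves are not Comparable.
        Comparator<DemoNode<String>> byHash = Comparator.comparingInt(DemoNode::hashCode);
        TreeSet<DemoNode<String>> tree = new TreeSet<>(byHash);

        tree.add(new DemoNode<>("a", 1.0));
        tree.add(new DemoNode<>("b", 1.0)); // same priority, still orderable by hash

        System.out.println(tree.size());                             // 2
        System.out.println(tree.contains(new DemoNode<>("a", 1.0))); // true
    }
}
```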

Reading Quiz