CSE 373
Project 3
Analysis and Implementation of Hash Tables

Code due: Thursday, December 5, 11:30pm
Documentation due: Friday, December 6, in class

Objectives

Overview

Congratulations! You've made it through UW and earned your degree, and you've just been hired by a struggling Internet startup company that builds web browser software. As your first task, the boss has asked you to build a component of the history cache of their web browser product. Specifically, you've been asked to evaluate designs a dictionary of URL strings (e.g. "http://www.yahoo.com/", "http://cs.washington.edu/homes/ladner/index.html", etc.) that stores the addresses of the web pages most recently visited by the web browser. Recalling the things you learned in 373, you decide that a hash table is a great way to implement a URL dictionary. Naturally, as part of the evaluation you will have to implement some options and write a final report giving a comparison of the options and a final recommendation.

Assignment

Your task is to design and implement a hash table for storing URL addresses. Evaluate your hash table design in the context of an application that takes a history file of URL strings and builds a dictionary of these strings. These strings represent queries to your dictionary. Your application should process these strings in order: for each query string, it should determine whether the string is present in the dictionary. If the string is not present, it should be added to the dictionary. If the string is present, the application should report that string as a "hit".

Your hash table should use hashing with open addressing only, but you must experiment with several different design parameters:

Note that you don't have to implement rehashing, though you may find it interesting to investigate your dictionary's performance in the presence of rehashing.

There are several hash functions that we discussed in class for strings.    The two hash functions for strings we talked about in class involve using hash table sizes that are prime numbers.  There is one method that does not require prime size hash tables. This method is a form of universal hashing, which has some nice provable properties. This form of universal hashing does not require prime number hash table sizes.  For a character x, define #(c) to be the numerical value of c.  For a string x = a1a2...an  define #(x) = #(a1) + #(a2)256 + #(a3)2562 + ... + #(an)256n-1.  Let Hsize be any number, not just a prime number and let k be a large odd number such that k*Hsize is still integer size. For each a and b  in the range 0 < a, b < k*Hsize we define the hash function on strings.

ha,b(x) = ((a #(x) + b ) mod k*Hsize) / k

This will give a hash value from 0 to Hsize - 1.  Naturally, you should use Horner's rule to evaluate this hash function, taking the mod k*Hsize after each operation.

To evaluate these design options, your application needs to report statistical information about the expense of lookup operations. Report these three metrics:

The first metric gives an idea of the average performance of a hash table, while the second gives an estimate of the worst case performance for a single query.

The URL Files

There are  six files that contain URL that you should use for your project.  The files each begin with an integer which is the count of how many URLs are in the files.  The files each contain slightly more than 1,000 URL.  If you want to use a large set you can concatenate
them together, removing the leading integer, or you can find other sources of URLs.

 history1history2history3 , history4 , history5 , history6
 

Experimental Methodology

Care must be taken to make sure that the numbers you generate are valid and reflect the true performance of the hash tables. This is not really an issue when counting the number of probes because internal program counters can keep track of probes. However, measuring the execution time per query takes special care. You should use a function to measure the wall clock time. There are standard ones in C++ and Java.  A key issue in using any clock is that the have millesecond resolution. You will need to run your experiments for several seconds to get accurate measurements.  In C++ you'll want to use the clock routine with header file time.h found in the standard C Library,

http://www.cplusplus.com/ref/ctime/ .

In Java you will want to use the method System.currentTimeMillis() found in the Sun Java Library.

 http://java.sun.com/j2se/1.3/docs/api/java/lang/System.html

 
You will discover the need to make repeated experiments for each of your data points. If you are running your experiment on a multi-tasking system, then your timing measurement can be affected by other tasks running on the same machine. How many trials should you make in order to get good data? What data should be thrown out? A common experimental approach is to make a small number of trials like 15 and take the median as indicative of the correct value. Taking the average of 15 would include bad data. To minimize interference from other tasks you might consider running your experiments on a single user computer. Even on a single user machine that is also multi-tasking you should be careful to stop any background jobs. Care must also be taken to avoid reading from or writing to files while you are making measurements. Disk I/0 time can easily dominate the times you are trying to measure.

Handing in the code

You are required to hand in your project directory with your implementation of your hash functions and your hash table functions: create, insert, and find. Your code should be well commented to explain in English what it does.  A README file should explain how your code can be intalled into a working application.  Note: we are not requiring a program that we can run directly, but if you do provid one, then explain how to use it in the README file.   The Turn-in will be announced as we approach the deadline.

Handing in the report

You will also turn in a formal type written report of a maximum of 4 pages (not counting graphs). Don't feel obligated to write 4 pages, but do not write more. If you can say what you have to in less than 4 pages please do. Your report should be written to an audience of programmers at the level that have completed a data structures course. Your report should have a title, section headings as needed, and contain at least the following graphs each with thorough explanations.

Your report should compare the various hashing approaches. It should provide a recommendation for the best design to be used by the web browser based on the results of your study. Note that your study does not include resizing, but eventually resizing will have to be added. Predict the performance of your recommendation in the presence of this additional functionality.

Your report will be graded on the quality of your graphs, your organization, your explanations and comparisons, and the justifications for your recommendation.

Hints:

The programming portion of this assignment is very small, but the overall workload is not as light as it may seem, so don't leave everything for the last minute. Report anything interesting that you come up with, especially if the numbers don't come out the way you expected. Feel free to add other design options or performance metrics to the above lists if you feel that you discovered something interesting.

Good luck!