CSE 326 Data Structures
Project 4
Analysis and Implementation of Hash Tables

due: Friday, June 6, 1997, 11:30pm

Overview

Congratulations! You've struggled through the UW Computer Science courses and earned your degree, and you've just been hired by a hot new internet startup company that builds web browser software. As your first task, the boss has asked you to build a component of the history cache of their web browser product. Specifically, you've been asked to design a dictionary of URL strings (e.g. "http://www.yahoo.com/", "http://cs.washington.edu/homes/fix/index.html", etc.) that stores the addresses of the web pages most recently visited by the web browser. Recalling the things you learned in 326, you decide that a hash table is a great way to implement a URL dictionary.

Assignment

Your task is to design and implement a hash table for storing URL addresses. Evaluate your hash table design in the context of an application that takes a history file of URL strings and builds a dictionary from them; each string in the file is a query to your dictionary. Your application should process the queries in order: for each query string, determine whether the string is already present in the dictionary. If it is not present, add it to the dictionary; if it is present, report that string as a "hit".
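
As a concrete starting point, here is a minimal sketch of that query-processing loop in C++. It assumes the history file is named on the command line and contains one URL per line, and it uses std::set only as a stand-in for the dictionary; in your program the dictionary would be your own hash table, and the count line at the top of the provided files (see below) would need to be skipped or used for sizing.

    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    int main(int argc, char* argv[]) {
        if (argc < 2) {
            std::cerr << "usage: " << argv[0] << " history-file\n";
            return 1;
        }
        std::ifstream history(argv[1]);
        std::set<std::string> dictionary;   // stand-in for your hash table
        std::string url;
        long hits = 0;
        while (std::getline(history, url)) {
            if (dictionary.count(url) > 0) {
                ++hits;                     // already present: report a "hit"
            } else {
                dictionary.insert(url);     // not present: add it to the dictionary
            }
        }
        std::cout << "hits: " << hits << std::endl;
        return 0;
    }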

Your hash table should use Closed Hashing only, but you must experiment with several different design parameters:

  - the hash function used to map URL strings to table indices
  - the collision resolution strategy (e.g. linear probing or double hashing)
  - the table size

Note that you don't have to implement rehashing, though you may find it interesting to investigate your dictionary's performance in the presence of rehashing.
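
To make the design space concrete, below is a minimal sketch of one possible closed-hashing table. Everything in it is an example rather than a requirement: the class and method names, the polynomial string hash, and the second hash used for double hashing are assumptions, and choosing and justifying your own hash functions and table sizes is the point of the assignment. A prime table size (of at least two) is assumed so that double hashing can reach every slot.

    #include <string>
    #include <vector>

    // Sketch of a closed-hashing (open-addressing) table of URL strings.
    class ClosedHashTable {
    public:
        explicit ClosedHashTable(size_t size) : slots_(size) {}

        // Returns true if url is already stored in the table.
        bool find(const std::string& url) const {
            size_t i;
            return probe(url, i) && !slots_[i].empty();
        }

        // Inserts url if it is absent and a free slot can be found.
        void insert(const std::string& url) {
            size_t i;
            if (probe(url, i) && slots_[i].empty()) slots_[i] = url;
        }

    private:
        // Example string hash: a simple polynomial over the characters.
        size_t hash(const std::string& s) const {
            size_t h = 0;
            for (size_t k = 0; k < s.size(); ++k)
                h = 37 * h + static_cast<unsigned char>(s[k]);
            return h;
        }

        // Example second hash for double hashing; always nonzero.
        size_t hash2(const std::string& s) const {
            return 1 + hash(s) % (slots_.size() - 1);
        }

        // Follows the probe sequence until url or an empty slot is found;
        // sets i to that slot and returns true, or returns false if the
        // whole table was probed without success.
        bool probe(const std::string& url, size_t& i) const {
            size_t start = hash(url) % slots_.size();
            size_t step  = hash2(url);      // use step = 1 for linear probing
            for (size_t t = 0; t < slots_.size(); ++t) {
                i = (start + t * step) % slots_.size();
                if (slots_[i].empty() || slots_[i] == url) return true;
            }
            return false;                   // table is full and url is absent
        }

        std::vector<std::string> slots_;    // an empty string marks an empty slot
    };

One nice property of structuring the table this way is that switching between linear probing and double hashing, or swapping in a different hash function, only touches hash(), hash2(), and the step used in probe().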

To evaluate these design options, your application needs to report statistical information about the expense of lookup operations. Report these two metrics:

  - the total number of string comparisons performed over all queries
  - the maximum number of string comparisons performed for any single query

The first metric gives an idea of the average performance of a hash table by counting the most costly operations for a set of queries. The second gives an estimate of the worst case performance for a single query.
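
One simple way to gather these numbers, assuming your table's lookup routine can report how many string comparisons it made, is to keep a running total and a running maximum. The names below are illustrative only, and the recorded values are made up purely to give a complete, compilable example.

    #include <algorithm>
    #include <iostream>

    // Bookkeeping for the two metrics: total and per-query maximum.
    struct LookupStats {
        long totalComparisons;   // first metric: total over all queries
        int  maxComparisons;     // second metric: most comparisons on one query

        LookupStats() : totalComparisons(0), maxComparisons(0) {}

        void record(int comparisonsThisQuery) {
            totalComparisons += comparisonsThisQuery;
            maxComparisons = std::max(maxComparisons, comparisonsThisQuery);
        }
    };

    int main() {
        LookupStats stats;
        stats.record(1);   // in the real application these counts would come
        stats.record(4);   // from your hash table's lookup routine
        stats.record(2);
        std::cout << "total comparisons: " << stats.totalComparisons << "\n"
                  << "max comparisons for one query: " << stats.maxComparisons << "\n";
        return 0;
    }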

We have provided text files in project4/history*, which you can use for your evaluation. At the top of each file is the number of unique items in the URL list, which you can use to help you size your hash tables. The different data sets have different hit rates, which may affect the performance of your hash tables.
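
If it helps, here is a small sketch of reading that count, assuming it sits alone on the first line of the file as described above. The rule of thumb mentioned in the comment (a prime table size around twice the expected number of items, for a load factor near one half) is a common choice for closed hashing, not a requirement; the sizing decision is yours to make and justify.

    #include <fstream>

    // Reads the unique-item count from the top of a history file.
    int readUniqueCount(const char* filename) {
        std::ifstream in(filename);
        int uniqueItems = 0;
        in >> uniqueItems;   // first line: number of unique URLs in the file
        return uniqueItems;
    }

    // e.g. size the table at roughly twice this count, rounded up to a prime,
    // for a load factor near one half.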

What to turn in

Turn in your full directory electronically before the due time, including a compiled program. In addition, you should turn in a report (on paper) explaining your choices for each of the above design parameters. Report the performance metrics for each design combination (each choice of hash function, hash method, and table size), and explain your results for each combination.

The report should provide a recommendation for the best design to be used by the web browser based on the results of your study. Note that in its final form the URL dictionary may need to resize itself and/or delete entries dynamically. Predict the performance of your scheme in the presence of this additional functionality.
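
When thinking about dynamic deletion, note that closed hashing ordinarily handles it with "lazy deletion": a deleted slot is marked rather than emptied, so probe sequences that pass through it can still reach entries stored beyond it. The enum and struct below are only a sketch of that idea (the names are illustrative), meant to help you reason about the extra probing cost that deleted slots, and eventual resizing, would introduce.

    #include <string>

    enum SlotState { EMPTY, ACTIVE, DELETED };

    struct Slot {
        std::string url;
        SlotState   state;   // a lookup stops at EMPTY but probes past DELETED

        Slot() : state(EMPTY) {}
    };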

The report should include at least one plot comparing the performance of two hashing schemes. For example: a plot of the total string comparisons used by linear probing versus double hashing for different table sizes.

You will be graded not only for correctness, but also for the quality of your presentation, your choices of hash functions and table sizes, and how well you explain your results.

Hints:

The programming portion of this assignment is very small, but the overall workload is not as light as it may seem, so don't leave everything for the last minute. Report anything interesting that you come up with, especially if the numbers don't come out the way you expected. Feel free to add other design options or performance metrics to the above lists if you feel you have discovered something interesting. We will consider giving a small amount of extra credit for exceptional work that goes beyond these basic requirements. Some suggestions include graphical output, algorithm animation, or implementing rehashing and studying performance as a function of the table size (this last option is probably more complicated, since rehashing changes the table size as the program runs).

Good luck!