CSE 326 Data Structures
Project 4
Analysis and Implementation of Hash Tables
due: Friday, June 6, 1997, 11:30pm
Overview
Congratulations! You've struggled through the UW Computer Science courses
and earned your degree, and you've just been hired by a hot new internet
startup company that builds web browser software. As your first task, the
boss has asked you to build a component of the history cache of their
web browser product. Specifically, you've been asked to design a dictionary
of URL strings (e.g. "http://www.yahoo.com/",
"http://cs.washington.edu/homes/fix/index.html",
etc.) that stores the addresses of the web pages most recently visted by
the web browser. Recalling the things you learned in 326,
you decide that a hash table is a great way to implement a URL dictionary.
Assignment
Your task is to design and implement a hash table for storing URL addresses.
Evaluate your hash table design in the context of an application that
takes a history file of URL strings and builds a dictionary of these strings.
These strings represent queries to your dictionary.
Your application should process these
strings in order: for each query string, it should determine whether
the string is present in the dictionary.
If the string is not present, it should be added to the dictionary.
If the string is present, the application should report that string as a
"hit".
Your hash table should use Closed Hashing only, but you must
experiment with several different design parameters:
- using 2 different hash functions
- using linear probing versus double hashing
- using different hash table sizes (at least six) for the same amount of
data (different loads).
Note that you don't have to implement rehashing, though you may
find it interesting to investigate your dictionary's performance
in the presence of rehashing.
To evaluate these design options, your application needs to report statistical
information about the expense of lookup operations. Report these
two metrics:
- the total number of URL string comparisons made plus the number of hash
function computations made.
- the most number of URL string comparisons made by a single query
The first metric gives an idea of the average performance of a hash table
by counting the most costly operations for a set of queries. The second
gives an estimate of the worst case performance for a single query.
We have provided text files in project4/history* which you can use
for your evaluation. At the top of these files is the number of unique
items in the URL list which you can use to help you size your hash
tables. The different data sets have different hit rates which may
affect the performance of your hash tables.
What to turn in
Turn in your full directory electronically before the due time,
including a compiled program. In addition, you should also turn in a
report (on paper) explaining your choices for each of the above design
parameters. Report the performance metrics for each design combination
(each choice of hash function, hash method, and table size).
Explain your results for each combination.
The report should provide a recommendation for the best design to be
used by the web browser based on the results of your study. Note that
in its final form the URL dictionary may need to resize itself and/or
delete entries dynamically. Predict the performance of your scheme
in the presence of this additional functionality.
The report should include at least one plot comparing the performance
of two hashing schemes. For example: a plot of the total string comparisons
used by linear probing versus double hashing for different table sizes.
You will be graded not only for correctness, but also for the quality
of your presentation, your choices of hash functions and table sizes,
and how well you explain your results.
Hints:
The programming portion of this assignment is very small, but the
overall workload is not as light as it may seem, so don't leave
everything for the last minute. Report anything interesting that you
come up with, especially if the numbers don't come out the way you
expected. Feel free to add other design options or performance metrics
to the above lists if you feel that you discovered something interesting.
We will consider giving a
small amount of extra credit for exceptional work which goes beyond
these basic requirements. Some suggestions include implementing
graphical output, algorithm animation, or implementing rehashing and
studying the performance as a function of the table size
(this is probably more complicated because of rehashing).
Good luck!