CSE 326 Data Structures
Project 4
Analysis and Implementation of Hash Tables
Code due: Thursday, March 11, 1999, 11:30pm
Documentation due: Friday, March 12, 1999, in class
Objectives
- Design hash data structures and hash functions
- Compare and evaluate probing methods
- Analyze and compare hash tables with different characteristics
Overview
Congratulations! You've struggled through the UW Computer Science courses
and earned your degree, and you've just been hired by a hot new Internet
startup company that builds web browser software. As your first task, the
boss has asked you to build a component of the history cache of their
web browser product. Specifically, you've been asked to evaluate
designs for a dictionary
of URL strings (e.g. "http://www.yahoo.com/",
"http://cs.washington.edu/homes/ladner/index.html",
etc.) that stores the addresses of the web pages most recently visited by
the web browser. Recalling the things you learned in 326,
you decide that a hash table is a great way to implement a URL dictionary.
Naturally, as part of the evaluation you will have to implement some options
and write a final report giving a comparison of the options and a final
recommendation.
Assignment
Your task is to design and implement a hash table for storing URL addresses.
Evaluate your hash table design in the context of an application that
takes a history file of URL strings and builds a dictionary of these strings.
These strings represent queries to your dictionary.
Your application should process these
strings in order: for each query string, it should determine whether
the string is present in the dictionary.
If the string is not present, it should be added to the dictionary.
If the string is present, the application should report that string as a
"hit".
Your hash table should use Closed Hashing (open addressing) only, but you must
experiment with several different design parameters:
- using 2 different hash functions
- using linear probing versus quadratic probing (see the sketch below)
- using different hash table sizes (at least ten) for the same amount of
data (different load factors).
Note that you don't have to implement rehashing, though you may
find it interesting to investigate your dictionary's performance
in the presence of rehashing.
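To make the two probing strategies concrete, here is a sketch of a probe
loop for closed hashing. It assumes the table is a vector of strings with
the empty string marking an unused slot; the names and layout are
illustrative, not required.

    #include <string>
    #include <vector>

    enum ProbeKind { LINEAR, QUADRATIC };

    // Return the slot where key lives or should be inserted, or -1 if
    // the probe sequence cycles without finding either.
    int probe(const std::vector<std::string>& table, const std::string& key,
              unsigned long home, ProbeKind kind) {
        const unsigned long m = table.size();
        for (unsigned long i = 0; i < m; ++i) {
            // Linear probing tries home, home+1, home+2, ...; quadratic
            // probing tries home, home+1, home+4, home+9, ... (all mod m).
            unsigned long offset = (kind == LINEAR) ? i : i * i;
            unsigned long slot = (home + offset) % m;
            if (table[slot].empty() || table[slot] == key)
                return (int)slot;  // empty slot (miss) or matching key (hit)
        }
        return -1;                 // every probe landed on some other key
    }

Keep in mind that quadratic probing is guaranteed to find an empty slot
only when the table size is prime and the table is at most half full,
which is worth remembering when you choose your table sizes.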
Several hash functions were discussed in class or in section: the
division method and universal hashing, which require prime table sizes,
and the multiplication method and "universal hashing without primes,"
which don't. Feel free to try out any of these hash functions, or any
other that you think up or find in a research paper.
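For example, here is a hedged sketch of the division and multiplication
methods applied to URL strings. The string is first folded into an
unsigned key with Horner's rule; the base 37 is an arbitrary illustrative
choice, and a 32-bit key keeps the fractional part accurate in the double
arithmetic of the multiplication method.

    #include <cmath>
    #include <string>

    // Fold a URL string into a 32-bit key (Horner's rule, base 37);
    // unsigned overflow simply wraps, which is harmless for hashing.
    unsigned fold(const std::string& s) {
        unsigned k = 0;
        for (size_t i = 0; i < s.size(); ++i)
            k = 37 * k + (unsigned char)s[i];
        return k;
    }

    // Division method: h(k) = k mod m; m should be prime.
    unsigned hashDivision(const std::string& s, unsigned m) {
        return fold(s) % m;
    }

    // Multiplication method: h(k) = floor(m * frac(k * A)) with
    // A = (sqrt(5) - 1) / 2; here m need not be prime.
    unsigned hashMultiplication(const std::string& s, unsigned m) {
        const double A = 0.6180339887498949;
        double x = fold(s) * A;
        x -= std::floor(x);          // keep only the fractional part
        return (unsigned)(m * x);
    }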
To evaluate these design options, your application needs to report statistical
information about the expense of lookup operations. Report these
three metrics:
- the average number of URL string comparisons made per query
- the maximum number of URL string comparisons made by a single query
- the average running time per query
The first metric gives an idea of the average performance of a hash table,
while the second gives an estimate of the worst-case performance for a single
query.
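One simple way to collect the first two metrics is a small statistics
record updated once per query; the names here are illustrative.

    // Tracks comparison counts across a whole run of queries.
    struct ProbeStats {
        long totalComparisons;   // summed over all queries
        long maxComparisons;     // worst single query
        long queries;

        ProbeStats() : totalComparisons(0), maxComparisons(0), queries(0) {}

        // Call once per query with the number of string comparisons made.
        void record(long comparisons) {
            ++queries;
            totalComparisons += comparisons;
            if (comparisons > maxComparisons)
                maxComparisons = comparisons;
        }

        double averageComparisons() const {
            return queries ? (double)totalComparisons / queries : 0.0;
        }
    };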
Experimental Methodology
Care must be taken to make sure that the numbers you generate are valid
and reflect the true performance of the hash tables. This is not
really an issue when counting the number of probes because internal program
counters can keep track of probes. However, measuring the execution
time per query takes special care.
You should use a function to measure running time. A good choice is the
standard C library function clock (declared in "time.h"), which returns
the number of clock ticks consumed since your program started. There are
also some non-standard functions available on our UNIX machines that may
be more accurate. A key issue in using clock is its very low resolution,
about 10 milliseconds; you will need to run your experiments for several
seconds to get accurate measurements.
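For instance, here is a sketch of one way to use clock, where runQueries
is a hypothetical stand-in for your own loop over the history data:

    #include <ctime>

    // Time a whole batch of queries and divide by the count, so the
    // several-second total swamps clock's 10 ms resolution.
    double secondsPerQuery(void (*runQueries)(), long numQueries) {
        clock_t start = clock();
        runQueries();                // the work being measured
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        return elapsed / numQueries;
    }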
You will also need to repeat the experiment for each of your data points.
If you are running your experiments on a multi-tasking system, your
timing measurements can be affected by other tasks running on the same
machine. How many trials should you make in order to get good data?
Which measurements should be thrown out? A common experimental approach
is to run a small number of trials, say 15, and take the median as the
representative value; taking the mean of the 15 would let outlying
measurements skew the result.
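Computing the median of the trials is straightforward; for example:

    #include <algorithm>
    #include <vector>

    // Median of a small vector of trial times (taken by value so the
    // caller's data is not reordered by the sort).
    double median(std::vector<double> trials) {
        std::sort(trials.begin(), trials.end());
        size_t n = trials.size();
        return (n % 2 == 1) ? trials[n / 2]
                            : (trials[n / 2 - 1] + trials[n / 2]) / 2.0;
    }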
To minimize interference from other tasks, you might consider running your
experiments on a single-user computer. Even on a single-user machine that
is multi-tasking, you should be careful to stop any background jobs.
Care must also be taken to avoid reading from or writing to files while
you are making measurements. Disk I/O time can easily dominate the
times you are trying to measure.
We have provided text files in project4/history* which you can use
for your evaluation. At the top of each file is the number of unique
items in the URL list, which you can use to help you size your hash
tables. The different data sets have different hit rates, which may
affect the performance of your hash tables.
Handing in the code
You are required to hand in your project directory with your implementation
of your hash functions and your hash table functions: create, insert,
and find. Your code should be well documented so that it could be easily
installed into a working application.
Turn your files in electronically using turnin:
% cd ~/326stuff/myproj4dir
% make clean
% cd ../
% turnin -c cse326=AB -p project4 myproj4dir
Handing in the report
You will also turn in a formal typewritten report of at most 5 pages
(not counting graphs). Don't feel obligated to write 5 pages, but do not
write more; if you can say what you need to in fewer than 5 pages, please do.
Your report should be written for an audience of programmers who have
completed a data structures course. It should have a title, section
headings as needed, and at least the following graphs, each with a
thorough explanation.
- A graph showing the average number of probes per find for linear probing
and quadratic probing with each of the two hash functions, at about
10 different load factors. The graph has four plots, with load factor
on the x-axis.
- A graph showing the maximum number of probes per find, with the same
four plots and load factor again on the x-axis.
- A graph showing the average running time per find, with the same
four plots and load factor again on the x-axis.
Your report should compare the various hashing approaches.
It should provide a recommendation for the best design to be
used by the web browser based on the results of your study. Note that your
study does not include resizing, but eventually resizing will have to
be added. Predict the performance of your recommendation
in the presence of this additional functionality.
Your report will be graded on the quality of your graphs, your organization,
your explanations and comparisons, and the justifications for your
recommendation.
Hints:
The programming portion of this assignment is very small, but the
overall workload is not as light as it may seem, so don't leave
everything for the last minute. Report anything interesting that you
come up with, especially if the numbers don't come out the way you
expected. Feel free to add other design options or performance metrics
to the above lists if you feel that you discovered something interesting.
It is probably a good idea not to allow the compiler to optimize your program.
In the past, we have seen optimizer bugs cause an O(1) algorithm to run in
O(lg n) time. In MSVC the optimizer is on by default in Release mode, so you may
wish to stick with Debug mode.
Good luck!