CSE 326 Data Structures
Project 4
Analysis and Implementation of Hash Tables
Code due: Thursday, March 11, 1999, 11:30pm
Documentation due: Friday, March 12, 1999, in class
Objectives
- Design hash data structures and hash functions
- Compare and evaluate probing methods
- Analyze and compare hash tables with different characteristics
Overview
Congratulations! You've struggled through the UW Computer Science courses
and earned your degree, and you've just been hired by a hot new Internet
startup company that builds web browser software. As your first task, the
boss has asked you to build a component of the history cache of their
web browser product. Specifically, you've been asked to evaluate
designs for a dictionary
of URL strings (e.g. "http://www.yahoo.com/",
"http://cs.washington.edu/homes/ladner/index.html",
etc.) that stores the addresses of the web pages most recently visited by
the web browser. Recalling the things you learned in 326,
you decide that a hash table is a great way to implement a URL dictionary.
Naturally, as part of the evaluation you will have to implement some options
and write a final report giving a comparison of the options and a final
recommendation.
Assignment
Your task is to design and implement a hash table for storing URL addresses.
Evaluate your hash table design in the context of an application that
takes a history file of URL strings and builds a dictionary of these strings.
These strings represent queries to your dictionary.
Your application should process these
strings in order: for each query string, it should determine whether
the string is present in the dictionary.
If the string is not present, it should be added to the dictionary.
If the string is present, the application should report that string as a
"hit".
Your hash table should use Closed Hashing (open addressing) only, but you must
experiment with several different design parameters:
- using 2 different hash functions
- using linear probing versus quadratic probing (see the sketch below)
- using different hash table sizes (at least ten) for the same amount of
data (different load factors).
Note that you don't have to implement rehashing, though you may
find it interesting to investigate your dictionary's performance
in the presence of rehashing.
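To make the two probing strategies concrete, here is a sketch of a probe
loop for closed hashing. It assumes the table is a vector of strings with
the empty string marking an unused slot; the names and layout are
illustrative, not required.

    #include <string>
    #include <vector>

    enum ProbeKind { LINEAR, QUADRATIC };

    // Return the slot where key lives or should be inserted, or -1 if
    // the probe sequence cycles without finding either.
    int probe(const std::vector<std::string>& table, const std::string& key,
              unsigned long home, ProbeKind kind) {
        const unsigned long m = table.size();
        for (unsigned long i = 0; i < m; ++i) {
            // Linear probing tries home, home+1, home+2, ...; quadratic
            // probing tries home, home+1, home+4, home+9, ... (all mod m).
            unsigned long offset = (kind == LINEAR) ? i : i * i;
            unsigned long slot = (home + offset) % m;
            if (table[slot].empty() || table[slot] == key)
                return (int)slot;  // empty slot (miss) or matching key (hit)
        }
        return -1;                 // every probe landed on some other key
    }

Keep in mind that quadratic probing is guaranteed to find an empty slot
only when the table size is prime and the table is at most half full,
which is worth remembering when you choose your table sizes.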
Several hash functions were discussed in class or in section: the
division method and universal hashing, which require prime table sizes,
and the multiplication method and "universal hashing without primes,"
which don't. Feel free to try out any of these hash functions, or any
other that you think up or find in a research paper.
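For example, here is a hedged sketch of the division and multiplication
methods applied to URL strings. The string is first folded into an
unsigned key with Horner's rule; the base 37 is an arbitrary illustrative
choice, and a 32-bit key keeps the fractional part accurate in the double
arithmetic of the multiplication method.

    #include <cmath>
    #include <string>

    // Fold a URL string into a 32-bit key (Horner's rule, base 37);
    // unsigned overflow simply wraps, which is harmless for hashing.
    unsigned fold(const std::string& s) {
        unsigned k = 0;
        for (size_t i = 0; i < s.size(); ++i)
            k = 37 * k + (unsigned char)s[i];
        return k;
    }

    // Division method: h(k) = k mod m; m should be prime.
    unsigned hashDivision(const std::string& s, unsigned m) {
        return fold(s) % m;
    }

    // Multiplication method: h(k) = floor(m * frac(k * A)) with
    // A = (sqrt(5) - 1) / 2; here m need not be prime.
    unsigned hashMultiplication(const std::string& s, unsigned m) {
        const double A = 0.6180339887498949;
        double x = fold(s) * A;
        x -= std::floor(x);          // keep only the fractional part
        return (unsigned)(m * x);
    }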
To evaluate these design options, your application needs to report statistical
information about the expense of lookup operations. Report these
three metrics:
- the average number of URL string comparisons made per query
- the maximum number of URL string comparisons made by a single query
- the average running time per query
The first metric gives an idea of the average performance of a hash table,
while the second gives an estimate of the worst-case performance for a single
query.
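One simple way to collect the first two metrics is a small statistics
record updated once per query; the names here are illustrative.

    // Tracks comparison counts across a whole run of queries.
    struct ProbeStats {
        long totalComparisons;   // summed over all queries
        long maxComparisons;     // worst single query
        long queries;

        ProbeStats() : totalComparisons(0), maxComparisons(0), queries(0) {}

        // Call once per query with the number of string comparisons made.
        void record(long comparisons) {
            ++queries;
            totalComparisons += comparisons;
            if (comparisons > maxComparisons)
                maxComparisons = comparisons;
        }

        double averageComparisons() const {
            return queries ? (double)totalComparisons / queries : 0.0;
        }
    };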
Experimental Methodology
Care must be taken to make sure that the numbers you generate are valid
and reflect the true performance of the hash tables. This is not
really an issue when counting the number of probes because internal program
counters can keep track of probes. However, measuring the execution
time per query takes special care.
You should use a function to measure running time. A good choice is the
standard C library function clock (declared in "time.h"), which returns
the number of clock ticks consumed since your program started. There are
also some non-standard functions available on our UNIX machines that may
be more accurate. A key issue in using clock is its very low resolution,
about 10 milliseconds; you will need to run your experiments for several
seconds to get accurate measurements.
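For instance, here is a sketch of one way to use clock, where runQueries
is a hypothetical stand-in for your own loop over the history data:

    #include <ctime>

    // Time a whole batch of queries and divide by the count, so the
    // several-second total swamps clock's 10 ms resolution.
    double secondsPerQuery(void (*runQueries)(), long numQueries) {
        clock_t start = clock();
        runQueries();                // the work being measured
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        return elapsed / numQueries;
    }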
You will also need to repeat the experiment for each of your data points.
If you are running your experiments on a multi-tasking system, your
timing measurements can be affected by other tasks running on the same
machine. How many trials should you make in order to get good data?
Which measurements should be thrown out? A common experimental approach
is to run a small number of trials, say 15, and take the median as the
representative value; taking the mean of the 15 would let outlying
measurements skew the result.
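Computing the median of the trials is straightforward; for example:

    #include <algorithm>
    #include <vector>

    // Median of a small vector of trial times (taken by value so the
    // caller's data is not reordered by the sort).
    double median(std::vector<double> trials) {
        std::sort(trials.begin(), trials.end());
        size_t n = trials.size();
        return (n % 2 == 1) ? trials[n / 2]
                            : (trials[n / 2 - 1] + trials[n / 2]) / 2.0;
    }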
To minimize interference from other tasks, you might consider running your
experiments on a single-user computer. Even on a single-user machine that
is multi-tasking, you should be careful to stop any background jobs.
Care must also be taken to avoid reading from or writing to files while
you are making measurements. Disk I/O time can easily dominate the
times you are trying to measure.
We have provided text files in project4/history* which you can use
for your evaluation. At the top of each file is the number of unique
items in the URL list, which you can use to help you size your hash
tables. The different data sets have different hit rates, which may
affect the performance of your hash tables.
Handing in the code
You are required to hand in your project directory with your implementation
of your hash functions and your hash table functions: create, insert,
and find. Your code should be well documented so that it could be easily
installed into a working application.
Turn your files in electronically using turnin:
% cd ~/326stuff/myproj4dir
% make clean
% cd ../
% turnin -c cse326=AB -p project4 myproj4dir
Handing in the report
You will also turn in a formal typewritten report of at most 5 pages
(not counting graphs). Don't feel obligated to write 5 pages, but do not
write more; if you can say what you need to in fewer than 5 pages, please do.
Your report should be written for an audience of programmers who have
completed a data structures course. It should have a title, section
headings as needed, and at least the following graphs, each with a
thorough explanation.
- A graph showing the average number of probes per find for linear probing
and quadratic probing with each of the two hash functions, at about
10 different load factors. The graph has four plots, with load factor
on the x-axis.
- A graph showing the maximum number of probes per find, with the same
four plots and load factor again on the x-axis.
- A graph showing the average running time per find, with the same
four plots and load factor again on the x-axis.
Your report should compare the various hashing approaches.
It should provide a recommendation for the best design to be
used by the web browser based on the results of your study. Note that your
study does not include resizing, but eventually resizing will have to
be added. Predict the performance of your recommendation
in the presence of this additional functionality.
Your report will be graded on the quality of your graphs, your organization,
your explanations and comparisons, and the justifications for your
recommendation.
Hints:
The programming portion of this assignment is very small, but the
overall workload is not as light as it may seem, so don't leave
everything for the last minute. Report anything interesting that you
come up with, especially if the numbers don't come out the way you
expected. Feel free to add other design options or performance metrics
to the above lists if you feel that you discovered something interesting.
It is probably a good idea not to allow the compiler to optimize your program.
In the past, we have seen optimizer bugs cause an O(1) algorithm to run in
O(lg n) time. In MSVC the optimizer is on by default in Release mode, so you may
wish to stick with Debug mode.
Good luck!