Your hash table should use hashing with open addressing only, but you must experiment with several different design parameters:
There are several hash functions that we discussed in class for strings. The two hash functions for strings we talked about in class involve using hash table sizes that are prime numbers. There is one method that does not require prime size hash tables. This method is a form of universal hashing, which has some nice provable properties. This form of universal hashing does not require prime number hash table sizes. For a character x, define #(c) to be the numerical value of c. For a string x = a1a2...an define #(x) = #(a1) + #(a2)256 + #(a3)2562 + ... + #(an)256n-1. Let Hsize be any number, not just a prime number and let k be a large odd number such that k*Hsize is still integer size. For each a and b in the range 0 < a, b < k*Hsize we define the hash function on strings.
ha,b(x) = ((a #(x) + b ) mod k*Hsize) / k
This will give a hash value from 0 to Hsize - 1. Naturally, you should use Horner's rule to evaluate this hash function, taking the mod k*Hsize after each operation.
To evaluate these design options, your application needs to report statistical information about the expense of lookup operations. Report these three metrics:
The URL Files
There are six files that contain URL that you should use for your
project. The files each begin with an integer which is the count
of how many URLs are in the files. The files each contain slightly
more than 1,000 URL. If you want to use a large set you can concatenate
them together, removing the leading integer, or you can find other
sources of URLs.
history1 , history2
, history3 , history4
, history5 , history6
Experimental Methodology
Care must be taken to make sure that the numbers you generate are valid and reflect the true performance of the hash tables. This is not really an issue when counting the number of probes because internal program counters can keep track of probes. However, measuring the execution time per query takes special care. You should use a function to measure the wall clock time. There are standard ones in C++ and Java. A key issue in using any clock is that the have millesecond resolution. You will need to run your experiments for several seconds to get accurate measurements. In C++ you'll want to use the clock routine with header file time.h found in the standard C Library,
http://www.cplusplus.com/ref/ctime/ .
In Java you will want to use the method System.currentTimeMillis() found in the Sun Java Library.
http://java.sun.com/j2se/1.3/docs/api/java/lang/System.html
You will discover the need to make repeated experiments for each of
your data points. If you are running your experiment on a multi-tasking
system, then your timing measurement can be affected by other tasks running
on the same machine. How many trials should you make in order to get good
data? What data should be thrown out? A common experimental approach is
to make a small number of trials like 15 and take the median as indicative
of the correct value. Taking the average of 15 would include bad data.
To minimize interference from other tasks you might consider running your
experiments on a single user computer. Even on a single user machine that
is also multi-tasking you should be careful to stop any background jobs.
Care must also be taken to avoid reading from or writing to files while
you are making measurements. Disk I/0 time can easily dominate the times
you are trying to measure.
Handing in the code
You are required to hand in your project directory with your implementation of your hash functions and your hash table functions: create, insert, and find. Your code should be well commented to explain in English what it does. A README file should explain how your code can be intalled into a working application. Note: we are not requiring a program that we can run directly, but if you do provid one, then explain how to use it in the README file. The Turn-in will be announced as we approach the deadline.
Handing in the report
You will also turn in a formal type written report of a maximum of 4 pages (not counting graphs). Don't feel obligated to write 4 pages, but do not write more. If you can say what you have to in less than 4 pages please do. Your report should be written to an audience of programmers at the level that have completed a data structures course. Your report should have a title, section headings as needed, and contain at least the following graphs each with thorough explanations.
Your report will be graded on the quality of your graphs, your organization, your explanations and comparisons, and the justifications for your recommendation.
Good luck!