CSE326
Summer 2004
Assignment 6
7/30/2004
Making a Hash of Everything
Files
Due in two parts: Electronic parts on Thursday evening, August 5
and 12,
paperwork in class the following days.
Partners will be shuffled and posted in the grades Database.
Turn in only
one copy of everything (except where noted otherwise), and put both
partners' names on the paper. Both will
receive the same grade under normal circumstance.
The goal of this part of the project is to gather experimental data
about the behavior of several hashing algorithms. One (or more)
of these will then be applied in a second part of the project.
To be honest: Deepak and I are in the process of
changing our mind about the 2nd part... possibly it would in fact be a
separate project (though you would continue with the same partners).
Part I:
The goal of the first week is to get an implementation of a Hash Table
working, to implement several hashing algorithms, to collect data about
their performance, and to present the data in some comparative and
comprehensive form. Although programming is an essential part of
the proejct, the programs are not the final goal or artifact -- they
are a means to an end. That end is the analysis and report you
make about the experiments you ran.
Your hash table should implement the IHashTable interface.
Hashing functions are embedded in objects which implement the IHasher
interface. See the files directory (linked above) for both of
these. They are relatively simple and straightforward. In
particular, IHashTable is much simpler than the Map interface which the
Java HashMap and TreeMap classes implement.
Although the interface doesn't require it, you should use chained
hashing as the implementation technique. The basic internal hash
table should be an array. For the chains, you may use Java List
classes (or implement your own linked lists -- up to you). You
should not use any Java Map or Set classes.
The test data will all be strings. Unless otherwise specified
(for particular test files that we might supply later), the key of each
data object will be the entire object, taken as a string.
Your test framework should be capable of reading text files of
data. Each line of the file is to be considered a full, separate
record (unless specified otherwise for some particular data set).
Your test framework should also be capable of taking the
strings from an internal object (such as an Iterator) that we will
supply to you. (For your own initial unit testing, you might
consider a generating some simple random data internally.) In
the
case of [corrected 730:] The
appropriate hash table size may vary with different test sets.
Besides that, experimenting with various sizes of hash tables would be
a natural thing to experiment with, so make sure it is easy in your
test set-up to vary the table size.
Implement a number of IHashers. For each hasher, and each test
data set (and each size of table being tested, etc.), store all the
data in the table and then collect statistics about it. The
IHashTable interface defines a keyDistribution method that can be the
basis for most of your analysis. There is no special programming
requirement concerning how you collect the data. You could write
it to files, or just print it out and manually transcribe it.
After all the data is available, analyze it and write a report about
the results. In the report, you might report on which hashers
worked best (and why, if possible) under which circumstances.
Include sufficient raw data in the report to justify your
conclusions. Regardless of the length of your report, prepare a
nice one-page "executive summary" of the results. This summary,
and/or the report, can contain charts, graphics, tables, or anything
else that might be appropriate.
One of the important decisions in designing your experiments and
writing the report is how to judge the effectiveness of the
algorithms. That is, what statistics should you calculate that
would be meaningful? You don't want the report to be just a huge
mass of numbers. In your report you should start by saying which
figures you consider to be important, and then focus on those.
Implement at least the following hashing algorithms (each one as a
separate IHasher class):
- Sum-of-all-characters
- Sum-of-first-five-characters
- Powers-of-37 (see textbook)
- Powers-of-37, on every third character: 0-3-6-9 etc.
- The internal Java hashCode() (mod table size, of course)
- Your own! Make up at least one original algorithm. If
possible, target it to a specific application (for example, Word
Finder, or another word game, or names in a phone book, or ... it would
be best if you came up with your own!)
Data sets (tentative -- we will supply):
Unix "words" file
An Internet trace log
"Random" strings (supplied as a file or Iterator)
During the first week, you should also be thinking ahead to the work of
Part II.
Part II:
Under reconsideration
First week Deliverables:
- Code: all code you wrote to implement the hash table. You
don't have to turn in any code that was used only to collect data.
- The report as described above.
Second Week Deliverables:
Have fun!