CSE326
Summer 2004
Assignment 6
7/30/2004

Making a Hash of Everything


Files

Due in two parts
: Electronic parts on Thursday evening, August 5 and 12, paperwork in class the following days.

Partners will be shuffled and posted in the grades Database.  Turn in only one copy of everything (except where noted otherwise), and put both partners' names on the paper.  Both will receive the same grade under normal circumstance.

The goal of this part of the project is to gather experimental data about the behavior of several hashing algorithms.  One (or more) of these will then be applied in a second part of the project.

To be honest: Deepak and I are in the process of changing our mind about the 2nd part... possibly it would in fact be a separate project (though you would continue with the same partners).

Part I:
The goal of the first week is to get an implementation of a Hash Table working, to implement several hashing algorithms, to collect data about their performance, and to present the data in some comparative and comprehensive form.  Although programming is an essential part of the proejct, the programs are not the final goal or artifact -- they are a means to an end.  That end is the analysis and report you make about the experiments you ran.

Your hash table should implement the IHashTable interface.  Hashing functions are embedded in objects which implement the IHasher interface.  See the files directory (linked above) for both of these.  They are relatively simple and straightforward.  In particular, IHashTable is much simpler than the Map interface which the Java HashMap and TreeMap classes implement.  

Although the interface doesn't require it, you should use chained hashing as the implementation technique.  The basic internal hash table should be an array.  For the chains, you may use Java List classes (or implement your own linked lists -- up to you).  You should not use any Java Map or Set classes.

The test data will all be strings.  Unless otherwise specified (for particular test files that we might supply later), the key of each data object will be the entire object, taken as a string.

Your test framework should be capable of reading text files of data.  Each line of the file is to be considered a full, separate record (unless specified otherwise for some particular data set).

Your test framework should also be capable of taking the strings from an internal object (such as an Iterator) that we will supply to you.  (For your own initial unit testing, you might consider a generating some simple random data internally.)  In the case of  [corrected 730:] The appropriate hash table size may vary with different test sets.  Besides that, experimenting with various sizes of hash tables would be a natural thing to experiment with, so make sure it is easy in your test set-up to vary the table size.

Implement a number of IHashers.  For each hasher, and each test data set (and each size of table being tested, etc.), store all the data in the table and then collect statistics about it.  The IHashTable interface defines a keyDistribution method that can be the basis for most of your analysis.   There is no special programming requirement concerning how you collect the data.  You could write it to files, or just print it out and manually transcribe it. 

After all the data is available, analyze it and write a report about the results.  In the report, you might report on which hashers worked best (and why, if possible) under which circumstances.  Include sufficient raw data in the report to justify your conclusions.  Regardless of the length of your report, prepare a nice one-page "executive summary" of the results.  This summary, and/or the report, can contain charts, graphics, tables, or anything else that might be appropriate.

One of the important decisions in designing your experiments and writing the report is how to judge the effectiveness of the algorithms.  That is, what statistics should you calculate that would be meaningful?  You don't want the report to be just a huge mass of numbers.  In your report you should start by saying which figures you consider to be important, and then focus on those.

Implement at least the following hashing algorithms (each one as a separate IHasher class):
Data sets (tentative -- we will supply):
Unix "words" file
An Internet trace log
"Random" strings (supplied as a file or Iterator)

During the first week, you should also be thinking ahead to the work of Part II.

Part II:

Under reconsideration

First week Deliverables:

Second Week Deliverables:
Have fun!