Programming Assignment 2: Trees and Hash Tables

Due: 5:00pm, Friday, May 16

The purpose of this assignment is to try out Binary Search Trees and Hash Tables as index structures for database systems. To make it easy on all of us, we will use the SEND EVENTS defined in Project 1 as the records to be stored and retrieved. Only the send events! No Event_Queue and no Mail_Queues in this one. Instead there will be a binary search tree and a hash table. 

The Idea of the Assignment

The idea of the assignment is to implement both (standard) Binary Search Trees and Hash Tables (with chaining) and compute relevant statistics of each for several data sets. The records to be stored in these indexes will be just the Send Events from the previous assignment, so you won't have to develop new structures, new I/O, etc. This is supposed to make you and the TAs very happy. As stated in Project 1, an event will have the following attributes: Instead of reading the events and putting them into an ordered list, you will read them and put them into (a) your binary search tree (which does NOT have to be balanced) and (b) your hash table (which MUST use chaining and should enter the events at the END of the chain (so they end up in the same order in which they were entered in that chain). (And do the entry in O(1) time, of course.) Then you will read a query data set whose queries are just the keys (some random sequence), and you will execute the searches for each key in both the binary search tree and the hash table. The time attribute is to be used as the key for both indexes. NOTE: in either case, do not put in the duplicate time. If the time is already in the BST or HashTable, just ignore the latter duplicate. 

You must implement the binary search tree and the hash table yourselves; do not use any BST or Hash Table class you may have. It is OK to use a linked list class from either Java or C++.

Binary Search Tree Index

The binary search tree is ordered according to the time attribute. It is to be the standard binary search tree in the text. Do NOT balance it, because we want to compute some statistics of how balanced it is! The following statistics of each tree should be computed after all the data for that experiment has been entered:

  • height   
  • maximum abs(balance factor) of any node   
  • average abs(balance factor) of all nonleaf nodes

    There are also statistics on the test retrievals for the tree.

  • minimum number of nodes accessed   
  • maximum number of nodes accessed   
  • average number of nodes accessed

    These access statistics can be computed as you do the accesses.

    All BST statistics should be output to BST.txt:

    Hash Table Index

    The time will be the hash key for the hash table. The hash function should be key mod tablesize. We will give you the tablesizes to try(the Hash Table size is given as a command-line argument). Implement chaining with lists as described above. The following statistics of each hash table should be computed (after all the data have been put in the table).

  • percentage of empty bins
  • maximum length of nonempty list
  • average length of nonempty list

    The access statistics for hash tables are as follows.

  • maximum number of list nodes accessed
  • average number of list nodes accessed (include zeros for empty lists)

    All hash table statistics should be output to Hash.txt:

    Output

    In addition to the statistics, we would like you to print the hash table to Chain.txt. For each bin (sequentially), print the bin number and then print just the time of each event in its linked list, from the beginning of the list to the end. You can use linked list methods for this. The output should be in this format for ease of checking:

    Bin 0 -> 25 -> 98 -> 143
    Bin 1 -> empty
    etc.

    Testing

    As in Project 1, there are several different event files for testing your program; the sample format of a 3-event file (which should look familiar) is as follows:

    event_id     event_type     time     receiver_id     sender_id     priority     message
    1                     S              652          10                 10             1         Assignments are so easy
    2                     S              362          10                 20             1         Midterm is coming soon
    3                     S              561          20                 10             3         CSE373 is interesting

    Download sending event files: 20E    300E

    There are also query files for each event file: 20E-1, 20E-2, 300E-1, 300E-2, 300E-3, 300E-4

    For the 20-event data set, try 30 and 50 for the size of the hash table. For the 300-event data set, try 700 for the size of the hash table. You should make the hash table size a command line argument to your program.

    Grading

    1. Compilable. Any program that cannot compile will receive a low grade. 

    2. Correctness. We will run your program by "java Statistics eventfilename queryfilename HashTableSize" (or "Statistics eventfilename queryfilename HashTableSize" for C++ programs). Then your program produces BST.txt, Hash.txt and Chain.txt, which contain all the statistics required.  We may specify the HashTableSize as follows: 50 for the 20 events, and 700 for the 300 events.

    3. Clearness of code, good coding style, simpleness, etc.

    4. Readme.txt, explaining how you implement your Binary Search Tree and Hash Table, and how you compute the statistics for both of them.

    What and how to turn in

    All source files should be turned in online.

    Java programmers: Make sure your program is compatible with Java2 SDK 1.4.x. Turn in only .java source files. You must include your main method in Statistics.java. Go to http://www.cs.washington.edu/education/courses/373/03sp/homework/prj2/turnin-java.html 

    C++ programmers: Your programs are strongly suggested to be compatible with MSVC 6.0. By default, your program will be compiled by MSVC 6.0 compiler at turn-in time. If your program only works on Linux/Unix, you must include detailed information in Readme.txt showing how to compile and run your program. Turn in only .cpp and .h files. You must include your main function in Statistics.cpp.
    Go to http://www.cs.washington.edu/education/courses/373/03sp/homework/prj2/turnin-cpp.html