For this assignment, you will modify a program which counts the number of occurrences of each word in a document and outputs them in order of their rank. Currently, the program uses a binary search tree to map words to their frequencies.
You will extend the BinarySearchTree class to create one to three subclasses, an AVL tree class, a Splay tree class, and/or a Hashtable class. Then, you will modify the word counting program to use your splay tree or your AVL tree or your hash table in place of the binary search tree (there should be a command line argument to choose among the three: -b for BST, -a for AVL, -s for splay, -t for hash table).
Your splay/AVL/hash implementations must be templated to accept any types (which support the operations mentioned in BinarySearchTree.h) as its KeyType and ValueType. However, we strongly recommend developing char to int versions of each first and testing them using the file test.cpp. When those work, add in the templating and make sure your classes work for both test.cpp and the word counting code. (Note: If you implement a hash table, justify your choice of probing method and load factor.)
Many details of the implementation and suggestions for your implementation are written as comments in the existing code, so make sure you look that code over!
You must also instrument both the BST code and whatever code you write to report statistics about the dictionary. For all types, write functions that return the number of distinct words in the dictionary and the total number of words seen in the document. For trees, write functions that compute the average depth of a tree, the deepest node in the tree (height of the root), and the average number of nodes visited per word seen. For hash tables, compute the total number of probes performed (including rehashing) and the average number of probes per word seen. Compare the dictionaries that you implement with each other and the BST dictionary. Do that by comparing running times, tree sizes, average number of probes/nodes visited, and so on.
Finally, compare word frequences for different sources and different sizes of sources. Do the top 20 words vary much between sources?
Teams
You can work individually or in teams of two or three on this assignment. Individuals must implement one of Splay, AVL, or Hash table dictionaries. Teams of two can choose any two of those. Teams of three must do all three. Note that implementing hash tables is likely to be more programmatically challenging as the two tree techniques involve extending the pre-existing BST class.
You may choose how to divide the work under three conditions: first, document each team member's effort in the README file; second, work together and make sure all team members understand your answers to the questions below; and third, understand at least at a high level how your team members' code is structured and how it works. Remember to test your team's code as a whole to make sure that your portions work properly together! Also, be aware that except in extreme cases when you notify us in advance of the deadline, all team members will receive the same grade for the project.
We strongly encourage working in teams of two or three!. Individuals must complete most of the workload of teams.
What we've provided
As a starting point we've provided a few files. Files that you might need to change are marked with a star (*).
make
will use this to build your program and the
sort program. You will need to modify the makefile to add any new
files you create. We have used a new Makefile format which
requires fewer changes on your part. See the
Makefile (especially if you want your code to run fast!).Input files
Any text file is a reasonable input file. We've provided a number of interesting ones from Project Gutenberg which reside in the directory /cse/courses/cse326/00au/texts/. Please do not copy these to your directories! They take quite a bit of disk space as is. If you'd like to be able to reference them more easily, use the following command in your project directory:
ln -s /cse/courses/cse326/00au/texts
.
(Be sure to include the '.'!) You should now be able to reference them as if the directory
texts
were a subdirectory of your project directory.
Note that we will use bible_kingjames (the largest single file available) and usr_dict_words (a hefty, sorted-order file) as two of the test files. So, if your program does not run on these, you should try to fix it!
Handing in the code
You are required to hand in your project3 directory with all of your code. Included in this directory should be all the files listed above as well as any new files you create including your README.
Note: of the files we provide to you, you should need to change only Makefile, word_count.cpp, test.cpp, and possibly AVLTree.h to implement the additional functionality of an AVL tree, splay tree, or hash table. You may create whatever other files you wish. If you want to change any other files in the project, contact one of us first! (Note: this only applies to implementing the base functionality of an AVL tree, splay tree, and hash table. Instrumenting the code will require changes to those files.)
Turn your files in electronically using turnin
:
% cd ~/326stuff/myproj3dir % make full_clean % cd ../ % turnin -c cse326 -p project3 myproj3dir
The README
Your README file should document your submission (though it does not by any means replace comments in the code!). It should contain the following:
Plotting
For Project 3, I've add the script
Plot
to the directory
/cse/courses/cse326/00au/project3
Plot produces a plot (in a postscript file) of the word distributions
calculated by your program. E.g.,
bst_word_count 100000 < some_file > out
Plot out
creates the file "out.ps". This file can be either sent directly to a
printer (using lpr) or viewed using ghostview. The X axis is the rank
of the word, and the Y axis the number of times it occurs.
Create plots of the results of your test files. In the discussion section of your report note if the shapes similar, or if they vary widely? Ordinary natural language text usually follows a pattern known as a Zipf curve (named after its discoverer, Harvard linquist George Zipf). For more information about Zipf curves you can start with http://www.useit.com/alertbox/zipf.html.
bst_word_count -h
for more details.