CSE 326 Data Structures
Project 3
Counting Words

Due: Wednesday, November 15, 2000

Objectives

Notice that there are a large number of objectives. If you wait until the weekend before the project is due, you'll have a difficult time. Start early!

Overview

Word frequency counting is the first step in various text analyses used to establish authorship, automatically grade essays, perform information retrieval tasks, and automatically translate documents. Counting word frequency is therefore a vital process involved in a variety of research areas. More than that, it gets you using dictionaries, and it's fun!

Assignment

For this assignment, you will modify a program which counts the number of occurrences of each word in a document and outputs them in order of their rank. Currently, the program uses a binary search tree to map words to their frequencies.

You will extend the BinarySearchTree class to create one to three subclasses, an AVL tree class, a Splay tree class, and/or a Hashtable class. Then, you will modify the word counting program to use your splay tree or your AVL tree or your hash table in place of the binary search tree (there should be a command line argument to choose among the three: -b for BST, -a for AVL, -s for splay, -t for hash table).

Your splay/AVL/hash implementations must be templated to accept any types (which support the operations mentioned in BinarySearchTree.h) as its KeyType and ValueType. However, we strongly recommend developing char to int versions of each first and testing them using the file test.cpp. When those work, add in the templating and make sure your classes work for both test.cpp and the word counting code. (Note: If you implement a hash table, justify your choice of probing method and load factor.)

Many details of the implementation and suggestions for your implementation are written as comments in the existing code, so make sure you look that code over!

You must also instrument both the BST code and whatever code you write to report statistics about the dictionary. For all types, write functions that return the number of distinct words in the dictionary and the total number of words seen in the document. For trees, write functions that compute the average depth of a tree, the deepest node in the tree (height of the root), and the average number of nodes visited per word seen. For hash tables, compute the total number of probes performed (including rehashing) and the average number of probes per word seen. Compare the dictionaries that you implement with each other and the BST dictionary. Do that by comparing running times, tree sizes, average number of probes/nodes visited, and so on.

Finally, compare word frequences for different sources and different sizes of sources. Do the top 20 words vary much between sources?

Teams

You can work individually or in teams of two or three on this assignment. Individuals must implement one of Splay, AVL, or Hash table dictionaries. Teams of two can choose any two of those. Teams of three must do all three. Note that implementing hash tables is likely to be more programmatically challenging as the two tree techniques involve extending the pre-existing BST class.

You may choose how to divide the work under three conditions: first, document each team member's effort in the README file; second, work together and make sure all team members understand your answers to the questions below; and third, understand at least at a high level how your team members' code is structured and how it works. Remember to test your team's code as a whole to make sure that your portions work properly together! Also, be aware that except in extreme cases when you notify us in advance of the deadline, all team members will receive the same grade for the project.

We strongly encourage working in teams of two or three!. Individuals must complete most of the workload of teams.

What we've provided

As a starting point we've provided a few files. Files that you might need to change are marked with a star (*).

Dictionary.h
This defines the abstract base class Dictionary (representing the Dictionary ADT). Your AVL and splay classes extend BinarySearchTree which itself extends Dictionary. Your hash table class should extend Dictionary.
BinarySearchTree.h,BinarySearchTree.cpp
These define the concrete BinarySearchTree class which extends Dictionary. Your AVL and splay classes will extend this class. This file also defines the TreeIterator class (which should work unaltered on your classes) and the BSTNode class (from which you may need to extend an AVLNode class for AVL trees).
WordCounter.h,WordCounter.cpp
These define a class for counting words (WordCounter) and a class for turning a WordCounter into a frequency ranked list of words (FrequentWordList). Each of these requires that you pass in a data structure when you instantiate them (a Dictionary for WordCounter and a PriorityQueue for FrequentWordList).
*word_count.cpp
This is the main program file for the word counting program. You should change this to:
*test.cpp
This is a program file for testing your dictionary. It reads characters from stdin, inserts them into the tree, and prints the tree after each end-of-line. You can change this to use and test your splay and AVL trees.
PQueue.h, UnsLLPQueue.cpp, UnsLLPQueue.h
A priority queue implementation.
*Makefile
make will use this to build your program and the sort program. You will need to modify the makefile to add any new files you create. We have used a new Makefile format which requires fewer changes on your part. See the Makefile (especially if you want your code to run fast!).
AVLTree.h
This file is a sample of how you might declare your AVLTree class. Note that we declare an AVLNode class which extends BSTNode for use inside AVLTree. You need not use our AVLTree.h, but it should be a good starting point.
All provided files can be found in the project directory: /cse/courses/cse326/00au/project3.

Input files

Any text file is a reasonable input file. We've provided a number of interesting ones from Project Gutenberg which reside in the directory /cse/courses/cse326/00au/texts/. Please do not copy these to your directories! They take quite a bit of disk space as is. If you'd like to be able to reference them more easily, use the following command in your project directory:

ln -s /cse/courses/cse326/00au/texts .
(Be sure to include the '.'!) You should now be able to reference them as if the directory texts were a subdirectory of your project directory.

Note that we will use bible_kingjames (the largest single file available) and usr_dict_words (a hefty, sorted-order file) as two of the test files. So, if your program does not run on these, you should try to fix it!

Handing in the code

You are required to hand in your project3 directory with all of your code. Included in this directory should be all the files listed above as well as any new files you create including your README.

Note: of the files we provide to you, you should need to change only Makefile, word_count.cpp, test.cpp, and possibly AVLTree.h to implement the additional functionality of an AVL tree, splay tree, or hash table. You may create whatever other files you wish. If you want to change any other files in the project, contact one of us first! (Note: this only applies to implementing the base functionality of an AVL tree, splay tree, and hash table. Instrumenting the code will require changes to those files.)

Turn your files in electronically using turnin:

% cd ~/326stuff/myproj3dir
% make full_clean
% cd ../
% turnin -c cse326 -p project3 myproj3dir

The README

Your README file should document your submission (though it does not by any means replace comments in the code!). It should contain the following:

Also answer the following questions (justify your answers, of course; 3-4 sentences apiece should be fine):

Plotting

For Project 3, I've add the script
Plot
to the directory
/cse/courses/cse326/00au/project3
Plot produces a plot (in a postscript file) of the word distributions calculated by your program. E.g.,
bst_word_count 100000 < some_file > out
Plot out
creates the file "out.ps". This file can be either sent directly to a printer (using lpr) or viewed using ghostview. The X axis is the rank of the word, and the Y axis the number of times it occurs.

Create plots of the results of your test files. In the discussion section of your report note if the shapes similar, or if they vary widely? Ordinary natural language text usually follows a pattern known as a Zipf curve (named after its discoverer, Harvard linquist George Zipf). For more information about Zipf curves you can start with http://www.useit.com/alertbox/zipf.html.

Sample word counter

The executable bst_word_count is simply the compiled version of the code in the project3 directory. It uses a BST to store the words. Run bst_word_count -h for more details.

Extra Credit

For one Extra Credit point, replace the unsorted linked list priority queue implementation that the project currently uses with a heap implementation of a priority queue. Document which type of heap you choose to use. Note: do the project first and do the extra credit only when you're sure you can complete the required parts on time.