CSE 326 Data Structures
Project 3
Counting Words

Due: Wednesday, February 16, 2000

Objectives

Overview

Word frequency counting is the first step in various text analyses used to establish authorship, automatically grade essays, perform information retrieval tasks, and automatically translate documents. Counting word frequency is therefore a vital process involved in... ah heck. It gets you using dictionaries, and it's fun!

Assignment

For this assignment, you will modify a program which counts the number of occurrences of each word in a document and outputs them in order of their rank. Currently, the program uses a binary search tree to map words to their frequencies and an unsorted linked list priority queue to order them for printing.

You will extend the BinarySearchTree class to create two subclasses, an AVL tree class and a Splay tree class. Then, you will modify the word counting program to use your splay tree or your AVL tree in place of the binary search tree (there should be a command line argument to choose among the three: -b for BST, -a for AVL, and -s for splay). Also, you should replace the unsorted linked list priority queue with your own (or one of your group member's) heap implementation.

Your splay and AVL implementations must be templated to accept any types (which support the operations mentioned in BinarySearchTree.h) as its KeyType and ValueType. However, we strongly recommend developing char to int versions of each first and testing them using the file test.cpp. When those work, add in the templating and make sure your classes work for both test.cpp and the word counting code.

Many details of the implementation and suggestions for your implementation are written as comments in the existing code, so make sure you look that code over!

Teams

You are required to work on a team of two or three on this assignment. If you have not found a team by class on Wednesday 2/9, e-mail owner-cse326@cs.washington.edu, and we'll help you match up with a group.

You may choose how to divide the work under three conditions: first, document each team member's effort in the README file; second, work together and make sure all team members understand your answers to the questions below; and third, understand at least at a high level how your team members' code is structured and how it works. Remember to test your team's code as a whole to make sure that your portions work properly together! Also, be aware that except in extreme cases when you notify us in advance of the deadline, all team members will receive the same grade for the project.

We strongly encourage working in teams of three!. Teams of two must complete the same workload as teams of three.

What we've provided

As a starting point we've provided a few files. Files that you might need to change are marked with a star (*).

Dictionary.h
This defines the abstract base class Dictionary (representing the Dictionary ADT). Your AVL and splay classes extend BinarySearchTree which itself extends Dictionary.
BinarySearchTree.h,BinarySearchTree.cpp
These define the concrete BinarySearchTree class which extends Dictionary. Your AVL and splay classes will extend this class. This file also defines the TreeIterator class (which should work unaltered on your classes) and the BSTNode class (from which you may need to extend an AVLNode class for AVL trees).
WordCounter.h,WordCounter.cpp
These define a class for counting words (WordCounter) and a class for turning a WordCounter into a frequency ranked list of words (FrequentWordList). Each of these requires that you pass in a data structure when you instantiate them (a Dictionary for WordCounter and a PriorityQueue for FrequentWordList).
*word_count.cpp
This is the main program file for the word counting program. You should change this to:
*test.cpp
This is a program file for testing your dictionary. It reads characters from stdin, inserts them into the tree, and prints the tree after each end-of-line. You can change this to use and test your splay and AVL trees.
PQueue.h, UnsLLPQueue.h, UnsLLPQueue.cpp
Files from previous projects. You should only need PQueue.h after you substitute your PriorityQueue for the UnsLLPQueue.
*Makefile
make will use this to build your program and the sort program. You will need to modify the makefile to add any new files you create. We have used a new Makefile format which requires fewer changes on your part. See the Makefile (especially if you want your code to run fast!).
AVLTree.h
This file is a sample of how you might declare your AVLTree class. Note that we declare an AVLNode class which extends BSTNode for use inside AVLTree. You need not use our AVLTree.h, but it should be a good starting point.
All provided files can be found in the project directory: /cse/courses/cse326/00wi/project3.

Input files

Any text file is a reasonable input file. We've provided a number of interesting ones from Project Gutenberg which reside in the directory /cse/courses/cse326/00wi/texts/. Please do not copy these to your directories! They take quite a bit of disk space as is. If you'd like to be able to reference them more easily, use the following command in your project directory:

ln -s /cse/courses/cse326/00wi/texts .
(Be sure to include the '.'!) You should now be able to reference them as if the directory texts were a subdirectory of your project directory.

Note that we will use bible_kingjames (the largest single file available) and usr_dict_words (a hefty, sorted-order file) as two of the test files. So, if your program does not run on these, you should try to fix it!

Handing in the code

You are required to hand in your project3 directory with all of your code. Included in this directory should be all the files listed above as well as any new files you create including your README. Make sure you turn in your heap implementation!

Note: of the files we provide to you, you should need to change only Makefile, word_count.cpp, test.cpp , and AVLTree.cpp (of course, you may create whatever files you wish). If you want to change any other files in the project, contact one of us first!

Turn your files in electronically using turnin:

% cd ~/326stuff/myproj3dir
% make full_clean
% cd ../
% turnin -c cse326=AA -p p3 myproj3dir

The README

Your README file should document your submission (though it does not by any means replace comments in the code!). It should contain the following:

Also answer the following questions (justify your answers, of course; 3-4 sentences apiece should be fine):

Additional utilities

We've provided a few utilities to help you develop and test your program. These are located in the directory project3/bin.

bin/just_words.sh

This is a script that takes a text file and converts it to an alphabetical listing of just the words in the file. Note that this uses a slightly different format for constructing words than the WordCounter code. You might use this in testing and experiments. It reads a file from standard input and writes the list of words to standard output.

Sample word counter

The executable bst_word_count is simply the compiled version of the code in the project3 directory. It uses a BST to store the words. Run bst_word_count -h for more details.

Extras (bonus)

Extra credit will add points to your score, averaging out previous poor scores or potentially increasing your project average above 100% (such a score will count toward the "extra 10%" from the score breakdown). I know there's a lot of extra credit here; that's so that you might all be able to find something that piques your interest. We encourage you to do as much as you are interested in and have time for. However, we're limiting the total extra credit points on this assignment to 15 extra points (so, you can get a maximum of 15 extra points for a maximum total of 115% on the assignment). Make sure you document any extra credit you do in your README!

For many of these extra credits, I can point you to sections of the book, other references, or ideas of my own which will make your efforts more effective. If you're interested in doing one of these, you might contact me to see if I have any references for you (wolf@cs.washington.edu).