University of Washington

CSE 326, Data Structures

Project #2: Word Counting

Due: Friday, February, 16, 2001, 1:30pm

Winter 2001 February 8, 2001

Donald Chinn

Word counting is used in all sorts of applications, including text analysis, creating indexes, and even cryptography. In this programming assignment, you will take a text file and compute what words appear in that file and how many times they appear.

The objectives in this assignment are: (1) to increase your familiarity with binary search trees and AVL trees, (2) learn how to conduct experiments to compare the performance of data structures and their implementations, (3) understand how different inputs can affect performance, and (4) reading and modifying other people’s code.

In this assignment, you will create a dictionary, inserting each string into it as each is read from the file. The dictionary will be implemented in three different ways: as a binary search tree, as an AVL tree that uses a recursive insert algorithm, and as an AVL tree with an iterative insert algorithm. Then you will write the strings in sorted order, along with their counts, to an output file. You will run the code on input files of different sizes and different contents to see which data structures seem to perform better in practice.

The Assignment

Here is a step-by-step list of things you need to do for this assignment, in the approximate order that you should do them:

Modify the BST and AVL templates to have an integer counter in each node, representing the number of times that string has appeared in the text file. (You will need to identify places, such as the constructors for the trees, where code needs to be changed.)

For both the insert (in BST and AVL) and insert_iter functions (in AVL), if the insert finds the item already there, add 1 to the counter.

Create a function for the AvlTree template called insert_iter , which inserts an item, but does so iteratively rather than recursively.

Create a function that prints out to an output stream in sorted order the items and their counts in your dictionary.

Run the program using all three algorithms on a variety of input files. "Variety" means files of different sizes (enough range to be able to plot the results on a graph to detect any trends in performance) and different types (standard text, C++ programs, poetry, etc.).

Write a short report describing what types of files you performed your experiments on and what conclusions you made.

What Code is Provided

We have modified code from Weiss’s web site and added a driver program as follows:

BinarySearchTree.{h,cpp} — added a PrintTreeDepth function, which allows you to visualize how the tree is organized.

AvlTree.{h,cpp} — added a PrintTreeDepth as in the BST code. This will be useful when you are debugging your implementation of iterative insertion.

mystring.h, string.cpp — added a function called getword , which gets the next sequence of contiguous alphanumeric characters from the input stream. This is what we will consider a word to be for the purposes of this assignment. Take a look at main.cpp to see how it is used.

main.cpp — a driver program that parses the command line to determine which algorithm is supposed to be used, and what input/output files are to be used. It also sets up a timer to time how long it takes to read in and insert the strings into the dictionary.

In addition, dsexceptions.h, myInstantiations.cpp, and a makefile have been written. The makefile compiles into the executable program ‘wordcount’.

Note that the #include "AvlTree.cpp" line (and the corresponding one in BST) have been commented out. If you do development on Visual C++, you might need to put the #include back into the code.

You may use the modified code above, or you may download Weiss’s code and write your own routines (why you would do this is not clear), or you may write all of the code yourself (again, it’s not clear why you would do this).

What to Turn In

As in Project #1, you will turn in your code electronically using the turnin program. In addition to the code, please have in your directory a README file that describes any unusual changes you made to the original BST/AVL code. Also, the README file should contain a short report about your experiments. (Alternatively, you may turn in a legible paper report. This might make plotting and reading graphs easier for everyone involved.)

For your report, here are some questions you should try to answer:

How does the performance of BST and the two AVL algorithms perform as the size of the text and the size of the trees grows? (Note that the text size might be large, but the trees might be small if there are a lot of repeated strings.)
On the worst case input for BST (a file of sorted words, for example), how big does the tree have to get before the AVL tree implementations perform better?
For normal text inputs (for example, fiction, web pages, etc.), do the most frequently used words tend to be the same words?
Are there any differences in word frequencies from different input files (for example, English text versus C++ code)?
Are there any other patterns or phenomena you observed?

Software Engineering Tips

First read through the changes in BST/AVL code. Look through main.cpp to see what it is doing and what you will need to modify to conduct your experiments. Write the easy code first (the counter to the BST/AVL classes, the inorder traversal to print out the strings and their counts). Make sure they work before writing the hard code, which is the iterative insert.

You will need to implement some sort of stack of pointers to BinaryNode/AvlNode. It should be fairly easy to copy over the stack code from Weiss’s web site, but you will need to be careful to make sure there aren’t any inappropriate #include’s and to make sure you modify the makefile and myInstantiations.cpp correctly. A suggestion: make sure code that just declares and instantiates a stack in your insert function compiles before writing the real iterative insert code.

When implementing and debugging the iterative insert function, use small test examples and use the PrintTreeDepth routine to see whether your implementation is working. For example, you should create a small test input file(s) that demonstrates all four cases for rotation and then verify that each type of rotation is working correctly. Make sure that no matter at what level the rotations occur, your implementation works.

When it comes time to do experiments, think about how many files and what kinds of files you want to test. Don’t just run the program on ten random files and draw conclusions from that. Try to be scientific about your experiments.

Bonus (2 points)

Run experiments on a Splay tree implementation (for example, Weiss’s code) and compare its performance against the BST and AVL tree implementations. Include your observations in your report.

Bonus (5 points)

Create a hash table implementation and compare its performance against the BST, AVL, and/or Splay tree implementations. Carefully choose your hash function so that it is fast yet distributes strings evenly across the table (or experiment with different implementations). Start with a table of moderate size (say, a prime near 200) and implement a scheme so that when the table is full enough (you decide the criteria), it rehashes to a table approximately twice as large. Include your observations in your report.

Note that these bonus points involve significantly more work than the equivalent "regular" points, so make sure the regular part of the assignment is done before attempting the bonus exercises.

Finally, please note that this project is an individual project — no teams. We will have (2-person) teams in Project #3, so you might start considering with whom you want to be paired.