University of Washington

CSE 326, Data Structures

Project #2: Word Counting

Due: Friday, February, 16, 2001, 1:30pm

Winter 2001 February 8, 2001

Donald Chinn

 

Word counting is used in all sorts of applications, including text analysis, creating indexes, and even cryptography. In this programming assignment, you will take a text file and compute what words appear in that file and how many times they appear.

The objectives in this assignment are: (1) to increase your familiarity with binary search trees and AVL trees, (2) learn how to conduct experiments to compare the performance of data structures and their implementations, (3) understand how different inputs can affect performance, and (4) reading and modifying other people’s code.

In this assignment, you will create a dictionary, inserting each string into it as each is read from the file. The dictionary will be implemented in three different ways: as a binary search tree, as an AVL tree that uses a recursive insert algorithm, and as an AVL tree with an iterative insert algorithm. Then you will write the strings in sorted order, along with their counts, to an output file. You will run the code on input files of different sizes and different contents to see which data structures seem to perform better in practice.

The Assignment

Here is a step-by-step list of things you need to do for this assignment, in the approximate order that you should do them:

What Code is Provided

We have modified code from Weiss’s web site and added a driver program as follows:

In addition, dsexceptions.h, myInstantiations.cpp, and a makefile have been written. The makefile compiles into the executable program ‘wordcount’.

Note that the #include "AvlTree.cpp" line (and the corresponding one in BST) have been commented out. If you do development on Visual C++, you might need to put the #include back into the code.

You may use the modified code above, or you may download Weiss’s code and write your own routines (why you would do this is not clear), or you may write all of the code yourself (again, it’s not clear why you would do this).

What to Turn In

As in Project #1, you will turn in your code electronically using the turnin program. In addition to the code, please have in your directory a README file that describes any unusual changes you made to the original BST/AVL code. Also, the README file should contain a short report about your experiments. (Alternatively, you may turn in a legible paper report. This might make plotting and reading graphs easier for everyone involved.)

For your report, here are some questions you should try to answer:

  1. How does the performance of BST and the two AVL algorithms perform as the size of the text and the size of the trees grows? (Note that the text size might be large, but the trees might be small if there are a lot of repeated strings.)
  2. On the worst case input for BST (a file of sorted words, for example), how big does the tree have to get before the AVL tree implementations perform better?
  3. For normal text inputs (for example, fiction, web pages, etc.), do the most frequently used words tend to be the same words?
  4. Are there any differences in word frequencies from different input files (for example, English text versus C++ code)?
  5. Are there any other patterns or phenomena you observed?

Software Engineering Tips

First read through the changes in BST/AVL code. Look through main.cpp to see what it is doing and what you will need to modify to conduct your experiments. Write the easy code first (the counter to the BST/AVL classes, the inorder traversal to print out the strings and their counts). Make sure they work before writing the hard code, which is the iterative insert.

You will need to implement some sort of stack of pointers to BinaryNode/AvlNode. It should be fairly easy to copy over the stack code from Weiss’s web site, but you will need to be careful to make sure there aren’t any inappropriate #include’s and to make sure you modify the makefile and myInstantiations.cpp correctly. A suggestion: make sure code that just declares and instantiates a stack in your insert function compiles before writing the real iterative insert code.

When implementing and debugging the iterative insert function, use small test examples and use the PrintTreeDepth routine to see whether your implementation is working. For example, you should create a small test input file(s) that demonstrates all four cases for rotation and then verify that each type of rotation is working correctly. Make sure that no matter at what level the rotations occur, your implementation works.

When it comes time to do experiments, think about how many files and what kinds of files you want to test. Don’t just run the program on ten random files and draw conclusions from that. Try to be scientific about your experiments.

 

Bonus (2 points)

Run experiments on a Splay tree implementation (for example, Weiss’s code) and compare its performance against the BST and AVL tree implementations. Include your observations in your report.

Bonus (5 points)

Create a hash table implementation and compare its performance against the BST, AVL, and/or Splay tree implementations. Carefully choose your hash function so that it is fast yet distributes strings evenly across the table (or experiment with different implementations). Start with a table of moderate size (say, a prime near 200) and implement a scheme so that when the table is full enough (you decide the criteria), it rehashes to a table approximately twice as large. Include your observations in your report.

Note that these bonus points involve significantly more work than the equivalent "regular" points, so make sure the regular part of the assignment is done before attempting the bonus exercises.

Finally, please note that this project is an individual project — no teams. We will have (2-person) teams in Project #3, so you might start considering with whom you want to be paired.