University of Washington

CSE 373, Data Structures and Algorithms

Project #2: Word Counting

Due: Monday, November 5, 2001, 12:30pm

 

Autumn 2001                                                                                       October 29, 2001

Donald Chinn

 

Project Objectives

 

Word counting is used in all sorts of applications, including text analysis, creating indexes, and even cryptography.  In this programming assignment, you will take a text file and compute what words appear in that file and how many times they appear.

 

In this assignment, you will create a dictionary, inserting each string into it as each is read from the file.  The dictionary will be implemented in four different ways: as a binary search tree, as an AVL tree, as a splay tree, and as a hash table.  You will then determine which words appeared in the file most frequently.  You will run the code on input files of different sizes and different contents to see which data structures seem to perform better in practice.

 

The Assignment

 

Here is a step-by-step list of things you need to do for this assignment, in the approximate order that you should do them:

·        Copy the files below to a directory.

·        Create a new project and add ONLY main.cpp to the project.  (All the other files in the directory will appear as External Dependencies when you compile.)

·        Modify the BST, AVL, splay tree, and hash table templates to have an integer counter in each node, representing the number of times that string has appeared in the text file.  (You will need to identify places, such as the constructors for the trees, where code needs to be changed.)

·        For the four insert functions, add code so that if the insert finds that the item is already there, add 1 to the counter.

·        Write a routine for the BinarySearchTree class, called PrintMostFrequent, that prints out the five strings (along with their counts) that have the highest counts.  If there is a tie for fifth place, you need only print out one of them.  You will want PrintMostFrequent to call a function (call it GetMostFrequent) that recursively traverses the tree to find the strings with the highest counts.

 

Whenever the driver program (see below) uses the binary search tree for the dictionary, it will call PrintMostFrequent after all words have been inserted.

·        Run the program using all four algorithms on three different kinds of input files: (1) sorted input, (2) standard English text, and (3) a type of text of your choosing (some examples are: poetry, C++ code from Weiss’s website or any other source).  Also, run the algorithms on texts of different sizes (enough range to be able to plot the results on a graph to detect any trends in performance).  For each algorithm/input file pair, run the program three times and take the average.

 

Note that to run the programs in VC++, you need to go to the Project/Settings… menu item (and then the Debug tab) and add, for example, -b king_james_bible.txt to the Program arguments field to run with the BST dictionary on the King James Bible.  Also, be sure the active configuration (either Release or Debug) matches the one you just set.  (Go to Build/Set Active Configuration…  to set the active configuration.)

·        Write a short report describing what types of files you performed your experiments on and what conclusions you made.  Your report should include several graphs, where the x-axis is the size of the file being used as input (label each point by its filename) and the y-axis is the running time.  For each type of input (English text, sorted words, C++ code, etc.), you should plot the running times of all four algorithms on the same graph so that you can see which performs better.  (Thought question: what other quantities might be appropriate to be used on the x-axis instead of file size?)

·        Print out a hardcopy of your PrintMostFrequent function and turn it in.

 

What Code is Provided

 

We have modified code from Weiss’s web site and added a driver program as follows:

·        BinarySearchTree.{h,cpp} — added a PrintTreeDepth function, which allows you to visualize how the tree is organized.  There is a dummy function for the PrintMostFrequent function you will need to write.

·        AvlTree.{h,cpp} — added a PrintTreeDepth as in the BST code.  If you choose to do the bonus part of project (see below), this will be useful when you are debugging your implementation of iterative insertion.

·        SplayTree.{h,cpp} — modified the code a little so that it can coexist with BinarySearchTree.{h,cpp}.

·        QuadraticProbing.{h,cpp} — hash table using quadratic probing and with resizing

·        vector.h, vector.cpp — used in the hash table code.

·        mystring.h, string.cpp — added a function called getword , which gets the next sequence of contiguous alphanumeric characters from the input stream.  This is what we will consider a word to be for the purposes of this assignment.  Take a look at main.cpp to see how it is used.

·        main.cpp — a driver program that parses the command line to determine which algorithm is supposed to be used, and what input file is used.  It also sets up a timer to time how long it takes to read in and insert the strings into the dictionary.

 

In addition, dsexceptions.h is provided.

 

What Text Files are Provided

 

As an example of a large piece of text, the King James Bible is provided.  This was obtained from Project Gutenburg, whose website is http://promo.net/pg .  You might find that website a good source of text.  Web pages (saved as text files) are another good source of text.  There are also five files of different sizes whose words appear in sorted order.

 

What to Turn In

 

As in Project #1, we will use the electronic turnin program.  But instead of turning in code, you will turn in the text files that you used for your experiments:

·        Place all of your text files (except the King James Bible and the files with sorted words) into a directory and zip them using WinZip.  (Start up WinZip and just click on “New”, select a name for your archive (call it textfiles.zip), select the files you want to place into the archive, and a file named textfiles.zip will appear in your directory.)  textfiles.zip is what you will turn in electronically.

 

In addition to the text files, you will turn in (hard copy) a short report describing the results of your experiments.  It should include (at least) three graphs, one for each type of text input (English text, sorted input, and the type of your choice).  The King James Bible should be one of the English text files you use.  The three graphs will compare the insertion times of the different data structures on different text size.  See “The Assignment” section for a description of how the graphs should be plotted.

 

For your report, here are questions you should answer:

1.      For each of the files you ran tests on, what were the five most frequent words that appeared in the file?

2.      How long does it take just to read in the King James Bible file without doing any kind of insertions into a data structure?  (Hint: Use the –i option.  The –i option is just a dummy function.  It will call real code if you do the bonus part of the project.)

3.      How do the four algorithms perform as the size of the text and the size of the trees grows?  (Note that the text size might be large, but the trees might be small if there are a lot of repeated strings.)

4.      On the worst case input for BST (a file of sorted words, for example), about how big does the tree have to get before the AVL tree implementation perform better?

5.      For normal text inputs (for example, fiction, web pages, etc.), do the most frequently used words tend to be the same words?

6.      Are there any differences in word frequencies from different input files (for example, English text versus C++ code)?

7.      Are there any other patterns or phenomena you observed?

 

Finally, you should turn in a hard copy of your PrintMostFrequent function (not the entire BinarySearchTree.cpp!).

 

 

Software Engineering Tips

 

First read through the changes in BST/AVL code.  Look through main.cpp to see what it is doing to help you conduct your experiments.  Write the easy code first (the counter to the classes) and then the harder code (PrintMostFrequent).  Most importantly, understand what the code is doing before modifying any of it (especially main.cpp).

 

When it comes time to do experiments, think about how many files and what kinds of files you want to test.  Don’t just run the program on a bunch of random files and draw conclusions from that.  Be scientific about your experiments by deciding beforehand what kinds of things you are looking for and then collecting text files of a particular kind.

 

 

Bonus (10 points)

Create a function for the AvlTree template called insert_iter , which inserts an item, but does so iteratively rather than recursively.  Run the code on the text files you used in this project (use the –i option) and report your performance results.

 

You will turn in (electronically) AvlTree.cpp .  After adding stack code, this should be the only file you have to modify to do the bonus.

 

You will need to have some sort of stack of pointers to BinaryNode/AvlNode.   Copy the stack code from Weiss’s web site.  You will need to be careful to make sure the right #include’s are in your files.  A suggestion: make sure code that just declares and instantiates a stack in your insert function compiles before writing the real iterative insert code.

 

When implementing and debugging the iterative insert function, use small test examples and use the PrintTreeDepth routine to see whether your implementation is working.  For example, you should create a small test input file(s) that demonstrates all four cases for rotation and then verify that each type of rotation is working correctly.  Make sure that no matter at what level the rotations occur, your implementation works.

 

Note that these bonus points involve significantly more work than the equivalent “regular” points, so make sure the regular part of the assignment is done before attempting this bonus exercise.