University of
Washington
CSE 373, Data Structures and Algorithms
Project #2: Word Counting
Due: Monday, November 5, 2001, 12:30pm
Autumn 2001 October 29, 2001
Donald Chinn
Word counting is used in all sorts of
applications, including text analysis, creating indexes, and even
cryptography. In this programming
assignment, you will take a text file and compute what words appear in that
file and how many times they appear.
In this assignment, you will create a
dictionary, inserting each string into it as each is read from the file. The dictionary will be implemented in four
different ways: as a binary search tree, as an AVL tree, as a splay tree, and
as a hash table. You will then determine
which words appeared in the file most frequently. You will run the code on input files of different sizes and
different contents to see which data structures seem to perform better in
practice.
The
Assignment
Here is a step-by-step list of things you need
to do for this assignment, in the approximate order that you should do them:
·
Copy the files below to a directory.
·
Create a new project and add ONLY main.cpp to
the project. (All the other files in
the directory will appear as External Dependencies when you compile.)
·
Modify the BST, AVL, splay tree, and hash table
templates to have an integer counter in each node, representing the number of
times that string has appeared in the text file. (You will need to identify places, such as the constructors for
the trees, where code needs to be changed.)
·
For the four insert functions, add code so that
if the insert finds that the item is already there, add 1 to the counter.
·
Write a routine for the BinarySearchTree class,
called PrintMostFrequent, that prints out the five strings (along with
their counts) that have the highest counts.
If there is a tie for fifth place, you need only print out one of
them. You will want PrintMostFrequent
to call a function (call it GetMostFrequent) that recursively traverses the
tree to find the strings with the highest counts.
Whenever the driver
program (see below) uses the binary search tree for the dictionary, it will
call PrintMostFrequent after all words have been inserted.
·
Run the program using all four algorithms on
three different kinds of input files: (1) sorted input, (2) standard English
text, and (3) a type of text of your choosing (some examples are: poetry, C++
code from Weiss’s website or any other source). Also, run the algorithms on texts of different sizes (enough
range to be able to plot the results on a graph to detect any trends in
performance). For each algorithm/input
file pair, run the program three times and take the average.
Note that to run the
programs in VC++, you need to go to the Project/Settings… menu item (and then
the Debug tab) and add, for example, -b king_james_bible.txt to the
Program arguments field to run with the BST dictionary on the King James
Bible. Also, be sure the active
configuration (either Release or Debug) matches the one you just set. (Go to Build/Set Active Configuration… to set the active configuration.)
·
Write a short report describing what types of
files you performed your experiments on and what conclusions you made. Your report should include several graphs,
where the x-axis is the size of the file being used as input (label each point
by its filename) and the y-axis is the running time. For each type of input (English text, sorted words, C++ code,
etc.), you should plot the running times of all four algorithms on the same
graph so that you can see which performs better. (Thought question: what other quantities might be appropriate to
be used on the x-axis instead of file size?)
·
Print out a hardcopy of your PrintMostFrequent
function and turn it in.
What
Code is Provided
We have modified code from Weiss’s web site and
added a driver program as follows:
·
BinarySearchTree.{h,cpp} — added a
PrintTreeDepth function, which allows you to visualize how the tree is
organized. There is a dummy function
for the PrintMostFrequent function you will need to write.
·
AvlTree.{h,cpp} — added a PrintTreeDepth as in
the BST code. If you choose to do the
bonus part of project (see below), this will be useful when you are debugging
your implementation of iterative insertion.
·
SplayTree.{h,cpp} — modified the code a little
so that it can coexist with BinarySearchTree.{h,cpp}.
·
QuadraticProbing.{h,cpp} — hash table using
quadratic probing and with resizing
·
vector.h, vector.cpp — used in the hash table
code.
·
mystring.h, string.cpp — added a function called
getword , which gets the next sequence of contiguous alphanumeric characters
from the input stream. This is what we
will consider a word to be for the purposes of this assignment. Take a look at main.cpp to see how it is
used.
·
main.cpp — a driver program that parses the
command line to determine which algorithm is supposed to be used, and what
input file is used. It also sets up a
timer to time how long it takes to read in and insert the strings into the
dictionary.
In addition, dsexceptions.h is provided.
What Text Files are Provided
As an example of a large piece of text, the King
James Bible is provided. This was
obtained from Project Gutenburg, whose website is http://promo.net/pg . You might find that website a good source of
text. Web pages (saved as text files)
are another good source of text. There
are also five files of different sizes whose words appear in sorted order.
What
to Turn In
As in Project #1, we will use the electronic
turnin program. But instead of turning
in code, you will turn in the text files that you used for your experiments:
·
Place all of your text files (except the
King James Bible and the files with sorted words) into a directory and zip them
using WinZip. (Start up WinZip and just
click on “New”, select a name for your archive (call it textfiles.zip), select
the files you want to place into the archive, and a file named textfiles.zip
will appear in your directory.)
textfiles.zip is what you will turn in electronically.
In addition to the text files, you will turn in
(hard copy) a short report describing the results of your experiments. It should include (at least) three graphs,
one for each type of text input (English text, sorted input, and the type of
your choice). The King James Bible
should be one of the English text files you use. The three graphs will compare the insertion times of the
different data structures on different text size. See “The Assignment” section for a description of how the graphs
should be plotted.
For your report, here are questions you should
answer:
1. For
each of the files you ran tests on, what were the five most frequent words that
appeared in the file?
2. How
long does it take just to read in the King James Bible file without doing any
kind of insertions into a data structure?
(Hint: Use the –i option. The –i
option is just a dummy function. It
will call real code if you do the bonus part of the project.)
3. How
do the four algorithms perform as the size of the text and the size of the trees
grows? (Note that the text size might
be large, but the trees might be small if there are a lot of repeated strings.)
4. On
the worst case input for BST (a file of sorted words, for example), about how
big does the tree have to get before the AVL tree implementation perform
better?
5. For
normal text inputs (for example, fiction, web pages, etc.), do the most
frequently used words tend to be the same words?
6. Are
there any differences in word frequencies from different input files (for
example, English text versus C++ code)?
7. Are
there any other patterns or phenomena you observed?
Finally, you should turn in a hard copy of your PrintMostFrequent
function (not the entire BinarySearchTree.cpp!).
Software
Engineering Tips
First read through the changes in BST/AVL
code. Look through main.cpp to see what
it is doing to help you conduct your experiments. Write the easy code first (the counter to the classes) and then
the harder code (PrintMostFrequent). Most
importantly, understand what the code is doing before modifying any
of it (especially main.cpp).
When it comes time to do experiments, think
about how many files and what kinds of files you want to test. Don’t just run the program on a bunch of
random files and draw conclusions from that.
Be scientific about your experiments by deciding beforehand what
kinds of things you are looking for and then collecting text files of a
particular kind.
Bonus
(10 points)
Create a function for the AvlTree template
called insert_iter , which inserts an item, but does so iteratively rather than
recursively. Run the code on the text
files you used in this project (use the –i option) and report your performance
results.
You will turn in (electronically) AvlTree.cpp
. After adding stack code, this should
be the only file you have to modify
to do the bonus.
You will need to have some sort of stack of
pointers to BinaryNode/AvlNode. Copy
the stack code from Weiss’s web site.
You will need to be careful to make sure the right #include’s are in
your files. A suggestion: make sure
code that just declares and instantiates a stack in your insert function
compiles before writing the real
iterative insert code.
When implementing and debugging the iterative
insert function, use small test examples and use the PrintTreeDepth routine to
see whether your implementation is working.
For example, you should create a small test input file(s) that
demonstrates all four cases for rotation and then verify that each type of
rotation is working correctly. Make
sure that no matter at what level the rotations occur, your implementation
works.
Note that these bonus points involve
significantly more work than the equivalent “regular” points, so make sure the
regular part of the assignment is done before attempting this bonus exercise.