Project 3
History Sleuths!
Due 10 PM, Wednesday, July 31st
I. Introduction
You have just been approached by a world-famous UW history
professor. He would like you to settle
a centuries-old debate: who wrote Shakespeare's plays,
Shakespeare or
Sir Francis
Bacon? You protest that this question is surely outside your
area of expertise. "Oh, no," chuckles the historian, stroking his
snowy-white beard. "I need a Computer Scientist!"
II. Learning Objectives
For this assignment, you will:
- Be introduced to the DictionaryADT and understand the
strengths of various DictionaryADT implementations.
- Gain proficiency with Splay or AVL trees.
- Become comfortable with implementing a HashTable.
- Experience the joys and pains of code reuse.
- Develop a means to test your code (a testing harness).
III. Word Frequency Analysis
The professor suspects that some authors use particular words more often
than others. He hopes that if we study the frequency with which authors
use words, we may be able to come up with a word usage "signature" for
that author. This signature should be quite consistent across a particular
author's works, but vary greatly between authors.
The professor wants you to compare three works of Shakespeare
(Hamlet,
All's Well That
Ends Well, and one Shakespearean work of your choice)
with three of Bacon's works
(The New Atlantis,
The Essays, and one Bacon text of
your choice). You will do this by counting the number of times that
each word occurs in each text. The output should be in the following
format:
970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 you
... where the first number is the frequency with which the second string
occurs in the text. The output should be in decreasing order of
frequency, with words that share the same frequency sorted in
alphabetical order. Strangely enough, the professor wants you to hand
in this project using the turnin program to the CSE 326 course
staff. He would like a copy of your source code, compilation
instructions, and a 1-2 paragraph answer to his question: based on
the data you have accumulated, did Bacon write Shakespeare's plays?
IV. Teams
You may work on a team of at most two for this project. You may
choose how to divide the work, under three conditions: first, document
each team member's effort in the README file; second, work together
and make sure everybody understands your answers to the
README questions below; and third, understand (at least) at a high
level how your team members' code is structured and how it works.
Remember to test your team's code as a whole to make sure that
your portions work together properly! Also, be aware that except in
extreme cases when you notify us in advance of the deadline,
all team members will receive the same grade for the project.
Since you will be working in groups for the rest of your academic and
professional careers, we strongly encourage you to work in a team for
this project! Feel free to ask for partner(s) over the
cse326@cs discussion list. Students working alone will be
required to implement one additional Above
and Beyond project. If you choose to work in a team,
you will be required to implement two additional
Above and Beyond projects. These additional
projects are part of your required work, and will not count towards
extra credit.
V. Project and Program Requirements
Your word counting program should implement the following
algorithm:
- Open the text file specified on the command line.
- For each string s in the file:
  - "Clean up" s so that it is a valid English word (all
    letters should be lower case, and leading and trailing
    punctuation should be removed; you can also make any other
    changes you find interesting).
  - Insert s into your Dictionary as a key; its value is the
    number of times s has occurred so far.
- When the file is completely processed, sort the (word, frequency)
  pairs by frequency.
- Print the (frequency, word) pairs.
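To make the loop above concrete, here is a minimal Java sketch of the counting pass. It is an illustration only: the class name WordCountSketch and the helper cleanUp are invented for this example, and java.util.HashMap stands in for the DictionaryADT implementation that your real program must supply.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountSketch {

    // Lower-case the token and strip leading/trailing non-letter characters.
    static String cleanUp(String s) {
        s = s.toLowerCase();
        int i = 0, j = s.length();
        while (i < j && !Character.isLetter(s.charAt(i))) i++;
        while (j > i && !Character.isLetter(s.charAt(j - 1))) j--;
        return s.substring(i, j);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for your own DictionaryADT implementation.
        Map<String, Integer> counts = new HashMap<String, Integer>();

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = cleanUp(tokens.nextToken());
                if (word.length() == 0) continue;  // token was all punctuation
                Integer seen = counts.get(word);
                counts.put(word, seen == null ? 1 : seen + 1);
            }
        }
        in.close();
        // ...sort the (word, frequency) pairs and print "frequency word" lines here.
    }
}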
You will find that the text files we give you contain some
extraneous information (e.g., copyright notices). Don't waste your
time trying to delete this, as it will have a negligible effect
on the output.
For the base requirements, you must:
- Implement the word-counting program as specified above.
- Give one hash table and one search-tree implementation of the
DictionaryADT. You may choose which hash table and
which search-tree you will implement, but they must be used in your
word counting program, and the user should be able to choose their
data structure based on the command-line arguments specified below.
- Sort the word frequencies using HeapSort (hint: you
can reuse your heap from project 2); a sketch of the required
pair ordering appears after this list.
- Submit two stand-alone programs (called unit
tests) which test the basic functionality of your two
DictionaryADT implementations. See dictionary_test.cc or
DictionaryTest.java for an example of a unit test.
- Implement one Above and Beyond
extension (if you are working in a team, you will be required to
implement a total of at least two "Above and Beyond"
projects).
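As noted in the HeapSort item above, the output ordering (decreasing frequency, ties broken alphabetically) boils down to a single comparison. Here is one Java sketch of a pair class your heap might order; the name WordFreq is illustrative and not part of the provided code.

// One (word, frequency) pair, ordered by decreasing frequency,
// then alphabetically for equal frequencies.
public class WordFreq implements Comparable<WordFreq> {
    public final String word;
    public final int freq;

    public WordFreq(String word, int freq) {
        this.word = word;
        this.freq = freq;
    }

    public int compareTo(WordFreq other) {
        if (freq != other.freq) {
            return other.freq - freq;       // higher frequency comes first
        }
        return word.compareTo(other.word);  // alphabetical tie-break
    }

    public String toString() {
        return freq + " " + word;           // matches the "970 the" output format
    }
}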
You should turn in three programs: two unit tests, and one
word-counting executable called word_count.
word_count should read the name of the input file from the
command line and print its results to stdout (cout
in C++, and System.out in Java). In addition, the user
should be able to specify which DictionaryADT implementation
word_count will utilize with the following command-line
arguments (you do not need to implement all of these features; some
of these are Above and Beyond
options!). Do not deviate from the specified input format, since
we will be using automated scripts to test your program.
- -b - Count frequencies using an unbalanced binary
search tree (this code is provided)
- -a - Count frequencies using an AVL tree
- -s - Count frequencies using a splay tree
- -ho - Count frequencies using an open-hashing hash
table
- -hc1 - Count frequencies using a closed-hashing hash
table with linear probing
- -hc2 - Count frequencies using a closed-hashing hash
table with quadratic probing
You may find it useful to use the magic of redirection in Unix, so
that you can test your program like this:
./word_count -a hamlet.txt > hamlet-freq.txt
(Counts the words in hamlet.txt with an AVL tree, and
places the result in hamlet-freq.txt)
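If you are wondering how the flags might select the data structure, here is a tiny, self-contained Java sketch of the dispatch idea. It uses java.util.TreeMap and java.util.HashMap purely as stand-ins; in the real word_count each branch would construct one of your own DictionaryADT implementations instead.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class DictionaryChooser {
    // Illustrative flag dispatch only; the standard-library maps are
    // placeholders for your own tree and hash table classes.
    static Map<String, Integer> makeDictionary(String flag) {
        switch (flag) {
            case "-b":
            case "-a":
            case "-s":
                return new TreeMap<String, Integer>();  // tree-backed stand-in
            case "-ho":
            case "-hc1":
            case "-hc2":
                return new HashMap<String, Integer>();  // hash-backed stand-in
            default:
                throw new IllegalArgumentException("unknown flag: " + flag);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = makeDictionary(args.length > 0 ? args[0] : "-b");
        System.out.println("using " + dict.getClass().getSimpleName());
    }
}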
Have fun, and uncover deep historical truths!
VI. Files and Sample Code
Sample texts are provided in the course directory, which is located at
/cse/courses/cse326/02su/project3/. You can also get texts of
your own at Project Gutenberg, which
has thousands of books as plain text files! Their mission is to provide
electronic versions of many popular public domain texts. Check it out!
Try running your word-counting program on the King James Bible.
(Guess which word comes up more frequently in the Bible: "he" or "she"...
and by a factor of what?). Also, if you have any special requests for
texts or other cool files you'd like to have added to the test files,
email the course
staff.
In addition, your history professor has provided some code which he
wrote (these days, everybody knows how to program!). You may use it
if you wish, although your code must follow the provided
DictionaryADT interface (your professor wants to reuse your
Dictionary code elsewhere!):
- Dictionary - Specification of an abstract base
class/interface for a DictionaryADT. Your classes
must implement this interface/base class (a hypothetical sketch
of such an interface appears after this list).
- BinarySearchTree - Specification and implementation of
an unbalanced binary search tree class. Use of the provided
BinarySearchTree implementation is optional (you may choose
to implement your own), but your AVL and Splay tree classes
must inherit from your BinarySearchTree
implementation.
- BSTNode - Specification and implementation of a
binary search tree node (used in the BinarySearchTree
class).
- dictionary_test - Test program which illustrates how to
insert elements into an unbalanced binary search tree and
sort its data.
- TemplateInst - Instantiation file (for C++
students only). Generates the object code for the templated
BinarySearchTree and BSTNode classes.
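If you do not have the course directory in front of you, the following hypothetical Java sketch suggests roughly what a minimal DictionaryADT interface could look like. The method names used here (insert, find, size) are assumptions made for illustration; the provided Dictionary file is the authoritative specification.

// Hypothetical sketch only; the provided Dictionary file defines the
// real interface. A dictionary maps String keys to integer values
// (here, word frequencies).
public interface Dictionary {
    // Associate value with key, replacing any previous value for that key.
    void insert(String key, int value);

    // Return the value stored for key, or null if the key is absent.
    Integer find(String key);

    // Number of (key, value) pairs currently stored.
    int size();
}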
In addition, auto-generated documentation for the provided code is
available for C++ and Java. If you are wondering about the weird commenting style in the source files,
these comments provide hints to some source documentation tools.
Specifically,
we used "doxygen" for
C++ and "javadoc" for Java.
Javadoc is a standard Java tool. Doxygen is a popular open-source
documentation generator. They are both very powerful. Check them out.
Doxygen is also fairly compatible with Javadoc, so you can use it to document
Java code; however, Javadoc is the standard for Java, so it was used here. Oh,
and for fun: aside from HTML, Doxygen can also generate
man pages
(and LaTeX source with PDFs,
RTF, XML, ...).
VII. Submitting Your Work
You are expected to turn in all of your code; use the turnin
utility to submit your project (the project is called "project3" in the
turnin system). You will also need to include a README
which contains the following information:
- The names of your team members and your team's name.
- If you worked in a team, a breakdown of the work.
- (Answer this question before you begin.) How long do you
think this project will take you to finish?
- How much time did you actually spend on this project?
- Acknowledgement of any assistance you received from anyone but
your team members, the 326 staff, or the Weiss book.
- An explanation of how to compile your program.
Also, please answer the following questions (justify your answers,
of course; 3-4 sentences each should be fine):
- Give a one to two paragraph explanation of whether or not you
think Bacon wrote Shakespeare's plays based on the data you
collected. No fancy statistical analyses here; keep it fun
and simple.
- How difficult was it to reuse your existing heap code in a new
context? What helped, or could have helped, to make it easier?
- In general, which DictionaryADT implementation was
"better": trees or hash tables? Note that you will need to
define "better" (ease of coding, ease of debugging, memory
usage, disk access patterns, runtime for average input, runtime
for all input, etc).
- Give a detailed description of all Above and Beyond projects
which you implemented. What did you find most challenging
about the projects you implemented? What did you find most
interesting?
- Although they do not matter asymptotically, different constants
for your algorithm will affect your program's runtime. In
particular, your program will have to traverse your BST twice
for every word inserted: once to find the word, and a second
time to update the frequency. Describe how you might modify
your interfaces such that only one tree traversal is
necessary. You should try to keep your data structures
generalized and maintain good data hiding. Do you think these
constants will have a large impact on your program's runtime?
Is asymptotic analysis an accurate measure of your program's
runtime? Why?
- (For Java students) An often-touted and often-lamented feature
of Java is its object system. Specifically, creating a
container of Objects necessitates casting when objects are
removed from the container, and atomic types such as
int must be promoted to class instances (such as the
Integer class). Comment on some of the weaknesses or
strengths of Java's object system which you noticed in this
assignment.
- (For C++ students) An often-touted and often-lamented feature
of C++ is the const keyword. Specifically, attempting
to create a "safe" interface for a class often makes the class
implementor's job more complex (see the mutable
keyword in BinarySearchTree.hh for an example).
Comment on some of the weaknesses or strengths of C++'s
const keyword, especially in light of C++'s ability to
freely ignore an object's "constness" (using
the
const_cast operator).
- What literary character does Brian most remind you of? Hannah?
Albert?
- What did you enjoy about this assignment? What did you hate?
Could we have done anything better?
VIII. Grading Breakdown
Each part of the assignment will be worth (approximately) the
following percentages. Please budget your time appropriately.
- 40%: Program correctness (including boundary cases) and
error-free compilation
- 25%: Architecture/design of program; style, commenting, layout,
and other forms of program documentation
- 25%: README and answers to the writeup questions
- 10%: Quality and comprehensiveness of the turned-in unit tests
IX. Going Above and Beyond
- Alternative Program Implementations -
For this assignment, you had the ability to choose which type of
hash table and which type of tree to implement. Implement the
other type of tree or hash table. Each additional data structure is
worth one "Above and Beyond" project. For instance, if you
implemented a closed-hashing hash table and a splay tree, two
possible "Above and Beyond" projects would be implementing an
AVL tree and an open-hashing hash table. Write 1-2 paragraphs in
your README on the comparative runtime (you will need to measure
this somehow) and ease-of-coding for your additional data structures.
- Algorithm Design Techniques -
If you wrote your tree algorithms iteratively, re-implement them to
be recursive (or vice versa), and answer the following questions in
your README: Which algorithm design technique did you find
easier to code? Which was more elegant? Which had a faster runtime?
How would you define "better", and which design technique do you
think is "better" in this program? Can you think of situations
where one design technique is not applicable?
- Alternative Trees -
Implement the program with Red-Black trees. The advantage of
Red-Black trees is that you can write non-recursive
insertion/deletion algorithms (the trade-off is that Red-Black trees
have a weaker balancing condition, though they still guarantee O(log n)
depth). Don't cheat; write non-recursive algorithms here. In
your README, comment on which tree implementation was easiest to
write/debug, which was the fastest, and, if you needed to write a
tree for general use (e.g., a tree to be used by all the 326 students
for all their projects), which would it be: an unbalanced BST, an
AVL tree, a splay tree, or a red-black tree? Why?
- Keeping Performance Information -
Add code to your program so that you can track the number of
comparisons and the number of rotations performed by your tree.
For this project, you will need to have implemented both a splay
tree and an AVL tree. Predict how the two trees would
compare. How did they actually compare? Were you surprised?
- Alternative Hashing Strategies -
If you wrote a closed-hashing table, implement linear,
quadratic, and one other probing strategy (you may make up your
own, if you wish); a small index-computation sketch appears after
this list. The user should be able to select their probing
strategy with command-line arguments. Does one probing
strategy always work better or worse than the others? Why do you
think this is the case? Are there types of input for which one
probing strategy works better than another? Which has a
greater impact on your hash table's performance: the hash
function, or the probing strategy? If you wrote an open-hashing
table, implement a secondary dictionary instead of a linked list
(perhaps you can reuse your tree implementation?). In your
README, answer the following questions: does a secondary
dictionary increase or decrease the runtime for your hash
table for all inputs? On some inputs? How difficult was
it to implement a secondary dictionary?
- Data Locality -
Add code to your binary search tree which keeps track of the
average depth of a node in the tree over the course of a run.
Compare the average depths of some very common and some very
uncommon words for unbalanced binary search trees, AVL trees,
and splay trees over the course of parsing a file.
- Profiling -
Profile your program using
gprof
(for C++ students) or
hprof
(for Java students). A profiler is a tool which enables the
programmer to obtain detailed information about how their
program performs, such as the number of times a function is
called, or how much of the program's runtime was spent in a
particular function call. Compare two tree or two hash table
implementations using a profiler. Your README should include a
paragraph with the following information:
- Your expected performance bottlenecks
- How your expectations differed from (or matched)
the profiler's results.
- Did you find anything unusual/unexpected?
- What is the biggest bottleneck for your program, and what
you think may be the cause.
- Deletion -
Currently, the DictionaryADT interface which we have
provided does not support deletion of elements. Add deletion to
the interface and to all data structures that you've written
which implement this interface.
- Algorithmic Analysis -
Implement SelectionSort and a sorting algorithm other than
HeapSort. This second algorithm should have an asymptotic runtime
equal to, or better than, HeapSort. Your README should include
a short paragraph comparing the three algorithms (including
plots of your sorting algorithms when run on your own unit
tests). You should make a convincing argument regarding the
comparative runtimes of SelectionSort, HeapSort, and your
chosen sorting algorithm.
If you choose this option, the new sorting algorithm should
also be part of your word counter. Please add a command line argument
to word_count: -sort <name of sort>. Be sure
to document in your README the sorting algorithms that your
program can use.
- Introspective Sort -
Introspective
sort is an unstable QuickSort variant which switches
to HeapSort for inputs that would result in O(n^2) behavior for
normal QuickSort. Thus, it has both an average-case and a worst-case
runtime of O(n log n), but generally runs faster than
HeapSort even in the worst case. Implement IntroSort, and
give a sample input which would result in a quadratic runtime for
normal QuickSort (using a median-of-3 partitioning scheme).
- Iterators -
Implement iterators for your DictionaryADT and the classes which
implement the DictionaryADT. Sample code will be provided for
this option. In your README, comment on whether you think the
added complexity of writing an iterator outweighs the
simplification of your algorithms, and any difficulties you
found while writing your iterators.
- Visualization -
We have provided you with a primitive method for printing out trees.
Make a full-blown tree visualization tool to better test and debug
tree code (Java has built-in graphics primitives; C++ students
can use the SimpleWindow class from the MazeRunner
assignment). This option is worth two "Above and Beyond"
projects.
- Word Stemming -
Word stemming is a process in which:
- Common word suffixes like "ing", "ed", and "ly" are removed,
- Plural words are made singular,
- ... well, you get the picture. Other simplifications
include removing prefixes and changing verbs to
nouns if possible.
So, a word-stemming word-frequency counter would count the word
"buffalo" twice in the following sentence: "The bald buffalo
charged at the herd of shaggy buffalos".
Note that simply removing certain letters or strings at the end
of words will not work: "Swiss" is not the plural of "Swis", nor
is "bed" the past tense of "b". Simple word-stemming algorithms
can be found on the internet, so your best reference will probably
be a search engine like
Google. Please only use the
web as an algorithm reference; do not copy code directly.
Stemming algorithms of interest include the Porter
Stemming Algorithm (a complicated but very widely-used 5-step
suffix-removal algorithm) and the
Paice/Husk Stemming Algorithm (a simpler iterative
suffix-remover). This option is worth two "Above and
Beyond" projects.
- Word co-occurrence -
A phenomenon even more interesting than word frequency is word
co-occurrence. Create a new word_count that counts the
frequency of pairs of words. Your code should insert as a pair
any two words which appear within k words of each other, where k
is a command line argument (try 10 for test runs). How do BST,
AVL, and splay trees compare now? This option is worth two
"Above and Beyond" projects.
- Latent Semantic Analysis -
The underlying theory behind word co-occurrence is what is known
as Latent Semantic Analysis. Check out the
LSA website at the University of Colorado
for more information, and modify the
word_count program to find possible
polysemies
or
synonymies.
This option is worth three "Above and Beyond" projects.
- Go crazy! -
We're letting you express your creative side here. Have a cool
idea that's not on this list? Then go for it! If you want to
go drastically beyond the basic project specifications, check
with Brian and Hannah before you start. Of course, your code
should still meet the basic requirements.
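As promised under "Alternative Hashing Strategies" above, here is a small, self-contained Java sketch of how the linear and quadratic probe indices are typically computed. It illustrates only the probing formulas, not a complete hash table, and the class and method names are invented for this example.

// Minimal illustration of closed-hashing probe sequences.
// home is the initial slot (hash(key) % tableSize); i is the probe number (0, 1, 2, ...).
public class ProbeSketch {
    static int linearProbe(int home, int i, int tableSize) {
        return (home + i) % tableSize;      // home, home+1, home+2, ...
    }

    static int quadraticProbe(int home, int i, int tableSize) {
        return (home + i * i) % tableSize;  // home, home+1, home+4, home+9, ...
    }

    public static void main(String[] args) {
        int tableSize = 11, home = 7;
        for (int i = 0; i < 5; i++) {
            System.out.println("probe " + i
                    + ": linear=" + linearProbe(home, i, tableSize)
                    + ", quadratic=" + quadraticProbe(home, i, tableSize));
        }
    }
}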
X. Interesting Tidbits
- Word frequency analysis plays an important role in
Cryptanalysis, the science of breaking secretly encoded
messages. The first mention of using the frequency distribution
of a language to break codes was in a 14-volume Arabic
encyclopedia written by al-Qalqashandi in 1412. The idea is
attributed to Ibn al-Duraihim, the brilliant 13th century Arab
Cryptanalyst. If you are interested in cryptography, be sure to
check out The
Code Book by Simon Singh. It is a great introduction to
cryptography and cryptanalysis.
- Think computer science is all fun and games?
The Codebreakers, by David Kahn, is a fascinating look at
many of the problems solved by cryptanalysis, from breaking
WWII secret messages to the very question of who wrote
Shakespeare's works!
XI. Credits
This assignment is starting to become a fixture in 326. The first word
counter appeared during winter 2000, when 326 was taught by the famous
Steve Wolfman. The snowy-bearded historian appeared in Autumn 2001,
courtesy of Adrien Treuille. The assignment has otherwise been tweaked
over the years, most recently last quarter by Matt Cary.