Homework 5 - History Sleuths!
Due at the midnight, after Monday February
25
I. Introduction
You have just been approached by a world famous UW history professor.
He would like you to settle a
centuries-old debate on who wrote Shakespeare's plays, Shakespeare
or Sir Francis Bacon?
You protest that this question is surely outside of your area expertise.
"Oh, no," chuckled the historian, stroking his snowy white beard. "I need
a Computer Scientist!"
II. Word Frequency Analysis
The professor suspects that some authors use particular words more often
than others. He hopes that if we study the frequency with which authors
use words, we may be able to come up with a word usage "signature" for
that author. This signature should be quite consistant across a particular
author's works, but vary greatly between authors.
The professor wants you to take two works of Shakespeare
(Hamlet, and
All's Well That
Ends Well), and two of Bacon
(The New Atlantis, and
The Essays), and count the number
of times that each word occurs in each. The output should be
in the following format:
970 the
708 and
666 of
632 to
521 I
466 a
444 my
391 in
383 you
...where the first number is the frequency that the second string occurs
in the text. Strangely enough, the professor wants you to hand in this
project using the turnin program to the CSE 326 database. He would like
a copy of your source code, your Makefile, and a 1-2 paragraph answer
to his question: based on the data you have
accumulated, did Bacon write Shakespeare's plays?
III. The Nitty Gritty
-
You may work on this project in teams of up to three.
-
You must implement the word counting algorithm as follows:
-
Load the strings as keys into several types of binary search
trees; the values are the frequency that this word has
occured so far. You will need to implement an unbalanced
binary search tree (starter code will be provided), an AVL tree, and a
splay tree.
-
Make your balanced binary tree classes inherit from your
unbalanced binary tree class.
-
Your frequency count should be case sensitive. That is, "Shakespeare"
and "sHaKeSpEaRe" should be counted seperately.
-
You should only skip over the following characters: the
space character (' '), the tab character ('\t'), the
newline character ('\n'), the carriage-return character
('\r'), and the form-feed character ('\f'). All
other characters should be considered part of words. Hint: you
may find certain library functions useful for deciding what is
"valid" input.
-
You must implement the frequency sorting algorithm with
HeapSort (hint: what data structure will you need to write to
implement HeapSort?). Extra credit
will be provided for implementing other sorting algorithms.
-
Your final program should consist of one executable, called
word-count. The user will specify the tree type your
program will use with command line arguments:
-
-b - Count frequencies using an unbalanced binary
search tree
-
-a - Count frequencies using an AVL tree
-
-s - Count frequencies using a splay tree
-
word-count will read its input from stdin
(cin in C++ terminology), and print to stdout
(cout in C++). Using the magic of redirection in
Unix, you can test your program like this:
-
./word-count -a < hamlet.txt > hamlet-freq.txt
Count the words in hamlet.txt with an AVL tree, and place
the result in hamlet-freq.txt
-
./word-count -b < the-essays.txt
Count the words in the-essays.txt with an unbalanced
binary search tree, and print the output onto the screen.
On this project, your code will be expressedly analyzed for its
efficiency. Since you are using large inputs, constants
matter. That is, the difference between a 2n2 and
a 4n2 algorithm will be noticeable. After
your code is working, you should analyze each algorithm to make sure
it is not doing unnecessary repetitive work, and then eliminate this excess.
As always, correctness will count more than efficiency, so make sure
your algorithms are correct before worrying about efficiency.
You will find that the text files we give you contain some
extraneous information (e.g. copyright stuff). Don't waste your
time trying to delete this, as it will have an negligable effect
on the output.
Have fun, and uncover deep historical truths!
IV. Files and Sample Code
-
The files we have provided for you:
For testing purposes, more texts are available on the instructional
machines, at
/cse/courses/cse326/course_archives/1999-2000/cse326/2000wi/texts.
-
Your history professor has provided some code which he wrote (these
days, everybody knows how to program!). You may use it
if you wish:
-
BinarySearchTree.hh
and
BinarySearchTree.cc
- Specification and implementation of an unbalanced binary
search tree class. The AVL and splay tree classes must
inherit from this class.
-
BSTNode.hh
and BSTNode.cc
- Specification and implementation of a binary search tree node
struct (used in the BinarySearchTree class).
-
bst-test.cc - Test
program, which illustrates how to insert elements into an
unbalanced binary search tree and sort its data.
-
TemplateInst.cc - Instantiation file. Generates
the object code for the templated BinarySearchTree and BSTNode
classes.
-
Makefile - Sample
Makefile which will compile the bst-test program.
-
The files you should turn in (using the Unix
turnin tool. This project is called "project5" in
the turnin system):
-
All of your C++ code
-
README - an explanation of your design decisions, the
issues that came up while writing this project, as well as
a one to two paragraph explanation of whether or not you think that
Bacon wrote Shakespeare's plays based on the data you collected.
No fancy statistical analyses here. Keep it fun and simple.
-
Makefile - A Makefile which compiles your
word-count program
V. Extra Credit!
-
Profiling - Profile all three trees using
gprof.
You should run your trees on at least two texts:
words.txt and a text of
your choice from Project
Gutenberg. Your README should include a short paragraph
with the following:
-
Your expected performance bottlenecks
-
How your expectations differed (or were the same) as the
gprof results. Did you find anything unusual/unexpected?
-
The biggest bottleneck for your program, and what you think
may be the cause
-
Algorithmic Analysis - Implement SelectionSort and a sorting
algorithm other than HeapSort. This second algorithm should
have a big-O runtime equal to, or better than, HeapSort. Your
README should include a short paragraph comparing the three
algorithms (including plots). You should make a convincing
argument regarding the comparative runtimes of SelectionSort,
HeapSort, and your chosen sorting algorithm.
If you do this extra credit, add the following command line argument
to word-count: -sort <name of sort>. Be sure to document in your
readme the allowed sorts that one can use.
- Word Stemming - Word stemming is a process in which:
- common word suffixes like "ing", "ed", and "ly" are removed,
- plural words are made singular
- ... well, you get the picture. Other simplifications include
removing prefixes and changing verbs to nouns if possible.
So, a word-stemming word-frequency-counter would count the word
"car" twice in the following sentance:
"The red car hit the blue cars".
Note that simply removing certain letters or strings
at the end of words will not work: "Swiss" is not the
plural of "Swis", nor is "bed" the past tense of "b".
Simple word-stemming algorithms can be found on the internet,
so your best reference willprobably be a search engine like
Google
VI. Interesting Tidbits
-
Word frequency analysis plays an important role in Cryptanalysis, the
science of breaking secretly encoded messages. The first mention of
using the frequency distribution of a language to to break codes was
in a 14-volume Arabic encyclopedia written by al-Qalqashandi in
1412. The idea is attributed Ibn al-Duraihim, the brilliant 13th
century Arab Cryptanalyst. If you are interested in cryptography, be
sure to check out The
Code Book by Simon Singh. This is great introduction to cryptography
and cryptanalysis.
-
Wondering where we got these text files? From the
Project Gutenburg site, which has thousands of books as plain
text files! Their mission is to provide electronic versions of many
popular public domain texts. Check it out! Try running your word counting
program on the King James
Bible. (Guess which word comes up more frequently in the Bible
"he" or "she?"... and by a factor of what?)
-
Think computer science is all fun and games?
The Codebreakers, by David Kahn, is a fascinating look at many of
the problems solved by crypotanalysis, from breaking WWII secret
messages to the very question of who wrote Shakespeare's works!
VII. Credits
Inspiration for this assignment comes from the
Winter 2000 offering
of CSE 326, and the Autumn 2001 CSE 326 TA, Adrien Treuille.