Homework 5 - History Sleuths!
Due at the midnight, after Monday February 25

I. Introduction

You have just been approached by a world famous UW history professor. He would like you to settle a centuries-old debate on who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside of your area expertise. "Oh, no," chuckled the historian, stroking his snowy white beard. "I need a Computer Scientist!"

II. Word Frequency Analysis

The professor suspects that some authors use particular words more often than others. He hopes that if we study the frequency with which authors use words, we may be able to come up with a word usage "signature" for that author. This signature should be quite consistant across a particular author's works, but vary greatly between authors.

The professor wants you to take two works of Shakespeare (Hamlet, and All's Well That Ends Well), and two of Bacon (The New Atlantis, and The Essays), and count the number of times that each word occurs in each. The output should be in the following format:

970 the
708 and
666 of
632 to
521 I
466 a
444 my
391 in
383 you

...where the first number is the frequency that the second string occurs in the text. Strangely enough, the professor wants you to hand in this project using the turnin program to the CSE 326 database. He would like a copy of your source code, your Makefile, and a 1-2 paragraph answer to his question: based on the data you have accumulated, did Bacon write Shakespeare's plays?

III. The Nitty Gritty

You may work on this project in teams of up to three.
You must implement the word counting algorithm as follows:

Load the strings as keys into several types of binary search trees; the values are the frequency that this word has occured so far. You will need to implement an unbalanced binary search tree (starter code will be provided), an AVL tree, and a splay tree.
Make your balanced binary tree classes inherit from your unbalanced binary tree class.
Your frequency count should be case sensitive. That is, "Shakespeare" and "sHaKeSpEaRe" should be counted seperately.
You should only skip over the following characters: the space character (' '), the tab character ('\t'), the newline character ('\n'), the carriage-return character ('\r'), and the form-feed character ('\f'). All other characters should be considered part of words. Hint: you may find certain library functions useful for deciding what is "valid" input.

You must implement the frequency sorting algorithm with HeapSort (hint: what data structure will you need to write to implement HeapSort?). Extra credit will be provided for implementing other sorting algorithms.
Your final program should consist of one executable, called word-count. The user will specify the tree type your program will use with command line arguments:
- -b - Count frequencies using an unbalanced binary search tree
- -a - Count frequencies using an AVL tree
- -s - Count frequencies using a splay tree
word-count will read its input from stdin (cin in C++ terminology), and print to stdout (cout in C++). Using the magic of redirection in Unix, you can test your program like this:
- ./word-count -a < hamlet.txt > hamlet-freq.txt
  Count the words in hamlet.txt with an AVL tree, and place the result in hamlet-freq.txt
- ./word-count -b < the-essays.txt
  Count the words in the-essays.txt with an unbalanced binary search tree, and print the output onto the screen.

On this project, your code will be expressedly analyzed for its efficiency. Since you are using large inputs, constants matter. That is, the difference between a 2n² and a 4n² algorithm will be noticeable. After your code is working, you should analyze each algorithm to make sure it is not doing unnecessary repetitive work, and then eliminate this excess. As always, correctness will count more than efficiency, so make sure your algorithms are correct before worrying about efficiency.

You will find that the text files we give you contain some extraneous information (e.g. copyright stuff). Don't waste your time trying to delete this, as it will have an negligable effect on the output.

Have fun, and uncover deep historical truths!

IV. Files and Sample Code

The files we have provided for you:

hamlet.txt (Hamlet, by Shakespeare)
alls-well-that-ends-well.txt (All's Well That Ends Well, by Shakespeare)
the-new-atlantis.txt (The New Atlantis, by Bacon)
the-essays.txt (The Essays, by Bacon)
words.txt (for extra credit, see below)

/cse/courses/cse326/course_archives/1999-2000/cse326/2000wi/texts

Your history professor has provided some code which he wrote (these days, everybody knows how to program!). You may use it if you wish:

BinarySearchTree.hh and BinarySearchTree.cc - Specification and implementation of an unbalanced binary search tree class. The AVL and splay tree classes must inherit from this class.
BSTNode.hh and BSTNode.cc - Specification and implementation of a binary search tree node struct (used in the BinarySearchTree class).
bst-test.cc - Test program, which illustrates how to insert elements into an unbalanced binary search tree and sort its data.
TemplateInst.cc - Instantiation file. Generates the object code for the templated BinarySearchTree and BSTNode classes.
Makefile - Sample Makefile which will compile the bst-test program.

The files you should turn in (using the Unix turnin tool. This project is called "project5" in the turnin system):

All of your C++ code
README - an explanation of your design decisions, the issues that came up while writing this project, as well as
Makefile - A Makefile which compiles your word-count program

V. Extra Credit!

Profiling - Profile all three trees using gprof. You should run your trees on at least two texts: words.txt and a text of your choice from Project Gutenberg. Your README should include a short paragraph with the following:
- Your expected performance bottlenecks
- How your expectations differed (or were the same) as the gprof results. Did you find anything unusual/unexpected?
- The biggest bottleneck for your program, and what you think may be the cause
Algorithmic Analysis - Implement SelectionSort and a sorting algorithm other than HeapSort. This second algorithm should have a big-O runtime equal to, or better than, HeapSort. Your README should include a short paragraph comparing the three algorithms (including plots). You should make a convincing argument regarding the comparative runtimes of SelectionSort, HeapSort, and your chosen sorting algorithm.
If you do this extra credit, add the following command line argument to word-count: -sort <name of sort>. Be sure to document in your readme the allowed sorts that one can use.
Word Stemming - Word stemming is a process in which:
- common word suffixes like "ing", "ed", and "ly" are removed,
- plural words are made singular
- ... well, you get the picture. Other simplifications include removing prefixes and changing verbs to nouns if possible.
So, a word-stemming word-frequency-counter would count the word "car" twice in the following sentance: "The red car hit the blue cars".
Note that simply removing certain letters or strings at the end of words will not work: "Swiss" is not the plural of "Swis", nor is "bed" the past tense of "b". Simple word-stemming algorithms can be found on the internet, so your best reference willprobably be a search engine like Google

VI. Interesting Tidbits

Word frequency analysis plays an important role in Cryptanalysis, the science of breaking secretly encoded messages. The first mention of using the frequency distribution of a language to to break codes was in a 14-volume Arabic encyclopedia written by al-Qalqashandi in 1412. The idea is attributed Ibn al-Duraihim, the brilliant 13th century Arab Cryptanalyst. If you are interested in cryptography, be sure to check out The Code Book by Simon Singh. This is great introduction to cryptography and cryptanalysis.
Wondering where we got these text files? From the Project Gutenburg site, which has thousands of books as plain text files! Their mission is to provide electronic versions of many popular public domain texts. Check it out! Try running your word counting program on the King James Bible. (Guess which word comes up more frequently in the Bible "he" or "she?"... and by a factor of what?)
Think computer science is all fun and games? The Codebreakers, by David Kahn, is a fascinating look at many of the problems solved by crypotanalysis, from breaking WWII secret messages to the very question of who wrote Shakespeare's works!

VII. Credits

Inspiration for this assignment comes from the Winter 2000 offering of CSE 326, and the Autumn 2001 CSE 326 TA, Adrien Treuille.