Homework 4

Homework 4 - History Sleuths!
Due by start of class (2:30 pm) on Wednesday, November 13

I. Introduction

You have just been approached by a world famous UW history professor. He would like you to settle a centuries-old debate on who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside of your area expertise. "Oh, no," chuckled the historian, stroking his snowy white beard. "I need a Computer Scientist!"

II. Word Frequency Analysis

The professor suspects that some authors use particular words more often than others. He hopes that if we study the frequency with which authors use words, we may be able to come up with a word usage "signature" for that author. This signature should be quite consistant across a particular author's works, but vary greatly between authors.

The professor wants you to take two works of Shakespeare (Hamlet, and All's Well That Ends Well), and two of Bacon (The New Atlantis, and The Essays), and count the number of times that each word occurs in each. The output should be in the following format:

970 the
708 and
666 of
632 to
521 I
466 a
444 my
391 in
383 you

...where the first number is the frequency that the second string occurs in the text. Strangely enough, the professor wants you to hand in this project using the turnin program to the CSE 326 database. He would like a copy of your source code, and a 1-2 paragraph answer to his question: based on the data you have accumulated, did Bacon write Shakespeare's plays?

III. The Nitty Gritty

You may work on this project in teams of up to three.
You must implement the word counting algorithm as follows:

Load the strings as keys into several types of binary search trees; the values are the frequency that this word has occured so far. You will need to implement an unbalanced binary search tree, an AVL tree, and a hash table. (For extra credit, you may implement a splay tree as well.)
Make your balanced binary tree classes inherit from your unbalanced binary tree class.
Your frequency count should be case sensitive. That is, "Shakespeare" and "sHaKeSpEaRe" should be counted seperately.
You should only skip over the following characters: the space character (' '), the tab character ('\t'), the newline character ('\n'), the carriage-return character ('\r'), and the form-feed character ('\f'). All other characters should be considered part of words. Hint: you may find certain library functions useful for deciding what is "valid" input.

Your implementation may use a sorting algorithm of your choice.
Your main method should be in a class called WordCount. The user will specify the tree type your program will use with command line arguments:
- -b - Count frequencies using an unbalanced binary search tree
- -a - Count frequencies using an AVL tree
- -h - Count frequencies using a hash table
- -s - Count frequencies using a splay tree (if implemented)
WordCount will read its input from the standard input (System.in in Java terminology), and print to the standard output (System.out). Using the magic of redirection in Unix, you can test your program like this:
- java WordCount -a < hamlet.txt > hamlet-freq.txt
  Count the words in hamlet.txt with an AVL tree, and place the result in hamlet-freq.txt
- java WordCount -b < the-essays.txt
  Count the words in the-essays.txt with an unbalanced binary search tree, and print the output onto the screen.

On this project, your code will be expressedly analyzed for its efficiency. Since you are using large inputs, constants matter. That is, the difference between a 2n² and a 4n² algorithm will be noticeable. After your code is working, you should analyze each algorithm to make sure it is not doing unnecessary repetitive work, and then eliminate this excess. As always, correctness will count more than efficiency, so make sure your algorithms are correct before worrying about efficiency.

You will find that the text files we give you contain some extraneous information (e.g. copyright stuff). Don't waste your time trying to delete this, as it will have an negligable effect on the output.

Have fun, and uncover deep historical truths!

IV. Files and Sample Code

The files we have provided for you:

hamlet.txt (Hamlet, by Shakespeare)
alls-well-that-ends-well.txt (All's Well That Ends Well, by Shakespeare)
the-new-atlantis.txt (The New Atlantis, by Bacon)
the-essays.txt (The Essays, by Bacon)
words.txt (for extra credit, see below)

/cse/courses/cse326/course_archives/1999-2000/cse326/2000wi/texts

Sample Java code for all of these data structures appears in the Weiss book. The source code in the book may be also found online. This should help considerably in your coding; you will merely need to modify the structures to handle both keys and values, as well as increment counts efficiently

The files you should turn in (using the Unix turnin tool. This project is called "project4" in the turnin system):

All of your java code
README - an explanation of your design decisions, the issues that came up while writing this project, as well as

V. Extra Credit!

Implement a splay tree and offer it as an additional option, as described above.

Implement word stemming. Word stemming is a process in which:
- common word suffixes like "ing", "ed", and "ly" are removed,
- plural words are made singular
- ... well, you get the picture. Other simplifications include removing prefixes and changing verbs to nouns if possible.
So, a word stemming word-frequency-counter would count the word "car" twice in the following sentance: "The red car hit the blue cars".
Note that simply removing certain letters or strings at the end of words will not work: "Swiss" is not the plural of "Swis", nor is "bed" the past tense of "b". Simple word-stemming algorithms can be found on the internet, so your best reference will probably be a search engine like Google.

VI. Interesting Tidbits

Word frequency analysis plays an important role in Cryptanalysis, the science of breaking secretly encoded messages. The first mention of using the frequency distribution of a language to to break codes was in a 14-volume Arabic encyclopedia written by al-Qalqashandi in 1412. The idea is attributed Ibn al-Duraihim, the brilliant 13th century Arab Cryptanalyst. If you are interested in cryptography, be sure to check out The Code Book by Simon Singh. This is great introduction to cryptography and cryptanalysis.
Wondering where we got these text files? From the Project Gutenburg site, which has thousands of books as plain text files! Their mission is to provide electronic versions of many popular public domain texts. Check it out! Try running your word counting program on the King James Bible. (Guess which word comes up more frequently in the Bible "he" or "she?"... and by a factor of what?)
Think computer science is all fun and games? The Codebreakers, by David Kahn, is a fascinating look at many of the problems solved by crypotanalysis, from breaking WWII secret messages to the very question of who wrote Shakespeare's works!

VII. Credits

Inspiration for this assignment comes from the Winter 2002 offering of CSE 326.