Homework 4 - History Sleuths!
Due by start of class (2:30 pm) on Wednesday, November 13
I. Introduction
You have just been approached by a world famous UW history professor. He
would like you to settle a centuries-old debate
on who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon?
You protest that this question is surely outside of your area expertise. "Oh,
no," chuckled the historian, stroking his snowy white beard. "I need a Computer
Scientist!"
II. Word Frequency Analysis
The professor suspects that some authors use particular words more often
than others. He hopes that if we study the frequency with which authors use
words, we may be able to come up with a word usage "signature" for that author.
This signature should be quite consistant across a particular author's works,
but vary greatly between authors.
The professor wants you to take two works of Shakespeare (Hamlet, and All's Well That Ends Well),
and two of Bacon (The New Atlantis,
and The Essays), and count the number
of times that each word occurs in each. The output should be in the following
format:
970 the
708 and
666 of
632 to
521 I
466 a
444 my
391 in
383 you
...where the first number is the frequency that the second string occurs
in the text. Strangely enough, the professor wants you to hand in this project
using the turnin program to the CSE 326 database. He would like a copy of
your source code, and a 1-2 paragraph answer to his question: based on the
data you have accumulated, did Bacon write Shakespeare's plays?
III. The Nitty Gritty
- You may work on this project in teams of up to three.
- You must implement the word counting algorithm as follows:
- Load the strings as keys into several types of binary search
trees; the values are the frequency that this word has occured so
far. You will need to implement an unbalanced binary search tree, an
AVL tree, and a hash table. (For extra credit, you may implement a splay
tree as well.)
- Make your balanced binary tree classes inherit from
your unbalanced binary tree class.
- Your frequency count should be case sensitive. That is, "Shakespeare"
and "sHaKeSpEaRe" should be counted seperately.
- You should only skip over the following characters:
the space character (' '), the tab character ('\t'), the
newline character ('\n'), the carriage-return character
('\r'), and the form-feed character ('\f'). All other
characters should be considered part of words. Hint: you may find certain
library functions useful for deciding what is "valid" input.
- Your implementation may use a sorting algorithm of your choice.
- Your main method should be in a class called WordCount.
The user will specify the tree type your program will use with command
line arguments:
- -b - Count frequencies using an unbalanced binary
search tree
- -a - Count frequencies using an AVL tree
- -h - Count frequencies using a hash table
- -s - Count frequencies using a splay tree (if implemented)
- WordCount will read its input from the standard input
(System.in in Java terminology), and print to the standard output
(System.out). Using the magic of redirection in Unix, you can
test your program like this:
- java WordCount -a < hamlet.txt > hamlet-freq.txt
Count the words in hamlet.txt with an AVL tree, and place
the result in hamlet-freq.txt
- java WordCount -b < the-essays.txt
Count the words in the-essays.txt with an unbalanced binary
search tree, and print the output onto the screen.
On this project, your code will be expressedly analyzed for its efficiency.
Since you are using large inputs, constants matter. That is, the
difference between a 2n2 and a 4n2 algorithm will
be noticeable. After your code is working, you should analyze each
algorithm to make sure it is not doing unnecessary repetitive work, and
then eliminate this excess. As always, correctness will count more than efficiency,
so make sure your algorithms are correct before worrying about efficiency.
You will find that the text files we give you contain some extraneous
information (e.g. copyright stuff). Don't waste your time trying to delete
this, as it will have an negligable effect on the output.
Have fun, and uncover deep historical truths!
IV. Files and Sample Code
- The files we have provided for you:
For testing purposes, more texts are available on the instructional
machines, at /cse/courses/cse326/course_archives/1999-2000/cse326/2000wi/texts.
- Sample Java code for all of these data structures appears in the
Weiss book. The source code in the book may be also found online.
This should help considerably in your coding; you will merely need to modify
the structures to handle both keys and values, as well as increment counts
efficiently
- The files you should turn in (using the Unix turnin
tool. This project is called "project4" in the turnin system):
- All of your java code
- README - an explanation of your design decisions,
the issues that came up while writing this project, as well as
a one to two paragraph explanation of whether or not you think that
Bacon wrote Shakespeare's plays based on the data you collected.
No fancy statistical analyses here. Keep it fun and simple.
V. Extra Credit!
VI. Interesting Tidbits
- Word frequency analysis plays an important role in Cryptanalysis,
the science of breaking secretly encoded messages. The first mention of
using the frequency distribution of a language to to break codes was
in a 14-volume Arabic encyclopedia written by al-Qalqashandi in 1412.
The idea is attributed Ibn al-Duraihim, the brilliant 13th century Arab
Cryptanalyst. If you are interested in cryptography, be sure to check
out The Code Book
by Simon Singh. This is great introduction to cryptography and cryptanalysis.
- Wondering where we got these text files? From the Project Gutenburg site, which has thousands
of books as plain text files! Their mission is to provide electronic versions
of many popular public domain texts. Check it out! Try running your word
counting program on the King
James Bible. (Guess which word comes up more frequently in the Bible
"he" or "she?"... and by a factor of what?)
- Think computer science is all fun and games?
The Codebreakers, by David Kahn, is a fascinating look at many of
the problems solved by crypotanalysis, from breaking WWII secret messages
to the very question of who wrote Shakespeare's works!
VII. Credits
Inspiration for this assignment comes from the Winter 2002 offering of CSE 326.