Project 2 - Version 2.0
History Sleuths!
Due Friday November 2nd
O. Changes Between Version 1.0 and Version 2.0
We have relaxed the requirements for Project 2. You should have received
an
e-mail about the changes. You can still read the original version of
the project here. The changes are as follows:
-
In the output file, the frequency comes first. This is the opposite of
the first project 2.
-
The output file should be sorted lexographically instead of by frequency,
-
To see your output sorted by frequency, use the sort command-line
utility, as described in section II.
If you have already invested a lot of time in the orginal version of project
2, don't panic! The additional step of outputting the file in frequency
order is being offered as extra credit.
I. Introduction
You have just been approached by a world famous UW history professor.
He would like you to settle a
centuries-old debate on who wrote Shakespeare's plays, Shakespeare
or Sir Francis Bacon?
You protest that this question is surely outside of your area expertise.
"Oh, no," chuckled the historian, stroking his snowy white beard. "I need
a Computer Scientist!"
II. Word Frequency Analysis
The professor suspects that some authors use particular words more often
than others. He hopes that if we study the frequency with which authors
use words, we may be able to come up with a word usage "signature" for
that author. This signature should be quite consistant across a particular
author's works, but vary greatly between authors.
The professor wants you to take two works of Shakespeare (Hamlet,
and All's Well That Ends Well), and two of Bacon
(The New Atlantis, and The
Essays), and count the number of times that each word occurs in each.
He would like you write a program that takes two commmand line arguments.
The first argument is the name of a text file to be parsed, and the second
is a text file that your program will output. The output file will be in
the following format:
1 "AS-IS".
1 "Defect"
1 "Pro-
1 "Project
1 "Right
2 "Small
1 "small
1 #1787]
1 &c.'
1 ''Tis
5 'A
...where the first string is the frequency that the second string occurs
in the text. Strangely enough, the professor wants you to hand in this
project using the turnin program to the CSE 326 database. He would like
a copy of your source code, an explanation of how to compile your program,
and a 1-2 paragraph answer to his question: based on the data you have
accumulated,did Bacon write Shakespeare's plays? In order to answer this
question, you may find it useful to use the sort utility as follows:
sort -nr output.txt > output-sorted.txt
This will take your output file (which we are calling output.txt
here) and create another file (called output-sorted.txt) which
will sorted by frequency! The new file will look like this:
970 the
708 and
666 of
632 to
521 I
466 a
444 my
391 in
383 you
358 Ham.
Note that you no longer have to do the sorting by frequency yourself,
though that is offered as extra credit.
III. The Nitty Gritty - IMPORTANT!!!
-
You may work on this project in teams of up to two.
-
You must implement the word counting algorithm in the following
three ways:
-
By loading the strings as keys into an unbalanced binary search tree. The
values are the frequency with which that word has occured so far. This
should be compilable into an executable called "bst".
-
As above, except using AVL trees, instead of using unbalance binary search
trees. This should be compilable into an executable called "avl".
-
As above, except using Splay trees. The executable here should be called
"splay."
-
Make your balanced binary tree classes inherit from you unbalanced
binary tree class. The functions to add and get elements should be delcared
virtual, and related polymorphically across trees.
-
Your frequency count should be case sensitive. That is, "Shakespeare" and
"sHaKeSpEaRe" should be counted seperately.
-
You should only skip over the following characters: the space character
(' '), the tab character ('\t'), the newline character ('\n'),
the carriage-return charactger ('\r'), and the form-feed character
('\f'). All other characters should be considered part of words.
-
You will find that the text files we give you contain some extraneous information
(e.g. copyright stuff). Don't waste your time trying to delete this, as
it will have an negligable effect on the output.
-
You do not need to implement delete functions, since these aren't needed
for this project.
-
You must implement these data structures from scratch. This time around
we won't give you and definitions to implement. So it's all up to you!!!
-
Have fun, and uncover deep historical truths!
IV. Files for Project 2
-
The files we have provided for you:
-
The files you should turn in (Bundle these files with the tar
utlity and turn them in with turnin.
This project is called "project2" in the turnin system.)
-
bst.cpp - the program using the unbalanced binary search tree
implementation
-
avl.cpp - the program using AVL trees
-
splay.cpp - the program using splay trees
-
readme.txt - an explanation of how to compile your program (note:
your program should compile with g++ on UNIX)
-
my-answer.txt - a one to two paragraph explanation of whether
or not you think that Bacon wrote Shakespeare's plays based on the data
you collected. No fancy statistical analyses here. Keep it fun and simple.
V. Extra Credit!
-
Have the output file be sorted by frequency instead of by word order, as
described in the original version of project
2. Here are some hints for doing
this.
-
Implement the program with Red-Black trees. The advantage of Red-Black
trees is that you can write non-recursive insertion/deletion algorithms
(the trade-off is that Red-Black trees have weaker balancing condition,
though they do guarantee O(log(n)) depth) . Don't cheat, write
non-recursive algorithms here.
-
Compare the running time of the trees on the words
file using either the time utility or (even better) the gprof
utility. The words file is an in-order list of words in the English language.
This is the worse case for the unbalanced binary trees. So you will find
the balanced trees way outperforming the unbalanced tree here.
VI. Interesting Tid Bits
-
Word frequency analysis plays an important role in Cryptanalysis, the science
of breaking secretly encoded messages. The first mention of using the frequency
distribution of a language to to break codes was in a 14-volume Arabic
encyclopedia written by al-Qalqashandi in 1412. The idea is attributed
Ibn al-Duraihim, the brilliant 13th century Arab Cryptanalyst. If you are
interested in cryptography, be sure to check out The
Code Book by Simon Singh. This is great introduction to cryptography
and cryptanalysis.
-
Wondering where we got these text files? From the Project
Gutenburg site, which has thousands of books as plain text files! Check
it out. Run your word couting program on the King James Bible! (Guess which
word comes up more frequently in the Bible "he" or "she?".. by a factor
of what?)