CSE 326 Assignment 3

CSE 326 - Winter 2003

Assignment 3

Shake-n-Bacon: Data Mining with Trees

Assigned: Friday, Jan. 31
Electronic Turn-In Due: Thursday, Feb. 13, 11:59pm
Hard Copy Due: Beginning of Class on Friday, Feb. 14

Introduction

You've just graduated from UW (congrats!) and landed a position as Data Structures specialist in the Nasdaq-listed company Gargle, Inc. Your first assignment, as a member of their data mining team, is to write a program which, when given an input text file (web pages, articles, books, etc.), tells you the frequencies of words occurring in the file.

Your boss hopes to apply the program to classifying web pages into various categories. "How boring!" you say to yourself. Fortunately, you hit upon a more interesting use for your program when your company is approached by a well-known UW literature professor who has been trying to settle the age-old controversy: Who really wrote Shakespeare's plays: Shakespeare or Sir Francis Bacon? The professor mentions that she's sought this answer from legions of students from UW's CSE 326 classes in the past few quarters, only to receive contradictory answers. So now she's going to the professionals (you!).

Spotting Bacon using Word Frequency Signatures

Here's your strategy: since authors tend to use some words more often than others, you decided to compute a word usage "signature" for each author, which is based on the frequencies of words found in the author's works. You hypothesize that this signature should be quite consistent across the works of a particular author, but vary greatly between authors. To compute an author's signature, you simple apply your word frequency counting program to a text file containing the author's works. The program outputs the alphabetical list of words occurring in the file, each followed by the number of its occurrences. For example:

        Adieu 1

        Admiringly 1

        After 2

        Against 4

        All 15

        Am 1

        Among 5

        An 5

        And 106

        Another 3

        Are 6

        As 29

        (etc.)

Given such signatures for a pair of works, you can simply compute a score using a "similarity" metric, which tells how close these two works are (as given by their signatures). This score is low if both works are of the same author, but high if the authors are different.

To answer the professor's question, you first obtain Bacon's word usage signature by running your program on The New Atlantis, one of Bacon's works. Then, you take a play attributed to Shakespeare, say, Hamlet, compute its signature, and presto! If the similarity score between the two signatures is low enough, you can conclude that the given work is, in fact, written by Bacon.

The Details

For unexplainable reasons, the professor is not interested only in your final answer. She wants you to implement certain data structures and to write two programs, WordCount and CompareWorks, which assist in achieving the above goals while relying on the data structures. These programs should behave as follows.

WordCount reads in a file, counts its words using a tree data structure, and prints out the resulting (word, count) pairs as shown in the example printout above. It accepts two arguments:

a flag indicating what kind of tree to use: -b for the generic (unbalanced) binary search tree, -a for AVL tree, and -s for Splay tree;
the name of the file whose words are to be counted.

The professor wants to invoke WordCount as follows:

        java WordCount -b /cse/courses/cse326/texts/the-new-atlantis.txt

        java WordCount -a /cse/courses/cse326/texts/hamlet.txt

CompareWorks reads in two files and computes and prints the similarity metric-based score. (The similarity metric is explained below.) It accepts three arguments:

a flag indicating the kind of tree to use, similarly to WordCount,
the name of the file with the work whose author is known, and
the name of the file with the work of an unknown (or uncertain) author.

CompareWorks can be invoked as follows:

        java CompareWorks -s /cse/courses/cse326/texts/the-new-atlantis.txt /cse/courses/cse326/texts/hamlet.txt

These programs rely on the following classes, which you need to implement. These classes are key to this assignment and will take most of your effort.

Datum keeps a word (as a string), its count, and a double-precision floating-point number (of which you might make some use later)

· BinarySearchTree implements a generic binary search tree, whose nodes store Datums. BinarySearchTree should provide the following methods:

o a constructor. It takes no arguments and produces an empty tree.

o void insert(String word) adds a word to the tree. If the word was already in the tree, its count should be incremented by 1; otherwise, the count should be set to 1.

o void delete(String word) deletes the given word from the tree (together with its count). If the word is not in the tree, this method should do nothing.

o Datum find(String word) finds the given word in the tree and returns the corresponding Datum. If the word is not in the tree, find should throw the WordNotFound exception (Java only).

o void printCounts() prints out the words stored in the tree. Each word is printed on a separate line and is followed by its count, as shown in the example printout above.

o other methods are specified later in this assignment.

· BSTWithRotations is a helper subclass of BinarySearchTree that implements all the rotations you will need for your AVL and Splay trees.

· AVLTRee is a subclass of BSTWithRotations that reimplements the operations specified above to ensure that the tree remains AVL-balanced at the end of each operation. Node height should be stored at each node and the balance computation should be implemented efficiently.

· SplayTree is a subclass of BSTWithRotations that performs splaying upon insert, delete, and find. Each node should store a pointer to its parent.

A couple of implementation notes are in order. When reading a file, assume that a "word" is a sequence of characters between 'a' and 'z' or 'A' and 'Z'. All other characters are "word separators" and should not be attributed to any word. Word counting should be case sensitive, i.e., "Shakespeare" and "sHaKeSpEaRe" should be counted seperately. You can rely on the compareTo operation for Java Strings or a similar library function for C++.

You will find that the text files we give you contain some extraneous information (e.g. copyright stuff). Don't waste your time trying to delete this, as it will have a negligible effect on the output.

Is This The Same Author?

So what is the magic that lets you decide whether two works are of the same author? First, it is the similarity metric-based score, which should be computed as follows:

1. Add the normalize method to the BinarySearchTree class. This method should compute the "relative" frequency of each word, which is obtained by dividing the word count by the total number of words in the text (and, therefore, in the tree). The relative frequencies should be stored in the Datum's field that you already allocated.

2. Build the word frequency tree (as if you were computing word counts) for the "reference" work, i.e., the work whose author is assumed known.

3. Normalize the tree.

4. "Prune" this tree by deleting all words whose relative frequencies are below 0.0001 (0.01%) or above 0.01 (1%). This will be your "reference" tree.

5. Build the word frequency tree for the second work in question, normalize it, but do not prune it. This will be your "test" tree.

6. Traverse the reference tree. For each word, compute the difference in frequencies of this word in the reference tree and in the test tree. If the word is not in the test tree, assume its frequency is zero. Add up the squares of all these frequency differences. (This similarity metric corresponds to the square of the Euclidean distance between two vectors.)

7. Print out this sum. This is the similarity score for the pair of works you have looked at.

Once you can compute the similarity metric, do some research using known works by the same author and known works of other authors. What values of the score are "low enough" to indicate that the author is the same? Pick an appropriate threshold for the score so that your program automatically prints both the score and a message that the authors are the same whenever the score is below your threshold.

Questions

Now that you have seen your program run, you need to answer the following three questions.

1. Based on the outputs of CompareWorks, did Bacon write one or more of Shakespeare's plays? Justify your answer by explaining your experimental methodology and listing the score(s) you have obtained. Evaluate your methodology by checking it on works known to be by the same author.

2. For each type of tree, obtain the running time of CompareWorks with the inputs The New Atlantis as the “reference” work and Hamlet as the “test” work. Which of the search trees yields the fastest running times and which yields the slowest?

3. As discussed in class, AVL trees have the strongest guarantees about run time: all operations are guaranteed to take no more than O(log(N)) time. Do either of the other two trees ever outperform the AVL tree for some pair of text files? Why or why not? If so, how is this possible?

Extra Credit!

A. Implement word stemming. Word stemming is a process in which:

common word suffixes like "ing", "ed", and "ly" are removed,
plural words are made singular,
... well, you get the picture. Other simplifications include removing prefixes and changing verbs to nouns if possible.

So, a word stemming word-frequency-counter would count the word "car" twice in the following sentence: "The red car hit the blue cars".

Note that simply removing certain letters or strings at the end of words will not work: "Swiss" is not the plural of "Swis", nor is "bed" the past tense of "b". Simple word-stemming algorithms can be found on the internet, so your best reference will probably be a search engine like Google.

If you implement stemming, make it an optional command line parameter "-stem".

B. Instead of the output of CompareWorks being just yes or no, you may want to do something like have the output include a confidence rating, i.e., yes (80%) meaning "I am 80% sure these were written by the same author." In addition, you may want to use a cleverer metric for judging similarity than the sum of squared differences in frequencies. How you go about this is largely up to you, and there are good solutions using techniques from areas ranging from statistical analysis to artificial intelligence. Before diving into anything, it may be a good idea to run your general idea by your TA. A variable amount of extra credit will be awarded depending on the creativity and thought that is put into your program and of course, how well it works.

Turn-In Instructions

This is a group assignment. You may work in a group of one to three people.

Turn in electronically all the relevant source files only. Follow the usual instructions.

Bring to class the printouts of the source files as well as written answers to the required questions above. Also, write down separately anything interesting about your implementation (especially if you did something for extra credit). Be sure to specify the members of your group and which of you did the electonic submission.

To ensure we get consistent-looking printouts, please use the following command to print from UNIX:

        enscript Pps329 -DDuplex:true -DTumble:true -2 -r -C --word-wrap G *.java

This assumes that you are sending your files to the printer ps329 and that you are in the directory with all your Java source files, referred to by *.java. Adjust this command as appropriate.

Reference Files

You will find a modest collection of text files in /cse/courses/cse326/texts, including:

hamlet.txt (Hamlet, by Shakespeare?)
alls-well-that-ends-well.txt (All's Well That Ends Well, by Shakespeare?)
the-new-atlantis.txt (The New Atlantis, by Bacon)
the-essays.txt (The Essays, by Bacon)
bible-king-james.txt (King James Bible, by ??)

Interesting Tidbits

Word frequency analysis plays an important role in Cryptanalysis, the science of breaking secretly encoded messages. The first mention of using the frequency distribution of a language to to break codes was in a 14-volume Arabic encyclopedia written by al-Qalqashandi in 1412. The idea is attributed Ibn al-Duraihim, the brilliant 13th century Arab Cryptanalyst. If you are interested in cryptography, be sure to check out The Code Book by Simon Singh. This is great introduction to cryptography and cryptanalysis.

Wondering where we got these text files? From the Project Gutenburg site, which has thousands of books as plain text files! Their mission is to provide electronic versions of many popular public domain texts. Check it out! Try running your word counting program on the King James Bible. (Guess which word comes up more frequently in the Bible "he" or "she?"... and by a factor of what?)

Think computer science is all fun and games? The Codebreakers, by David Kahn, is a fascinating look at many of the problems solved by crypotanalysis, from breaking WWII secret messages to the very question of who wrote Shakespeare's works!

Credits

This assignment is based on past incarnations developed by Steve Wolfman and Henry Kautz.

Text files are courtesy of the Project Gutenburg.

A Reminder

The literature professor demands that all the data structures you write be your own. She has seen lots of them on the web and in other places, but those simply don't cut it! Please be sure you do not reference library data structures or copy (even partially) any code that is available from a variety of sources. The goal of this class is for you to learn how to implement and use each data structure, not simply to copy and use.