CSE 326 Assignment 5

CSE 326 - Summer 2003 -- Homework #5

Shake-n-Bacon: Data Mining with Trees

Assigned: Wednesday, July 23
Partner Selection Due: Friday, July 25, 9:00 p.m. (email the instructor)
Electronic Turn-In Due: Tuesday, August 5, 11:00pm
Hard Copy Due: Beginning of Class on Wednesday, August 6

Introduction

You've just graduated from UW (congrats!) and landed a position as Data Structures specialist in the Nasdaq-listed company Gargle, Inc. Your first assignment, as a member of their data mining team, is to write a program which, when given an input text file (web pages, articles, books, etc.), tells you the frequencies of words occurring in the file.

Your boss hopes to apply the program to classifying web pages into various categories. "How boring!" you say to yourself. Fortunately, you hit upon a more interesting use for your program when your company is approached by a well-known UW literature professor who has been trying to settle the age-old controversy: Who really wrote Shakespeare's plays: Shakespeare or Sir Francis Bacon? The professor mentions that she's sought this answer from legions of students from UW's CSE 326 classes in the past few quarters, only to receive contradictory answers. So now she's going to the professionals (you!).

General Notes and Advice

1. If you wish, you may work in a group of up to two people. All the previous rules apply: work together, understand each other’s code, all team members will receive the same grade. However, you may not work with any of the same people that you have already worked with on previous assignments.

2. In the past, students have found similar instantiations of this assignment to be both fun yet also time-consuming. Start early!

3. Late Days: If anyone in your group has a late day remaining, then you may use it for this assignment. Note that, due to the end of the quarter, no late days will be possible for the next (and final) programming assignment.

4. We highly recommend that you carefully read this document twice through before starting any code.

Spotting Bacon using Word Frequency Signatures

Here's your strategy: since authors tend to use some words more often than others, you decided to compute a word usage "signature" for each author, which is based on the frequencies of words found in the author's works. You hypothesize that this signature should be quite consistent across the works of a particular author, but vary greatly between authors. To compute an author's signature, you simple apply your word frequency counting program to a text file containing the author's works. The program outputs the alphabetical list of words occurring in the file, each followed by the number of its occurrences. For example:

        Adieu 1

        Admiringly 1

        After 2

        Against 4

        All 15

        Am 1

        Among 5

        An 5

        And 106

        Another 3

        Are 6

        As 29

        (etc.)

Given such signatures for a pair of works, you can simply compute a score using a "similarity" metric, which tells how close these two works are (as given by their signatures). This score is low if both works are (likely) of the same author, but high if the authors are (likely) different.

To answer the professor's question, you first obtain Bacon's word usage signature by running your program on The New Atlantis, one of Bacon's works. Then, you take a play attributed to Shakespeare, say, Hamlet, compute its signature, and presto! If the similarity score between the two signatures is low enough, you can conclude that the given work is, in fact, likely written by Bacon.

The Details

For unexplainable reasons, the professor is not interested only in your final answer. She wants you to implement certain data structures and to write two programs, WordCount and CompareWorks, which assist in achieving the above goals while relying on the data structures. These programs should behave as follows.

WordCount reads in a file, counts its words using a tree data structure, and prints out the resulting (word, count) pairs as shown in the example printout above. It accepts two arguments:

a flag indicating what kind of tree to use: -b for the generic (unbalanced) binary search tree, -a for AVL tree, and -s for a Splay tree;
the name of the file whose words are to be counted.

The professor wants to invoke WordCount as follows:

   java –classpath . WordCount -b /cse/courses/cse326/texts/the-new-atlantis.txt

   java –classpath . WordCount -a /cse/courses/cse326/texts/hamlet.txt

CompareWorks reads in two files and computes and prints the similarity metric-based score. (The similarity metric is explained below.) It accepts three arguments:

a flag indicating the kind of tree to use, similarly to WordCount,
the name of the file with the work whose author is known, and
the name of the file with the work of an unknown (or uncertain) author.

CompareWorks can be invoked as follows:

   java –classpath . CompareWorks -s /cse/courses/cse326/texts/the-new-atlantis.txt

       /cse/courses/cse326/texts/hamlet.txt

These programs rely on the following classes, which you need to implement. These classes are key to this assignment and will take most of your effort.

Datum keeps a word (as a string), its count, and a double-precision floating-point number (of which you might make some use later)

· BinarySearchTree implements a generic binary search tree, whose nodes store Datums. BinarySearchTree should provide the following methods:

o a constructor. It takes no arguments and produces an empty tree.

o void insert(String word) adds a word to the tree. If the word was already in the tree, its count should be incremented by 1; otherwise, the count should be set to 1.

o void delete(String word) deletes the given word from the tree (together with its count). If the word is not in the tree, this method should do nothing. You will probably find it easiest to use lazy deletion.

o Datum find(String word) finds the given word in the tree and returns the corresponding Datum. If the word is not in the tree, find should throw the WordNotFound exception.

o void printCounts() prints out the words stored in the tree, in alphabetical order. Each word is printed on a separate line and is followed by its count, as shown in the example printout above.

o other methods are specified later in this assignment.

· BSTWithRotations is a helper subclass of BinarySearchTree that implements all the rotations you will need for your AVL and Splay trees.

· AVLTRee is a subclass of BSTWithRotations that reimplements the necessary operations specified above to ensure that the tree remains AVL-balanced at the end of each operation. Node height should be stored at each node and the balance computation should be implemented efficiently.

· SplayTree is a subclass of BSTWithRotations that performs splaying upon insert and find (if using lazy deletion, you can leave the node position unchanged on delete). Each node should store a pointer to its parent.

A couple of implementation notes are in order. When reading a file, assume that a "word" is a sequence of characters between 'a' and 'z' or 'A' and 'Z'. All other characters are "word separators" and should not be attributed to any word. Word counting should be case sensitive, i.e., "Shakespeare" and "sHaKeSpEaRe" should be counted separately. You can rely on the compareTo operation for Java Strings.

There is no starter code for this assignment, you get to do it all! For general help on parsing arguments and reading in Strings from a file, you may want to look at the starter code for Homework #3.

You will find that the text files we give you contain some extraneous information (e.g. copyright stuff). Don't waste your time trying to delete this, as it will have a negligible effect on the output.

Is This The Same Author?

So what is the magic that lets you decide whether two works are of the same author? First, it is the similarity metric-based score, which you should implement as follows:

1. Add the normalize() method to the BinarySearchTree class. This method should compute the "relative" frequency of each word, which is obtained by dividing the word count of each Datum by the total number of words in the text (and, therefore, in the tree). The relative frequencies should be stored in the Datum's double field that you already allocated.

2. Build the word frequency tree (as if you were computing word counts) for the "reference" work, i.e., the work whose author is assumed known.

3. Normalize the tree.

4. "Prune" this tree by deleting all words whose relative frequencies are below 0.0001 (0.01%) or above 0.01 (1%). This will be your "reference" tree.

5. Build the word frequency tree for the second work in question, normalize it, but do not prune it. This will be your "test" tree.

6. Traverse the reference tree. For each word, compute the difference in frequencies of this word in the reference tree and in the test tree. If the word is not in the test tree, assume its frequency is zero. Add up the squares of all these frequency differences. (This similarity metric corresponds to the square of the Euclidean distance between two vectors.)

7. Print out this sum. This is the similarity score for the pair of works you have looked at.

Once you can compute the similarity metric, do some research using known works by the same author and known works of other authors. What values of the score are "low enough" to suggest that the author is the same? Pick an appropriate threshold for the score so that your program automatically prints both the score and a message that the authors are likely the same whenever the score is below your threshold.

Going Above and Beyond

You can do none, any, or all of these options. They will be worth a nominal amount of extra credit.

A. Implement word stemming. Word stemming is a process in which:

common word suffixes like "ing", "ed", and "ly" are removed,
plural words are made singular,
... well, you get the picture. Other simplifications include removing prefixes and changing verbs to nouns if possible.

So, a word stemming word-frequency-counter would count the word "car" twice in the following sentence: "The red car hit the blue cars".

Note that simply removing certain letters or strings at the end of words will not work: "Swiss" is not the plural of "Swis", nor is "bed" the past tense of "b". Simple word-stemming algorithms can be found on the internet, so your best reference will probably be a search engine like Google.

If you implement stemming, make it an optional command line parameter "-stem".

B. Instead of the output of CompareWorks being just yes or no, you may want to do something like have the output include a confidence rating, i.e., yes (80%) meaning "I am 80% sure these were written by the same author." In addition, you may want to use a “cleverer” metric for judging similarity than the sum of squared differences in frequencies. How you go about this is largely up to you, and there are good solutions using techniques from areas ranging from statistical analysis to artificial intelligence. Before diving into anything, it may be a good idea to run your general idea by your TA. A variable amount of extra credit will be awarded depending on the creativity and thought that is put into your program and of course, how well it works.

C. Implement deletion for all of your trees and use this instead of lazy deletion. Note that this will require a fair amount of thought to correctly implement for AVL trees.

Turn-In Instructions

Turn in electronically all of your source files (use *.java) plus your usual README. Follow the usual instructions. This assignment is called “hw5” in the turn-in system. Your README should contain:

The names of your team members and an indication of who performed the electronic submission, plus your team name.
If you worked in a team, a breakdown of the work.
(Answer this question before you begin) How long do you think this project would take you to finish?
How much time did you actually spend on this project?
Acknowledgement of any assistance you received from anyone but your team members, the 326 staff, or the Weiss book.
An explanation of how to compile your program, if anything special is needed.
A brief description of how your project goes "above and beyond" the basic requirements, if it does.

You must also answer the following questions:

1. Based on the outputs of CompareWorks, did Bacon write one or more of Shakespeare's plays? Justify your answer by explaining your experimental methodology and listing the score(s) you have obtained. Evaluate your methodology by checking it on works known to be by the same author.

2. For each type of tree, obtain the running time of CompareWorks with the inputs The New Atlantis as the “reference” work and Hamlet as the “test” work. Which of the search trees yields the fastest running times and which yields the slowest? You may find the unix time command useful for this.

3. As discussed in class, AVL trees have the strongest guarantees about run time: all operations are guaranteed to take no more than O(log(N)) time. Do either of the other two trees ever outperform the AVL tree for some pair of text files? Why or why not? If so, how is this possible?

4. What did you enjoy about this assignment? What did you hate? Could we have done anything better?

Bring to class the printouts of your source files as well as your README. Please put your README file on top so it easy to find, and number the answers to the above questions.

Reference Files

You will find a modest collection of text files in /cse/courses/cse326/texts, including:

hamlet.txt (Hamlet, by Shakespeare?)
alls-well-that-ends-well.txt (All's Well That Ends Well, by Shakespeare?)
the-new-atlantis.txt (The New Atlantis, by Bacon)
the-essays.txt (The Essays, by Bacon)
bible-king-james.txt (King James Bible)

Interesting Tidbits

Word frequency analysis plays an important role in Cryptanalysis, the science of breaking secretly encoded messages. The first mention of using the frequency distribution of a language to to break codes was in a 14-volume Arabic encyclopedia written by al-Qalqashandi in 1412. The idea is attributed Ibn al-Duraihim, the brilliant 13th century Arab Cryptanalyst. If you are interested in cryptography, be sure to check out The Code Book by Simon Singh. This is great introduction to cryptography and cryptanalysis.

Think computer science is all fun and games? The Codebreakers, by David Kahn, is a fascinating look at many of the problems solved by crypotanalysis, from breaking WWII secret messages to the very question of who wrote Shakespeare's works!

Credits

This assignment is based on past incarnations developed by Steve Wolfman and Henry Kautz. Text files are courtesy of the Project Gutenburg.

A Reminder

The literature professor demands that all the data structures you write be your own. She has seen lots of them on the web and in other places, but those simply don't cut it! Please be sure you do not reference library data structures or copy (even partially) any code that is available from a variety of sources. The goal of this class is for you to learn how to implement and use each data structure, not simply to copy and use.

About code from the textbook: Ideally, we would like you to write the code on your own based on the descriptions/examples of the ADT operations provided in the lectures and the textbook. In this case, you could look up the code given in the textbook whenever you run into problems. Doing it this way will help you to learn the transition from descriptions of ADT operations to their actual implementation, which is one of the main goals of a Data Structures course. However, we will not take off any points if you use any of the book code as a template -- we will simply assume that you fully understand what the code is doing before using it (you may be asked to write some code in the midterm/final!). If you use any of the book code, make sure you explicitly state which parts are from the book in your README.