CSE 326, Winter 2007: Project 3

Project 3: Shake-n-Bacon
Data Mining with Trees and Tables

Important Deadlines

Project released	Mon, Feb 12
Partner selection	Fri, Feb 16	Notify instructor by email by 11:59pm using this link
Code Due: code for requirements Va, Vb and Vc (no writeup, no code printout)	Mon, Feb 26	Electronic turnin by 11:59pm
Benchmarking, Writeup, Above & Beyond Code Due: Submit readme file and Above & Beyond code	Thurs, March 1	Electronic turnin by 11:59pm
Paper turnin of Writeup and Benchmarking	Fri, March 2	Printout of writeup and benchmarking due at beginning of Lecture

IIIa. Word Stemming
IIIb. Signature Generation
IIIc. Document Correlation

IV. Partners
V. Requirements

Va. StringCounter implementations
Vb. WordCount
Vc. Document Correlator
Vd. Benchmarks
Ve. Readme (Very Long. Look at this before starting)

VI. Files and Sample Code
VII. Turnin
VIII. Grading Breakdown
IX. Above and Beyond Projects
X. Interesting Tidbits
XI. Credits

I. Introduction

You have just been approached by a world famous UW history professor. He would like you to settle a centuries-old debate on who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside of your area of expertise. "Oh, no," chuckles the historian, stroking his snowy white beard. "I need a Computer Scientist!"

II. Learning Objectives

For this assignment, you will:

Be introduced to StringCounter, a variant of the DictionaryADT, and understand the strengths of various StringCounter implementations
Gain proficiency with Splay and AVL trees
Become comfortable with implementing a HashTable
Understand some of the complexities of word stemming
Implement sorting algorithms
Learn how to benchmark your code
Develop a means to test your code (writing unit tests)

III. Word Frequency Analysis

Authors tend to use some words more often than others. For example, Shakespeare used "thou" more often than Bacon. The professor believes that a "signature" can be found for each author, based on frequencies of words found in the author's works, and that this signature should be consistent across the works of a particular author but vary greatly between authors. He wants you to come up with a way of quantifying the difference between two written works, and to use your technique on several of Shakespeare's and Bacon's works to settle the ancient debate!

The professor has provided you with copies of Shakespeare's (Hamlet) and Bacon's writing (The New Atlantis), which he has painstakingly typed by hand, from his antique, leather-bound first-editions. Being good scientists, however, you quickly realize that it is impossible to draw strong conclusions based on so little data, and asking him to type two more books is out of the question! Thus, you should download and analyze several more works, as many works as you feel is necessary to support your conclusion.

IIIa. Word Stemming

When dealing with document correlations, it is often desirable to work only with the roots of words. That way, "sleeps", "sleeping", and "sleep" are all considered to be the same word. This process is called word stemming, and is used in most real-world search engines. For this assignment, you only need to follow two simple rules:

Convert all the words to lowercase ("An" and "an" are the same word)
Remove all punctuation ("end." and "end" are the same word)

The supplied class FileWordReader includes code to do this processing for you.

Word stemming is a fairly complex topic. What these rules do is not so much word stemming as input normalization; you do not try to undo conjugations or other morphology. Fancier word stemming such as removing 's' from the end of a word can lead to erroneous results (such as "bu" from "bus") and require special logic. Even our simple rules cause problems; for instance, "it's" and "its" are now the same word. Implementing a better stemming algorithm (like Porter Stemming) is above and beyond work.

IIIb. Signature Generation

A fundamental part of your work lies in computing the "signature" of a document. The professor has provided you with a sample WordCount program that reads in a document and counts the number of times that a stemmed word appears, assuming that the document's words are already stemmed. The output of this program looks like this:

970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 buffalo
...

where the number in column 1 is the frequency that the corresponding string in column 2 occurs in the text. Note that the WordCount program sorts its output primarily in decreasing order by frequency count, secondarily by alphabetical order. The ordering would be identical if it had sorted by frequency fraction first (i.e. frequency_count/num_total_words).

IIIc. Document Correlation

Document correlation is a reasonably large area of study. Perhaps its most visible application is in search engines which rank the correlation of webpages to a set of keywords that you provide. One model often used for correlating documents is Latent Semantic Indexing (LSI) where a collection of documents is considered together (rather than independently) and a word's usefulness is determined by how frequently it appears in the documents (for instance, "the" isn't very useful because it appears in most documents).

We will not be doing LSI (it is, however, an extra credit option). We will do a simpler document comparison:

Calculate word counts for the two documents and normalize the frequencies so that they can be meaningfully compared between different documents (hint: use frequency percentages or fractions.)
As in LSI, remove words whose relative frequencies are too high or too low to be useful to your study. A good starting point is to remove words with normalized frequencies above 0.01 (1%) and below 0.0001 (0.01%), but feel free to play around with these numbers.
For every word that occurs in both documents, take the difference between the normalized frequencies, square that difference, and add the result to a running sum.
The final value of this running sum will be your difference metric. This metric corresponds to the square of the Euclidean distance between the two vectors in the space of shared words in the document. Note that this metric assumes that words not appearing in both documents do not affect the correlation.

IV. Partners

You are encouraged (although not required) to work with a partner of your own choosing for this project. Partners do not need to be in the same quiz or lecture section. No more than two students total can be on a team together. You may divide the work however you wish, under three conditions:

Document each team member's effort in the README file
Work together and make sure you both understand your answers to the README questions below
Understand (at least) at a high level how your team member's code is structured and how it works.

Feel free to ask for partners over the message board. Also, remember to test your team's code as a whole to make sure that your portions work together properly! Do not attempt to merge your code on the project due date. You will very likely have problems. Also, be aware that except in extreme cases when you notify us in advance of the deadline, all team members will receive the same grade for the project.

You should email us, using this link by Friday, Feb 16, at 11:59 pm to indicate who you will be working with for this project. Please send an email even if you will be working alone. Please format your email like this:

  Names: Jane Smith and John Doe
  UW Usernames: jsmith and jdoe
  Submitter: jsmith

where submitter is the account name of the person who will be doing the electronic submission. This should be the same person for both the code checkin and the writeup checkin.

V. Requirements

There are five steps in this project.

Write three StringCounter dictionary implementations (AVL, Splay, Hash) and unit tests for each implementation
Modify WordCount to be able use your StringCounter implementations, and to select the implementation at runtime
Write a document correlator that will print a difference score between two documents
Benchmark your data structures and correlator
Analyze and write up the results of your performance tests

The analysis and writeup will be significantly longer in this project. Be sure to allocate time for it. It is worth 1/3 of your grade, and you will not be able to do it in an hour or two.

Va. StringCounter Implementations

For this assignment, you must implement three data structures:

AVL tree
Splay tree (recursive bottom up as discussed in class)
Hash table

All three of these data structures must implement the StringCounter interface, which is a specialized version of DictionaryADT. You do not need to implement remove in any of these structures (doing so is Above and Beyond). You can implement any hash tables discussed in class; the only restriction is that it should not restrict the size of the input domain or the number of inputs (i.e. your hash table must grow). As with the last project, you are free to use code and skeleton code from the text book (but not the text book's web site) in implementing your structures as long as you implement the StringCoutner interface. Do not use code from the web or other sources.

The history professor has provided an unbalanced BST for you; feel free to use, modify, extend, or discard it. However, if you decide to discard it, you must create your own BST class and extend (via inheritance) your AVLTree and SplayTree classes off of it.

Also each of your data structures must include a unit test. You should write this as a static main function for your data structure. The idea is to include the code that you used for testing each data structure, so that you're confident each class works correctly. See the main function in BST.java for an example.

Vb. WordCount

The WordCount program will read in a text file and tally up all the words that appear in it. The WordCount program given to you currently uses an unbalanced binary search tree as its backing StringCounter implementation. You may base your WordCount program on it, or write your own. The commandline form for WordCount will be as follows:

java WordCount [ -b | -a | -s | -h ] [ -frequency | -num_unique ] <filename>

-b Use an Unbalanced BST to implement the StringCounter
-a Use an AVL Tree
-s Use a Splay Tree
-h Use a Hashtable
-frequency Print all the word/frequency pairs, ordered by frequency, and then by the words in lexicographic order
-num_unique Print the number of unique words in the document.

It is fine to require that one of -b, -a, -s, or -h must be specified for your program to run. Your program should not crash, however, if given an invalid command line.

Note that for the -frequency option, you need to produce words ordered primarily by frequency and secondarily by lexicographic (i.e., alphabetical) order. For example:

     43 these
     42 upon
     42 your
     41 after
     41 into
     40 said
     39 many
     39 more
     38 an

The sample WordCount program does this sorting using Insertion Sort. You must write a better sorting algorithm for this in your final WordCount program. Note that you may not use any built-in Java sorting functions.

Vc. Document Correlator

The Document Correlator should take in 2 documents and return the correlation (difference metric calculation) between them. You may want to use the WordCount class given to you to implement the backend of the Correlator, though doing so is not necessary. For the basic requirements, you must design an algorithm that does the comparison specified in section IIIc Document Correlation. This program should also take command line flags to specify which backing data structure to use. The commandline structure should be:

Usage: java Correlator [ -b | -a | -s | -h ] <filename1> <filename2>

-b Use an Unbalanced BST in the backend
-a Use an AVL Tree in the backend
-s Use an Splay Tree in the backend
-h Use a Hashtable in the backend

Vd. Benchmarks

Since we are implementing so many StringCounter dictionaries in this project, it is natural to ask "which is faster." Benchmarking and profiling are really the only reliable ways to judge this since there are many many hidden assumptions in the way you write your code that will add unexpected constants to your program. Hopefully you will do some exploration in this assignment and prove to yourself that you really can't predict what will affect program runtime too much (go through and try to optimize away little things like how many assignments you do, how many if statements you execute, etc. and see how much or little this affects your program).

When generating (or reading) benchmarks, you must ask yourself the following questions:

What am I measuring? Speed is too vague. Does it mean full program runtime? What if my program waits for user input? Does it matter?
Why am I measuring this and why should anyone be interested in it? Full program runtime of an interactive user app where the users fall asleep while running the code isn't really interesting data.
What methodology will I use to measure my program? Does it actually measure what I want?
What are the sources of error? Is the error big enough to matter? Are my results still reliable?

This set of questions actually applies to any analysis.

You are required to design benchmarks that measure the attributes listed below. You may also include any other data that you feel necessary to draw conclusions from the benchmarks in your analysis.

How fast is java WordCount -frequency compared to count.sh and count.pl?
How much difference in speed do your different StringCounters make in the correlator and/or the wordcount?

There are a few tools available to you for benchmarking. The simplest two are:

The Unix time command.
System.currentTimeMillis() (get the time in 2 different stops in your program and subtract to get running time)

Both methods have their strengths and weaknesses (for instance, time must measure your process creation times, and JVM startup times). These strengths and weaknesses will exhibit themselves differently depending on where and how these tools are used. In your analysis, you will need to cite the known sources for errors in your benchmarks and justify why they don't matter for your measurements, or somehow create a correction for your measurement. Essentially, you must convince us that your benchmark is measuring something that makes sense and that your analysis can be based off the collected data.

For example, to time count.sh or count.pl, you can do the following:

 time ./count.sh your-file

The syntax is the same for count.pl. Use the User time value that time returns.

Ve. README

Your README file needs to answer the following questions:

Who is in your group?
Acknowledgment of any assistance you received from anyone or anywhere but your team members, the 326 staff, or the Weiss book.
How long do we think the project will take? (answer before starting)
How long did the project take?
Which sorting algorithm did you implement for WordCount? Why? Under what circumstances might other sorting algorithms work better?
Which data structure do you expect will be the fastest? (answer before starting)
Which data structure is the fastest? Why were you right or wrong?
In general, which StringCounter dictionary implementation was "better": trees or hash tables? Note that you will need to define "better" (ease of coding, ease of debugging, memory usage, disk access patterns, runtime for average input, runtime for all input, etc).
Are there cases in which a particular data structure performs really well or badly in the correlator? Enumerate the cases for each data structure.
Give a one to two paragraph explanation of whether or not you think Bacon wrote Shakespeare's plays based on the data you collected. No fancy statistical analysis here (formal analysis comes later); keep it fun and simple.
Give a detailed description of all Above and Beyond projects which you implemented. What did you find most challenging about the projects you implemented? What did you find most interesting?
Writeup your benchmarks (this is long). Your mission is to convince us that your benchmark makes sense and that we should be interested in it if we are trying to ascertain which data structure is better suited for your input. You will need to answer at least the following (all formal analysis should answer something similar):
- What are you measuring?
- What is the definition of "better" given your measurement?
- Why is the measurement interesting in determining which is the superior algorithm for this project?
- What was your method of benchmarking?
- What were the sources of errors?
- What were your results?
- How did you interpret your results?
- What were your conclusions?
- Are there any interesting directions for future study?
You may attach this in a separate, non-plain-text file.
What literary character does Dave/Ruth most remind you of? Ethan/Jonah? David?
What did you enjoy about this assignment? What did you hate? Could we have done anything better?

VI. Files and Sample Code

Sample texts are provided in the course directory, which is located at /cse/courses/cse326/07wi/projects/project3/. You can also get texts of your own at Project Gutenburg, which has thousands of books as plain text files! Their mission is to provide electronic versions of many popular public domain texts. Check it out! Note that these files contain some header text that is not part of the actual play or book. For accurate results you should remove everything that is not part of the work. Extra text occurs at the beginning and end or each file, and sometimes at the beginning of each act of a play. Try running your word-counting program on the King James Bible. (Guess which word comes up more frequently in the Bible: "he" or "she?"... and by a factor of what?). Also, if you have any special requests for texts or other cool files you'd like to have added to the test files, email the course staff.

In addition, your history professor has provided some code which he wrote (these days, everybody knows how to program!). You may use it if you wish, although your code must follow the provided StringCounter interface:

StringCounter - Specification of an interface for a StringCounter. Your classes must implement this interface. (Note that StringCounter is a dictionary that is specialized for this particular task, so it isn't as generalized as some of the ADTs we've seen in the past. This is primarily for performance reasons.)
BST - Specification and implementation of an unbalanced binary search tree class. Use of the provided BST implementation is optional (you may choose to implement your own), but your AVL and Splay tree classes must inherit from your BST implementation.
BST.BSTNode- Specification and implementation of a binary search tree node (used in the BST class).
FileWordReader - A class that reads in a file and does simple word stemming.
WordCount - A simple program that reads words from a FileWordReader and tallies their frequency in a StringCounter.
count.sh - A Unix shell script to compute word counts. Usage: ./count.sh your-file
count.pl - Similar to count.sh, except in Perl.

VII. Turnin

Code Turn-in

For the code turn-in, you are expected to turn in all your code (Data structures, WordCount, Correlator, your unit tests, and any other code of interest). You do not have to include any Above and Beyond code at this time. Note that the code you submit for the Code turn-in should be functional, fully-tested code.

Writeup and Benchmarking turn-in

For the final checkin, include all of the above, your readme, your benchmark analysis, and any extra credit stuff you implement. You may turn in an MS-Word document, a PDF file, or some format other than a plain-text file, especially for the analysis part. If you're working with a partner, the same person should submit for the in-progress checkin and the final checkin.

Printout

After the final checkin, you will need to bring a printed copy of your readme, your benchmark analysis, and any extra credit you implemented to class. Note that you do not need to print out your code. Make sure the names and quiz sections of you and your partner are labeled clearly at the top of each file you print.

VIII. Grading Breakdown

Each part of the assignment will be worth (approximately) the following percentages. Please budget your time appropriately.

35% Program correctness (including boundary cases) and error-free compilation

20% Architecture/design of program. Style, commenting, layout. Other forms of program documentation.

35% README and answers to writeup questions, including benchmarking

10% Quality and comprehensiveness of turned-in unit tests

IX. Going Above and Beyond

Algorithm Design Techniques -
If you wrote your tree algorithms iteratively, re-implement them to be recursive (or vice versa), and answer the following questions in your README: Which algorithm design technique did you find easier to code? Which was more elegant? Had a faster runtime? How would you define "better", and which design technique do you think is "better" in this program? Can you think of situations where one design technique is not applicable?
Alternative Trees -
Implement the program with Red-Black trees (see textbook). The advantage of Red-Black trees is that you can write non-recursive insertion/deletion algorithms (the trade-off is that Red-Black trees have weaker balancing condition, though they do guarantee O(log(n)) depth). Don't cheat; write non-recursive algorithms here. In your README, comment on which tree implementation was easiest to write/debug, which was the fastest, and, if you needed to write a tree for general use (eg, a tree to be used by all the 326 students for all their projects), which would it be: an unbalanced BST, an AVL tree, a splay tree, or a red-black tree? Why?
Keeping Performance Information -
Add code to your program so that you can track the number of comparisons and the number of rotations performed by your tree. For this project, you will need to have implemented both a splay tree and an AVL tree. Predict how the two trees would compare. How did they actually compare? Were you surprised?
Alternative Hashing Strategies -
If you wrote a closed-hashing table, implement linear, quadratic, and one other probing strategy (you may make up your own, if you wish). The user should be able to select their probing strategy with command line arguments. Does one probing strategy always work better/worse than the others? Why do you think this is the case? Are there types of input for which one probing strategy works better than another? Which has a greater impact on your hash table's performance: the hash function, or the probing strategy? If you wrote an open-hashing table, implement a secondary dictionary instead of a linked list (perhaps you can reuse your tree implementation?). In your README, answer the following questions: does a secondary dictionary increase or decrease the runtime for your hash table for all inputs? On some inputs? How difficult was it to implement a secondary dictionary?
Data Locality -
Add code to your binary search tree which keeps track of the average depth of a node in the tree over the course of a run. Compare the average depths of some very common and some very uncommon words for unbalanced binary search trees, AVL trees, and splay trees over the course of parsing a file.
Profiling -
Profile your program using hprof A profiler is a tool which enables the programmer to obtain detailed information about how their program performs, such as the number of times a function is called, or how much of the program's runtime was spent in a particular function call. Compare two tree or two hash table implementations using a profiler. Your README should include a paragraph with the following information:
- Your expected performance bottlenecks
- How your expectations differed (or were the same) as the profiler's results.
- Did you find anything unusual/unexpected?
- What is the biggest bottleneck for your program, and what you think may be the cause.
Deletion -
Currently, the StringCounter interface which we have provided does not support deletion of elements. Add deletion to the interface and to all data structures that you've written which implement this interface. We'll also need to test your code, so extend your unit tests for your interfaces to read input from the command line that tests deletion. Document this interface in your README. It's not enough to use static tests in your code; we should be able to run new tests on your dictionaries without changing your source code. What should this interface be? You decide! Easy to use and expressive dictionary test interfaces will receive more extra credit.
Algorithmic Analysis -
Implement selection sort, an optimal sorting algorithm, and a third algorithm that does not have O(n²) running time -- examples are another optimal sorting algorithm, radix sort, or bucket sort. These algorithms should have a nice command-line interface for timing tests. In your README, include a short paragraph comparing the three algorithms, including a table or graphs. For extended extra credit, vary the distribution of the input so that the relationships between the running times of your various sorting algorithms change - this will especially make a difference if bucket sort is implemented. Your README should describe what distributions you used, and why they made a difference.
Introspective Sort -
Introspective sort is an unstable QuickSort variant which switches to HeapSort for inputs which would result in a O(n²) for normal QuickSort. Thus, it has an average-case and a worst-case runtime of O( n log₂ n ), but generally runs faster than HeapSort even in the worst case. Implement IntroSort, and give a sample input which would result in a quadratic runtime for normal QuickSort (using a median-of-3 partitioning scheme).
Iterators -
Add iterators to the StringCounter interface and implement them for the classes which implement StringCounter. In your README, comment on whether you think the added complexity of writing an iterator outweighs the simplification of your algorithms, and any difficulties you found while writing your iterators.
Visualization -
We have provided you with a primitive method for printing out trees. Make a full-blown tree visualization tool to better test and debug tree code. This option is worth two "Above and Beyond" projects.
Word Stemming -
Word stemming is a process in which:
- Common word suffixes like "ing", "ed", and "ly" are removed,
- Plural words are made singular,
- ... well, you get the picture. Other simplifications include removing prefixes and changing verbs to nouns if possible.
So, a word-stemming word-frequency-counter would count the word "buffalo" twice in the following sentence: "The bald buffalo charged at the herd of shaggy buffaloes".
Note that simply removing certain letters or strings at the end of words will not work: "Swiss" is not the plural of "Swis", nor is "bed" the past tense of "b". Simple word-stemming algorithms can be found on the internet, so your best reference will probably be a search engine like Google. Please only use the web as an algorithm reference; do not copy code directly.
As with document correlation, word stemming is another fairly complex topic. A common algorithm for doing word stemming is the Porter Stemming Algorithm. Implementing this algorithm is Above and Beyond (they have source code posted. If you try to do this project, please try to implement the algorithm from scratch only referring to their source if absolutely necessary. If you end up looking at their source, be sure to cite it.). Stemming algorithms of interest include the Porter Stemming Algorithm (a complicated but very widely-used 5-step suffix-removal algorithm) and the Paice/Husk Stemming Algorithm (a simpler iterative suffix-remover). This option is worth two "Above and Beyond" projects.
Word co-occurance -
A phenomenon even more interesting than word frequency is word co-occurrence. Create a new WordCount that counts the frequency of pairs of words. Your code should insert as a pair any two words which appear within k words of each other, where k is a command line argument (try 10 for test runs). How do BST, AVL, and splay trees compare now? This option is worth two "Above and Beyond" projects.
Latent Semantic Analysis -
The underlying theory behind word co-occurrence is what is known as Latent Semantic Analysis. Check out the LSA website at Colorado University for more information, and modify the word_count program to find possible polysemies or synonymies. This option is worth three "Above and Beyond" projects.
Go crazy! -
Have a cool idea that's not on this list? Then go for it! If you want to go drastically beyond the basic project specifications, check with Ruth or the TAs before you start. Of course, your code should still meet the basic requirements.

X. Interesting Tidbits

Word frequency analysis plays an important role in Cryptanalysis, the science of breaking secretly encoded messages. The first mention of using the frequency distribution of a language to break codes was in a 14-volume Arabic encyclopedia written by al-Qalqashandi in 1412. The idea is attributed to Ibn al-Duraihim, the brilliant 13th century Arab Cryptanalyst. If you are interested in cryptography, be sure to check out The Code Book by Simon Singh. This is great introduction to cryptography and cryptanalysis.
Think computer science is all fun and games? The Codebreakers, by David Kahn, is a fascinating look at many of the problems solved by crypotanalysis, from breaking WWII secret messages to the very question of who wrote Shakespeare's works!

XI. Credits

This assignment is starting to become a fixture in 326. The first word counter appeared during winter 2000, when 326 was taught by the famous Steve Wolfman. The snowy-bearded historian appeared in Autumn 2001, courtesy of Adrien Treuille. This assignment has been otherwise tweaked over the years by Matt Cary, Raj Rao, and Ashish Sabharwal. And this quarter, it has been adjusted again.

35%	Program correctness (including boundary cases) and error-free compilation
20%	Architecture/design of program. Style, commenting, layout. Other forms of program documentation.
35%	README and answers to writeup questions, including benchmarking
10%	Quality and comprehensiveness of turned-in unit tests

Project 3: Shake-n-Bacon Data Mining with Trees and Tables