Last Updated: 2014 April 10, by Ruth Anderson
CSE 332 Spring 2014
CSE 332 Project 2: Shake-n-Bacon
Outline
Due Dates & Turn In
- Find partner: Use the Discussion Board to find a partner & complete this Catalyst Survey by Friday, April 18
- Phase A: Thursday April 24, 11:59PM
- Phase B: Monday May 12, 11:59PM
- Turn In: Course Dropbox, *ONE* submission per team
Working with a Partner
You are strongly encouraged, but not required, to work with a partner.
You can use the Discussion Board to find a partner,
and you should complete this Catalyst Survey by the above due date
if you wish to work as a team. No more than two students may form a team.
You may divide the work however you wish, but place the names of both partners at the top of all files
(You may attribute one partner as the primary author).
Team members will receive the same project grade unless there is an extreme circumstance (notify us before the deadline).
If you work on a team, you should:
- Turn in only *ONE* submission per team for each phase.
- Work together and make sure you both understand the answers to all write-up questions.
- Understand how your partner's code is structured and how it works.
- Test all of your code together to be sure it properly integrates.
Introduction to Word Frequency Analysis
A world-famous UW history professor wants you to settle a very old debate: who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon?
You protest that this question is surely outside your area of expertise. "Oh, no," chuckles the historian,
stroking his snowy white beard. "I need a Computer Scientist! Someone who can efficiently compare any two
documents in terms of their word frequency."
Authors tend to use some words more often than others. For example, Shakespeare used "thou" more often than Bacon.
The professor believes a "signature" can be found for each author, based on frequencies of words found in the author's works,
and that this signature should be consistent across the works of a particular author but vary greatly between authors.
The professor gave you copies of Shakespeare's writing (Hamlet) and Bacon's writing (The New Atlantis),
which he has painstakingly typed by hand from his antique, leather-bound first editions. Being good scientists, however,
you quickly realize that it is impossible to draw strong conclusions based on so little data, and asking him to type more books
is out of the question! Thus, for Phase B you should download and analyze several more works, as many as you feel are
necessary to support your conclusion. Project Gutenberg is a good place to look.
Learning Objectives
In this project, you will:
- Be introduced to various implementations of DataCounter, a variant of Dictionary
- Gain proficiency with heaps, AVL trees, and hashtables
- Implement two idioms related to writing generic and reusable code: function objects and iterators
- Implement sorting algorithms
- Gain experience with unit testing and the JUnit tool
- Gain preliminary experience with timing code and evaluating experimental data
The Assignment
This assignment consists of two phases.
Each phase includes implementation of data structures, JUnit tests, and programs manipulating those structures.
Phase B also involves collecting experimental data and completing write-up questions.
You will implement different data-counters and sorting routines to see different ways to solve the problem.
A large part of your work will involve using the WordCount program. WordCount should output the frequency
of each word in the document, starting with the most frequent words and resolving ties in alphabetical order as shown below:
970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 buffalo
...
Note that the actual printing code, along with many other parts, is provided. The printing format should exactly match what
is shown above (use the provided printing code). Do not make WordCount print any extra output such as the number of words,
the time it takes to run, etc. Your WordCount should work as specified when given correct input parameters as shown below.
Print an appropriate error message and terminate the program (System.exit) if an incorrect or incomplete set of parameters is given.
As provided, WordCount.java only processes a single parameter for the filename. You will need to modify it to process
all parameters. If you call WordCount from the command line, it should take parameters as follows (in Eclipse you will just add the 3 arguments listed below):
java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os | -k <number> ] <filename>
- Argument 1: DataCounter implementation.
-b for BinarySearchTree (provided),
-a for AVLTree (phase A),
-m for MoveToFrontList (phase A), or
-h for HashTable (phase B).
- Argument 2: Sorting routine.
-is for Insertion sort (provided),
-hs for Heap sort (phase A),
-os for Other sort (phase B), or
-k followed by a number for Top-k sort (phase B).
- Argument 3: Input file name.
For example,
Binary Search Tree & Insertion Sort: java WordCount -b -is hamlet.txt
AVL Tree & Heap Sort: java WordCount -a -hs hamlet.txt
Hash Table & Top-k Sort (k=5): java WordCount -h -k 5 hamlet.txt
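To tie these together, here is a minimal, hypothetical sketch of how WordCount.main might dispatch on the flags. The constructor signatures are assumptions based on the descriptions in this assignment (each DataCounter takes a Comparator function object, and HashTable presumably also takes a Hasher); check the provided stubs for the real interfaces.

public static void main(String[] args) {
    // Hypothetical sketch; constructor signatures are assumptions.
    if (args.length < 3 || (args[1].equals("-k") && args.length < 4)) {
        System.err.println("Usage: java WordCount [-b|-a|-m|-h] [-is|-hs|-os|-k <number>] <filename>");
        System.exit(1);
    }
    DataCounter<String> counter = null;
    if (args[0].equals("-b")) {
        counter = new BinarySearchTree<String>(new StringComparator());
    } else if (args[0].equals("-a")) {
        counter = new AVLTree<String>(new StringComparator());
    } else if (args[0].equals("-m")) {
        counter = new MoveToFrontList<String>(new StringComparator());
    } else if (args[0].equals("-h")) {
        counter = new HashTable(new StringComparator(), new StringHasher()); // assumed signature
    } else {
        System.err.println("Unknown DataCounter flag: " + args[0]);
        System.exit(1);
    }
    // -k consumes an extra argument, shifting the filename to args[3]:
    String filename = args[1].equals("-k") ? args[3] : args[2];
    // ... count the words in filename, sort with the chosen routine,
    // and print using the provided printing code ...
}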
Provided Code
Download the project to get started:
CSE332_Project2.zip
Unzip and place the CSE332_Project2 folder in your workspace folder. From Eclipse, import it using File -> Import -> General -> Existing Projects into Workspace, selecting CSE332_Project2 as the root directory.
We believe the provided code is correct, but let us know if you see any problems. Note that the provided code
will not compile or run correctly until you complete "1. Getting Started" of Phase A.
Here are brief descriptions of the main provided files in roughly the order we suggest you read them.
- DataCount.java
- A simple container for an object and an associated count.
- DataCounter.java
- Like a dictionary, except it maintains a count for each data item. We do not require the element type E to be
"comparable". Instead, constructors will take function objects of type Comparator and use them to do comparisons.
Also notice that a DataCounter provides an iterator SimpleIterator<DataCount<E>>.
- SimpleIterator.java
- The iterator that a DataCounter must provide. We do not use Java's iterator type because it obligates
us to certain rules that are difficult to implement (if curious, read about concurrent modification exceptions).
- Comparator.java
- The type for function objects that do comparisons between pairs of elements. Constructors of DataCounter
implementations should take it as an argument, as illustrated in BinarySearchTree.java.
- DataCountStringComparator.java
- An implementation of Comparator that orders two DataCount<String> objects.
It requires that you correctly implement StringComparator.
- BinarySearchTree.java
- An implementation of DataCounter using a binary search tree. It is provided as an example of how to use
function objects and iterators. The iterators you write will not be as difficult.
- TestBinarySearchTree.java
- A class containing JUnit tests for BinarySearchTree. It is provided as a simple example JUnit test.
- WordCount.java
- Processes a text file and prints out the words from the file in frequency order.
- FileWordReader.java
- Handles input files, converting each word into lower case and removing punctuation, so that every string you process
will contain just the 26 lowercase English letters.
- GStack.java
- Same generic stack interface you saw in Project 1.
- Heap.java
- Interface for a Heap (priority queue) that your FourHeap should implement. Your FourHeap should use a function object
for comparisons, just like the DataCounter implementations.
- Hasher.java
- Interface for a function object your hashtable implementation will want to take in as a parameter.
Note that you have been provided several other files in addition to the ones listed above. These other files are mostly
full of stubs that you must fill in (there are instructions in the comments inside of those files). You will need to implement
classes in phaseA, phaseB, testA and testB packages and make several changes to
WordCount.java and Sorter.java. You should NOT modify the providedCode package at all.
You should *NOT* modify the provided structures, such as package names, file names, interfaces of public methods
and constructors, etc. You also should *NOT* move the provided files to other packages.
However, you may add new files (add files to packages *other* than providedCode), and you may add new packages to the project as needed.
You can add or modify private methods, and you can add public methods and constructors as long as doing so does not
violate the style guidelines (e.g., too many unnecessary methods/classes, or a new method that exposes sensitive internal data of the class).
Phase A Programming
In Phase A, you will work on a simpler but helpful subproblem: computing the frequency of words in a text.
HashTable, Other sort and Top-k sort are part of Phase B, so your Phase A submission may simply ignore the parameters
-h, -os and -k <number> or print an error message. Note that the write-up questions
are all submitted with Phase B, but some of them refer to work you did in Phase A, so you may wish to start on them early.
1. Getting Started
After this step, running:
java WordCount <filename>
should produce the correct output (implicitly using -b -is, since you do not yet process them).
For the code to compile and generate the correct output, you need to first implement the following:
- GArrayStack: Needed for BinarySearchTree iterator. You may use your code from Project 1.
- StringComparator: Used by both data counters and sorting algorithms. Because of how the output must be sorted
in the case of ties, your implementation should return a negative number when the first argument to compare
comes alphabetically before the second argument. Do not use any String comparison provided in the Java library;
the only String methods you are allowed to use are length and charAt. (A sketch appears after this list.)
- WordCount.getCountsArray: The provided code simply throws an error. Your code should use the
argument's iterator to retrieve all the elements and put them in a new array. The code you write is using (not
implementing) a SimpleIterator. If you have trouble with casting, take another look at
these notes on generic arrays. (A sketch also appears after this list.)
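For reference, here is a minimal sketch of StringComparator under these rules, assuming the project's Comparator interface declares a single compare method (as illustrated in BinarySearchTree.java). Because FileWordReader lowercases everything, subtracting character codes yields alphabetical order:

public class StringComparator implements Comparator<String> {
    public int compare(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return a.charAt(i) - b.charAt(i); // negative when a comes first
            }
        }
        return a.length() - b.length(); // shared prefix: shorter string comes first
    }
}

And a sketch of getCountsArray. The iterator is obtained via getIterator() (see the Correlator section); the size query and the hasNext/next names are guesses, so use whatever DataCounter.java and SimpleIterator.java actually declare. The unchecked cast is the standard workaround from the generic-arrays notes:

private static DataCount<String>[] getCountsArray(DataCounter<String> counter) {
    @SuppressWarnings("unchecked")
    DataCount<String>[] counts = (DataCount<String>[]) new DataCount[counter.getSize()];
    SimpleIterator<DataCount<String>> it = counter.getIterator();
    int i = 0;
    while (it.hasNext()) {
        counts[i++] = it.next();
    }
    return counts;
}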
2. Adding Data Counter Implementations
Provide two additional implementations of DataCounter as described below. You should provide the methods
defined in DataCounter. Follow the instructions & hints found inside of each file listed below.
Then, adjust WordCount to use your new DataCounter implementations with the appropriate command-line arguments.
- AVLTree: A self-balancing search tree, subclass of BinarySearchTree
- MoveToFrontList: A type of linked list where new items are inserted at the front of the list, and an existing item
gets moved to the front whenever it is referenced. Although it has O(n) worst-case time operations,
it does well when some words are much more frequent than others, because frequent ones will be near the front.
(A sketch of the move-to-front step appears after this list.)
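To make the move-to-front step concrete, here is an illustrative sketch of incrementing a word's count. The node structure, field names, and the method name incCount are assumptions for illustration, not the provided code:

public void incCount(E data) {
    ListNode prev = null, cur = front;
    while (cur != null && comparator.compare(cur.count.data, data) != 0) {
        prev = cur;
        cur = cur.next;
    }
    if (cur == null) {
        front = new ListNode(new DataCount<E>(data, 1), front); // new word: insert at front
    } else {
        cur.count.count++;      // existing word: bump its count...
        if (prev != null) {     // ...and splice its node to the front
            prev.next = cur.next;
            cur.next = front;
            front = cur;
        }
    }
}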
3. Adding a Sorting Algorithm
Implement Sorter.heapSort and FourHeap as explained below. For full credit, heapSort
should be generic. You do not need to implement an in-place heapsort as was discussed in lecture.
Then, adjust WordCount to use your heapSort method when the second command-line argument is -hs.
- heapSort: Consists of two steps: (1) insert each element to be sorted into a heap (FourHeap);
(2) remove each element from the heap, storing them in order in the output array. (See the sketch after this list.)
- FourHeap: A heap in which each node has four children, implementing the Heap interface (hint: complete written homework #2 before attempting this).
Follow the instructions & hints found inside of FourHeap.java and Heap.java.
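A sketch of the two-step heapSort appears below; it assumes Heap declares insert and deleteMin (check Heap.java for the actual names). For FourHeap itself, recall the index arithmetic for a four-ary heap stored in an array with the root at index 0: the children of node i live at indices 4i+1 through 4i+4, and the parent of node i is at (i-1)/4.

public static <E> void heapSort(E[] array, Comparator<E> comparator) {
    Heap<E> heap = new FourHeap<E>(comparator);
    for (E item : array) {           // step 1: insert everything
        heap.insert(item);
    }
    for (int i = 0; i < array.length; i++) {
        array[i] = heap.deleteMin(); // step 2: elements come out in sorted order
    }
}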
4. JUnit Tests
Place your tests in packages test and testA, in these files: TestDataCounter (Optional), TestSorter,
TestAVLTree, TestMoveToFrontList and TestFourHeap.
TestBinarySearchTree is already implemented as a simple example. Note that it is
a subclass of TestDataCounter, which can be used to factor out the common testing code of different
data counter implementations (since all data counters should have the same external behavior).
However, you don't have to use TestDataCounter if you are new to JUnit or prefer to have copies of code
in different test classes. We don't care too much about style in testing code other than naming &
number of assertions, as long as your tests are thorough.
We expect your JUnit tests to be more thorough than TestBinarySearchTree, testing both internal and external
functionality. To test internal functionality, consider writing a public method for testing in the class being tested,
calling that method in your JUnit tests, and comparing with the expected output. In general, you should not make public methods that
would expose your private data solely for testing purposes (e.g., writing a method like getOverallRoot() in
AVLTree for testing is bad style, since it allows external code to modify the root of the tree!).
By "a public method for testing" we mean a method that actually tests your class using its fields.
You do not need to test functionality you did not implement. (A small example test appears below.)
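For instance, a small test for FourHeap might look like the following, assuming JUnit 4 annotations and the Heap method names sketched earlier:

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class TestFourHeap {
    @Test
    public void deleteMinReturnsElementsInOrder() {
        Heap<Integer> heap = new FourHeap<Integer>(new Comparator<Integer>() {
            public int compare(Integer a, Integer b) { return a - b; }
        });
        heap.insert(5);
        heap.insert(1);
        heap.insert(3);
        assertEquals(1, (int) heap.deleteMin()); // smallest element comes out first
        assertEquals(3, (int) heap.deleteMin());
        assertEquals(5, (int) heap.deleteMin());
    }
}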
Phase B Programming
In Phase B, you will provide an additional DataCounter implementation, an additional sorting implementation,
and compute a numeric value that quantifies the "similarity" of two texts. You will also perform some simple experiments
to determine which implementations are the fastest. In this phase, we purposely give somewhat less guidance on how to
organize your code and implement your algorithms. It is up to you to make good design decisions.
You should fix any bugs and design weaknesses of your Phase A code at this time as well, since it will be more
thoroughly graded after Phase B is turned in.
1. Adding more Data Counter and Sorting Algorithms
Implement HashTable, StringHasher, Sorter.otherSort, and Sorter.topKSort.
Follow the instructions & hints at the top of each provided file. Adjust WordCount.java to process appropriate parameters.
For the topK option, if the second argument is -k, then the third argument should be a number (and the fourth is the filename).
When the -k <number> option is used, WordCount should print out only the <number>
most frequent words, resolving ties alphabetically. If the text has fewer than k distinct words, print all of them.
- HashTable: You may implement any kind of hash table discussed in class. Your HashTable should rehash
as appropriate (use an appropriate load factor, as discussed in class), and its capacity should always be a prime number.
Your HashTable should grow, and it should be able to handle an input size larger than the sample text files given to you
(it should be able to grow to a capacity of at least 200,000).
- StringHasher: To use your HashTable in WordCount, you need to hash strings.
Implement your own hashing strategy for strings. Do not use Java's hashCode method.
As in StringComparator, the only String methods you are allowed to use are length and charAt.
- Sorter.otherSort: Implement either quicksort or mergesort and use it with the -os option.
As with insertionSort and heapSort, your sorting code should be generic.
Your sorting algorithm should meet its expected runtime bound.
- Sorter.topKSort: An easy way to implement this would be to sort all the words by frequency as usual and then
just print k of them; this approach finds the k most frequent words in O(n log n) time.
However, your implementation should have O(n log k) runtime, assuming k is less than or equal to n
(hint: use a heap, but never put more than k elements into it, and think about why this gives an O(n log k) bound).
Efficiently tracking the k most frequent words will require a different comparator than the one you used in heapSort.
(A sketch appears after this list.)
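Here is a sketch of that idea. It assumes the comparator orders elements so that deleteMin evicts the least desirable of the current top k, and that the heap exposes a size() helper; what to do with the unused tail of the output array is left to your design.

public static <E> void topKSort(E[] array, Comparator<E> comparator, int k) {
    Heap<E> heap = new FourHeap<E>(comparator);
    for (E item : array) {
        heap.insert(item);       // O(log k): the heap never holds more than k+1 items
        if (heap.size() > k) {
            heap.deleteMin();    // evict the worst of the k+1 elements
        }
    }
    int kept = heap.size();      // may be less than k for small inputs
    for (int i = kept - 1; i >= 0; i--) {
        array[i] = heap.deleteMin(); // elements emerge worst-first, so fill backwards
    }
    // n inserts and at most n evictions, each O(log k): O(n log k) overall.
}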
2. JUnit Tests
Place your HashTable tests in TestHashTable and add appropriate testing code to TestSorter for otherSort and topKSort.
3. Document Correlation
How to quantify the similarity between two texts is a large area of study. We will use a simple approach, leaving more
advanced models such as Latent Semantic Indexing
to Above and Beyond. Add code to Correlator.java to implement this simple approach:
- Calculate word counts for the two documents and use each document's length to create normalized frequencies,
so that frequencies can be meaningfully compared between documents of different lengths (i.e., use frequency percentages).
- Ignore words whose normalized frequencies are too high or too low to be useful. A good starting point is to remove or ignore words
with normalized frequencies above 0.01 (1%) or below 0.0001 (0.01%) in both documents.
- Then, for every word that occurs in both documents, take the difference between the normalized frequencies, square that difference,
and add the result to a running sum. This should be done using a *SINGLE* iterator:
only one iterator is used in Correlator.java, *NOT* one iterator per DataCounter
(you should call dataCounter.getIterator() just once in Correlator.java). Hint: take advantage of DataCounter's methods.
(A sketch of this loop appears at the end of this section.)
- The final value of this running sum will be your difference metric, which corresponds to the square of the Euclidean
distance between two vectors in the space of shared words in the two documents. Note that this ignores any word that
does not appear in both documents, which is probably the biggest weakness of this metric.
- Correlator.java should output a single number, which is the computed correlation as described above
(do *NOT* print any extra output other than the single number printed from the provided printing code).
- Comparing Hamlet and The New Atlantis produces a similarity of 5.657273669233966E-4.
Your answer should be very close to this one (like 5.65727366923****E-4, where each * can be any digit).
- Comparing documents can be fun. Feel free to post links to interesting text examples on the course message board.
Correlator.java should accept command line parameters indicating which DataCounter to use:
java Correlator [ -b | -a | -m | -h ] <filename1> <filename2>
For example,
Using BinarySearchTree: java Correlator -b hamlet.txt the-new-atlantis.txt
Using HashTable: java Correlator -h hamlet.txt the-new-atlantis.txt
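To make the single-iterator requirement concrete, here is a hypothetical sketch of the core loop. It iterates over the first counter only and probes the second with an assumed lookup method (getCount); totalWords1 and totalWords2 are the document lengths, and the filter is one reading of the "in both documents" rule above, so confirm your version against the expected output:

double sum = 0;
SimpleIterator<DataCount<String>> it = counter1.getIterator(); // the ONLY iterator
while (it.hasNext()) {
    DataCount<String> dc = it.next();
    int count2 = counter2.getCount(dc.data);          // assumed lookup method
    if (count2 == 0) continue;                        // word must occur in both documents
    double freq1 = (double) dc.count / totalWords1;   // normalized frequencies
    double freq2 = (double) count2 / totalWords2;
    boolean extreme1 = freq1 > 0.01 || freq1 < 0.0001;
    boolean extreme2 = freq2 > 0.01 || freq2 < 0.0001;
    if (extreme1 && extreme2) continue;               // filter words extreme in both documents
    double diff = freq1 - freq2;
    sum += diff * diff;                               // squared Euclidean contribution
}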
4. Experiment & Write-up (Due with Phase B)
Submit a README.pdf file, answering the questions in this template README file:
(README.docx, README.pdf)
The last three write-up questions will ask you to design and run some experiments to determine which implementations are
faster for various inputs. Answering these questions will require writing additional code to run the experiments,
collecting timing information and producing result tables and graphs, together with relatively long answers.
Do not wait until the last minute! Insert tables and graphs in README.pdf as appropriate, and be sure to
give each one a title and label the axes for the graphs. Place all your timing code into the package writeupExperiment.
Be careful not to leave any write-up related code in the phase A & B files (such as printing runtime in WordCount).
To prevent losing points due to the modifications made for the write-up experiments, we recommend that you copy all files that
need to be modified for the experiments into the package writeupExperiment, and start working from there.
Files in different packages can have the same name, but when editing be sure to check if you are using the correct file!
You need to write a second hashing function for Question 10. To exaggerate the difference between the two hash functions,
you will want to compare a very simple hash function with a decent one (the one used in Phase B); a sketch of such a pair appears below.
In Phase B, you ignored words with normalized frequencies above 0.01 (1%) or below 0.0001 (0.01%) in both documents.
When determining who wrote Shakespeare's plays for Question 12, feel free to play with these percentages.
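For instance, a deliberately weak hasher might sum character codes, so that anagrams like "act" and "cat" collide, while a polynomial hash of the kind suitable for Phase B weights characters by position. Whether Hasher is generic and what its method is called are assumptions here; adapt to Hasher.java:

class SimpleStringHasher implements Hasher<String> {     // deliberately weak
    public int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h += s.charAt(i);         // position-blind: anagrams collide
        }
        return h;
    }
}

class PolynomialStringHasher implements Hasher<String> { // decent
    public int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // weights characters by position
        }
        return h;
    }
}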
All graphs and tables should be inserted into your README
file. For all experimental results, we would like to see a detailed
interpretation, especially when the results do not match your
expectations. For more stable results, we recommend averaging the runtime
of multiple runs, as shown in the example timing code below. Java
optimizes repeated operations, so code runs faster near the end of the
loop. Throw away several runs at the beginning of the loop to
counter this effect (JVM warmup).
private static final int NUM_TESTS = 10;  // total runs per configuration (example value)
private static final int NUM_WARMUP = 2;  // initial runs discarded as JVM warmup (example value)

private static double getAverageRuntime(String[] args) {
    double totalTime = 0;
    for (int i = 0; i < NUM_TESTS; i++) {
        long startTime = System.currentTimeMillis();
        WordCount.main(args);
        long endTime = System.currentTimeMillis();
        if (NUM_WARMUP <= i) { // throw away the first NUM_WARMUP runs to counter JVM warmup
            totalTime += (endTime - startTime);
        }
    }
    return totalTime / (NUM_TESTS - NUM_WARMUP); // average runtime of the measured runs
}
Turn in Information
Important reminders:
- You are allowed to add methods, classes and packages, but you should not change the names & interfaces of the given
packages, classes and public methods and constructors. Also do not move the given classes to other packages.
(Adding is ok; deleting & modifying public interfaces is not allowed.)
- You should not edit the files in the package providedCode.
- You should not use any Java libraries in your implementation (using methods from the Math package is ok), except in your JUnit tests
and timing code. Make sure you have read the clarifications page.
- Do not change the output (printing) format of WordCount and Correlator (used for grading).
Also, do not produce any extra output other than what is printed using the provided printing code
(for example, do not print the runtime in WordCount).
- For all subclasses of DataCounter, do not override the toString method of DataCounter (used for grading).
- Make sure your code is properly documented (recall the Programming Guidelines).
Phase A:
- Zip the src folder and turn in src.zip
- Please make sure that src.zip decompresses its contents into a folder called src, which contains
the folders main, phaseA, phaseB, providedCode, test, testA, testB and writeupExperiment.
- It is fine to have added some phase B stuff at this time, as long as everything compiles and all
phase A code works as expected.
Phase B:
- Zip the src folder and turn in src.zip
- Turn in README.pdf
- If you did any extra credit, turn in extra.zip. All extra credit work should be in extra.zip;
do not place any extra credit work in src.zip.
- Please make sure that src.zip decompresses its contents into a folder called src, which contains
the folders main, phaseA, phaseB, providedCode, test, testA, testB and writeupExperiment.
- You should fix bugs and address any design weaknesses in your Phase A code at this time as well.
It will be more thoroughly graded after the Phase B turn-in.
Above and Beyond (Due with Phase B) (optional)
You may do any or all of the following; pick the ones you find interesting. Please be sure that you have completely
finished the project before attempting any of these. We advise you to start on these only if you have time left after
thoroughly testing all your code and carefully checking all your write-up answers.
Recall that any extra credit you complete is noted and kept separate in the gradebook and may or may not be used to adjust
your grade at the end of the quarter, as detailed in the course grading policy.
Submit your extra credit files separately in extra.zip, and provide a very detailed explanation for
each of your implementations in question 4 of README.pdf.
- Alternate Hashing Strategies: Implement both closed and open addressing and perform experimentation to compare
performance. Also design additional hashing functions and determine which affects performance more: the cost of hashing,
the collision-avoidance of hashing, or your addressing strategy.
- Excluded Words: Adjust WordCount and Correlator to take an additional file that contains
words that should be ignored. For example, the file might contain very common words
(like the, a, and she) or it might be another text (so WordCount would then report words not appearing in the other text).
There are multiple ways to implement this feature: you could skip words-to-be-ignored as you process the input file, or when
sorting the elements, or when printing them. Which is fastest, and why do you think this is? What data structure did you use
to determine efficiently if a word should be ignored, and why is this efficient?
Note that for top-k, you should still print k words (assuming there are k words that
should not be ignored). How does this affect the ways you can implement this skip-some-words feature?
- Introspective Sort: Introspective sort is an unstable quicksort variant that switches
to heapsort for inputs that would result in an O(n^2) running time for normal quicksort.
Thus, it has an average-case and a worst-case runtime of O(n log n), but generally runs faster than
heapsort even in the worst case. Implement IntrospectiveSort, and give a sample input that would result
in a quadratic runtime for normal quicksort (using a median-of-3 partitioning scheme).
- Word Stemming: When working with document correlations, it is often desirable to work only with the roots of words.
In this way, "sleeps", "sleeping", and "sleep" are all considered to be the same word. This process is called
word stemming, and is used in search engines and many other contexts. However, proper stemming is difficult.
Dropping an 's' from the end of a word seems natural but turns "bus" into "bu". Dropping punctuation seems like a good idea
but it equates "it's" and "its". One common algorithm is Porter Stemming. Implement this algorithm.
Try to do so without looking at the source code posted at the link. If you do look at their source code, cite it properly.
- Word Co-Occurrence: A phenomenon even more interesting than word frequency is word co-occurrence.
Create a new WordCount that counts the frequency of pairs of words. Your code should insert as a pair any
two words that appear within c words of each other, where c is a command-line argument (try 10 for test runs).
- Better Correlation Models: Research better ways to quantify the correlation between two text documents and what
their advantages/disadvantages are. Describe what you found and implement an algorithm more sophisticated than the simple
one in Correlator. Explain how your algorithm affects your conclusions.
Interesting Tidbits
- Word-frequency analysis plays a central role in providing the input data for tag clouds.
There are many uses for tag clouds, such as indicating words that are more common in some writing (e.g., someone's blog)
than they are more generally (e.g., on all web pages).
- Word-frequency analysis also plays an important role in cryptanalysis, the science of breaking secretly encoded messages.
The first mention of using the frequency distribution of a language to break codes was in a 14-volume Arabic encyclopedia
written by al-Qalqashandi in 1412. The idea is attributed to Ibn al-Duraihim, the brilliant 13th-century Arab cryptanalyst.
If you are interested in cryptography, be sure to check out The Code Book by Simon Singh;
it is a great introduction to cryptography and cryptanalysis.
- Think computer science is all fun and games? The Codebreakers, by David Kahn, is a fascinating look at many of the
problems solved by cryptanalysis, from breaking WWII secret messages to the very question of who wrote Shakespeare's works!
Acknowledgments
A variant of this project was the third and final project in CSE326 for many years. The first word counter
appeared during Spring 2000 when 326 was taught by the famous Steve Wolfman. The snowy-bearded historian appeared in
Autumn 2001, courtesy of Adrien Treuille. The assignment was subsequently tweaked over the years by Matt Cary,
Raj Rao, Ashish Sabharwal, Ruth Anderson, David Bacon, Hal Perkins, and Laura Effinger-Dean. Dan Grossman adapted
it to be the second project in CSE332. James Fogarty stayed up late during the snow-pocalypse of 2012 to clarify
confusions experienced in previous offerings (which had been carefully detailed by Ruth Anderson).