Last Updated: October 10, 2013

CSE 332 Autumn 2013: Project 2 - Shake-n-Bacon

Outline

Due Dates and Turn-In

Use the course dropbox to submit your assignment electronically.

Working with a Partner

You are strongly encouraged, but not required, to work with a partner of your own choosing for this project. You may work with somebody you already know or use the course message board to find a partner. No more than two students total may form a team. You may divide the work however you wish, under three conditions:

Other logistics:

Test all of your code together to be sure it properly integrates. Start this early, and do not attempt to merge your code at the last minute as you will very likely experience problems with integration. You may wish to set up an SVN repository to make collaboration easier. If so, contact the course staff regarding project space.

Team members will receive the same project grade unless there is an extreme circumstance and you notify us in advance of the deadline.

If you plan to work with a partner, one partner MUST send HyeIn an email using this link. In your email, provide the following information for BOTH partners: full name, UWNetID, CSE Email address.

Introduction and Word-Frequency Analysis

You have just been approached by a world-famous UW history professor. He would like you to settle a very old debate: who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside your area of expertise. "Oh, no," chuckles the historian, stroking his snowy white beard. "I need a Computer Scientist! Someone who can efficiently compare any two documents in terms of their word frequency."

Authors tend to use some words more often than others. For example, Shakespeare used "thou" more often than Bacon. The professor believes a "signature" can be found for each author, based on the frequencies of words in the author's works, and that this signature should be consistent across the works of a particular author but vary greatly between authors. In Phase A, you will work on the simpler, helpful subproblem of computing the frequency of words in a text. In Phase B, you will compute a numeric value that quantifies the "similarity" of two texts.

Perhaps your work can be applied to the works attributed to Shakespeare and Bacon to settle the ancient debate. The professor has provided you with copies of Shakespeare's writing (Hamlet) and Bacon's writing (The New Atlantis), which he has painstakingly typed by hand from his antique, leather-bound first editions. Being good scientists, however, you quickly realize that it is impossible to draw strong conclusions based on so little data, and asking him to type more books is out of the question! Thus, for Phase B you should download and analyze several more works, as many as you feel are necessary to support your conclusion. Project Gutenberg is a good place to look.

When working with document correlations, it is often desirable to work only with the roots of words. In this way, "sleeps", "sleeping", and "sleep" are all considered to be the same word. This process is called word stemming, and is used in search engines and many other contexts. However, proper stemming is difficult. Dropping an 's' from the end of a word seems natural but turns "bus" into "bu". Dropping punctuation seems like a good idea but it equates "it's" and "its". Implementing a decent stemming algorithm (such as Porter Stemming) is above and beyond work for this project.

The code provided to you does not really do stemming, but it does normalize the input as follows:

Hence every string you process will contain just the 26 lowercase English letters (although this doesn't deeply impact the project).

Learning Objectives

For this project, you will:


The Assignment

This assignment consists of two phases. Each phase includes implementation of data structures, JUnit tests, and programs manipulating those structures. Phase B also involves collecting some experimental data and completing write-up questions.

A large part of your work will involve using the WordCount program. Running the WordCount program should output the frequency of each word in the document, starting with the most frequent words and resolving ties in alphabetical order. For example, output might look like this:

     970     the 
     708     and 
     666     of 
     632     to 
     521     at 
     521     i 
     521     into 
     466     a 
     444     my 
     391     in 
     383     buffalo 
     ...

Note that the actual printing code, along with many other parts, is provided. You will implement different data-counters and sorting routines to see different ways to solve this problem. We will explain all the command-line options you need to support, but we do not recommend that you start by implementing all of these. Instead, go one step at a time, and add command-line options when it is convenient. Overall, the WordCount program takes arguments as follows:

java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os | -k <number> ] <filename>

For example, at the end of phase A, your code should work with command lines like the following:

java WordCount -b -is filename

java WordCount -a -hs filename

You will need to modify WordCount.java to process these parameters. As provided, it only processes a single parameter: the filename.
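
For instance, the flag handling might be organized along these lines. This is a minimal sketch, not the required structure: the usage check and dispatch are yours to design, and constructor signatures such as AVLTree's are assumptions to check against the starter code.

    public static void main(String[] args) {
        // -k takes an extra <number> argument, so valid lengths are 3 or 4
        if (args.length < 3 || args.length > 4) {
            System.err.println("Usage: java WordCount [ -b | -a | -m | -h ]"
                    + " [ -is | -hs | -os | -k <number> ] <filename>");
            System.exit(1);
        }
        String counterFlag = args[0];               // which DataCounter
        String sortFlag    = args[1];               // which sort, or -k
        String filename    = args[args.length - 1]; // filename is always last

        DataCounter<String> counter;
        if (counterFlag.equals("-b")) {
            counter = new BinarySearchTree<String>(new StringComparator());
        } else if (counterFlag.equals("-a")) {
            counter = new AVLTree<String>(new StringComparator());
        } else {
            // ... handle -m and -h similarly ...
            throw new IllegalArgumentException("unknown flag: " + counterFlag);
        }
        // read the words from filename into counter, then sort and print
        // according to sortFlag
    }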


Provided Code

Download the project to get started:

project2.zip

Unzip and place the project2 folder in your workspace folder. From Eclipse, import project2 (right-click in the Package Explorer -> Import -> General -> Existing Projects into Workspace -> select project2 as the root directory). The files are described below, including how the code is organized. Note that the provided code will not compile or run correctly until you complete the first part ("1. Getting Started") of Phase A.

We believe the code provided to you is correct (let us know of any problems). You will need to implement classes in the phaseA, phaseB, testA, and testB packages and make several changes to WordCount.java. Here are brief descriptions of the main provided files, in roughly the order we suggest understanding them.

DataCount.java
A simple container for an object and an associated count.
DataCounter.java
Like a dictionary, except it maintains a count for each data item (instead of storing an arbitrary value). We do not require the element type E to be "comparable". Instead, constructors for implementations will take function objects of type Comparator and use them to do comparisons. Also notice that a DataCounter provides an iterator SimpleIterator<DataCount<E>>.
SimpleIterator.java
The iterator that a DataCounter must provide. We do not use Java's iterator type because doing so obligates us to certain rules that are difficult to implement properly (if curious, read about "concurrent modification exceptions").
Comparator.java
The type for function objects that do comparisons between pairs of some element type. Constructors of DataCounter implementations should take a Comparator as an argument, as illustrated in BinarySearchTree.
BinarySearchTree.java
An implementation of DataCounter using a (you guessed it) binary search tree. You should not modify this file, but it is a good example of how to do function objects and iterators. The iterators you write will not be as difficult.
WordCount.java
The primary file for Phase A. You will need to make several changes to this file. It processes a text file and prints out the words from the file in frequency order. More detail is provided in the description below.
TestBinarySearchTree.java
A class containing JUnit tests for BinarySearchTree. This is just an example that may or may not help to inspire you while writing test files for the data structures you implement.
GStack.java
Same generic stack interface you saw in Project 1.
FileWordReader.java
Handles input, converting each word into lower-case and removing punctuation.
DataCountStringComparator.java
An implementation of Comparator that orders two DataCount<String> objects in the proper order for Phase A. It requires that you correctly implement StringComparator, as detailed below (a comparator sketch appears after this list).
PriorityQueue.java
Interface your heap will implement. Your heap should use a function object for comparisons, just like the DataCounter implementations.
Hasher.java
Interface for a function object your hashtable implementation will want to take as a parameter.
Note that you have been provided several other files in addition to the ones listed above. These other files are mostly full of stubs that you must fill in and contain instructions in the comments inside of those files. Please do NOT modify the structure that has been provided in terms of packages, filenames, etc.
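
To make the function-object idiom concrete, here is a sketch of what a StringComparator might look like. It assumes the provided Comparator interface declares int compare(E e1, E e2) with the usual negative/zero/positive convention; verify the exact signature against Comparator.java.

    public class StringComparator implements Comparator<String> {
        public int compare(String s1, String s2) {
            // Compare character by character rather than delegating to
            // String.compareTo, in case the assignment forbids that shortcut.
            int shorter = Math.min(s1.length(), s2.length());
            for (int i = 0; i < shorter; i++) {
                int diff = s1.charAt(i) - s2.charAt(i);
                if (diff != 0) {
                    return diff;
                }
            }
            return s1.length() - s2.length(); // a prefix orders first
        }
    }

A DataCounter implementation then never calls compareTo itself; it asks the comparator it was handed, e.g. comparator.compare(newItem, node.data), exactly as BinarySearchTree demonstrates.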

Phase A Programming

1. Getting Started

For the code to compile and generate the correct output, you need to implement the following:

At this point your program should produce the correct output (implicitly using options -b and -is, since you do not yet process command-line parameters).

2. Adding Other Data-Counter Implementations

Provide two additional implementations of DataCounter as described below. You should provide the methods defined in DataCounter. Follow the instructions & hints found inside each file listed below.

Adjust WordCount to use your new DataCounter implementations with the appropriate command-line arguments: -a for AVL tree, and -m for MoveToFrontList. Note that the write-up questions are all submitted with Phase B, but some of them refer to work you did in Phase A. You may wish to start on them early.

3. Adding a sorting algorithm (heapSort)

Now provide a different sorting algorithm called heapSort. heapSort needs a priority queue, so you will also need to provide an implementation of one. In particular, you should provide an implementation of PriorityQueue in the class FourHeap as follows:
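
If, as the name suggests, FourHeap is an array-based four-ary heap (an assumption; the specification above governs), the index arithmetic generalizes the binary-heap formulas:

    // Assuming a zero-based array representation of a four-ary heap:
    private int parent(int i)     { return (i - 1) / 4; }
    private int firstChild(int i) { return 4 * i + 1; } // children: 4i+1 .. 4i+4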

Once you have a working PriorityQueue, you can use it to implement heapSort. heapSort works as follows:

Put your algorithm in a static method called heapSort in WordCount.java. For full credit, your sorting routine should be generic like insertionSort (the two should have the same arguments). Note: You do not need to implement an in-place heapsort as was discussed in lecture. Adjust WordCount to use your heapSort method when the second command-line argument is -hs.
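
A minimal sketch of the method, assuming insertionSort's signature is (E[] array, Comparator<E> comparator) and that PriorityQueue offers insert and deleteMin; check the provided interfaces for the real names.

    public static <E> void heapSort(E[] array, Comparator<E> comparator) {
        PriorityQueue<E> heap = new FourHeap<E>(comparator);
        for (E item : array) {
            heap.insert(item);           // n inserts: O(n log n) total
        }
        for (int i = 0; i < array.length; i++) {
            array[i] = heap.deleteMin(); // items emerge in sorted order
        }
    }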

4. JUnit Tests

Implement test files in package testA: TestAVLTree, TestMoveToFrontList, TestFourHeap. TestBinarySearchTree is already implemented.


Phase B Programming

In Phase B, you will provide an additional DataCounter implementation, an additional sorting implementation, and support for printing only the k most-frequent words in the file. You will then write a different main program that computes the similarity of two files. Finally, you will perform some simple experimentation to determine which implementations are fastest. In this phase, we purposely give somewhat less guidance on how to organize your code and implement your algorithms. It is up to you to make good design decisions.

1. Another Counter and Sorter

You need to provide only the methods defined in DataCounter. Follow the instructions & hints at the top of each provided file.

2. JUnit Tests

Implement test files in package testB: TestHashTable, TestOtherSort.

3. Top k Words

Extend WordCount such that if the second argument is -k, then the third argument is a number (and the fourth argument is the filename). When the -k option is used, WordCount should print out only the k most frequent words. If the text has fewer than k distinct words, then print all word frequencies. If there are ties for the k most frequent words, resolve them alphabetically so that you still print exactly k words. You should still print the words in the same order: most-frequent first with ties resolved alphabetically.

An easy way to implement this feature would be to sort all the words by frequency and then just print k of them. This approach finds the k most frequent words in time O(n log n), where n is the number of distinct words. You will need to implement this simple approach for comparison purposes in your experimentation (see the write-up questions), but WordCount should be configured to use a more efficient approach that runs in time O(n log k) (assuming k is less than n); one way to do this is sketched below.
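
The sketch keeps a heap of at most k entries whose root is the least frequent of the current best k. Here leastFrequentFirst is a hypothetical comparator (the reverse of the final printing order), and the iterator and heap method names are assumptions to check against the provided interfaces.

    PriorityQueue<DataCount<String>> heap =
            new FourHeap<DataCount<String>>(leastFrequentFirst);
    int size = 0;
    SimpleIterator<DataCount<String>> it = counter.getIterator(); // name may differ
    while (it.hasNext()) {
        heap.insert(it.next());
        if (++size > k) {
            heap.deleteMin(); // evict the (k+1)'th-place word: O(log k)
            size--;
        }
    }
    // The k survivors are the k most frequent words; deleteMin them into an
    // array and reverse it to print most-frequent first.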

Helpful Hints:

4. Document Correlation

How to quantify the similarity between two texts is a large area of study. We will use a simple approach, leaving more advanced models such as Latent Semantic Indexing to Above and Beyond. Implement this simple approach:

Fill in the class Correlator with a main method that computes the correlation between two documents. Its only output should be a single floating-point number which is the correlation computed as described above. Command-line flags should specify which DataCounter to use, as in WordCount:

java Correlator [ -b | -a | -m | -h ] <filename1> <filename2>

For example, your correlator should work with command lines like the following:

java Correlator -b filename1 filename2

java Correlator -h filename1 filename2
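
A skeletal Correlator main might look like the following. This is a sketch only: countWords and correlate are hypothetical helpers standing in for code you write, and the correlation itself must follow the simple approach described above.

    public class Correlator {
        public static void main(String[] args) {
            if (args.length != 3) {
                System.err.println("Usage: java Correlator"
                        + " [ -b | -a | -m | -h ] <filename1> <filename2>");
                System.exit(1);
            }
            // args[0] selects the DataCounter, as in WordCount
            DataCounter<String> counts1 = countWords(args[0], args[1]);
            DataCounter<String> counts2 = countWords(args[0], args[2]);
            double correlation = correlate(counts1, counts2);
            System.out.println(correlation); // the single required output
        }
    }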

Comparing documents can be fun. Feel free to post links to interesting text examples on the course message board.

One implementation comparing Hamlet to The New Atlantis computes a similarity of 5.657273669233966E-4. An answer close to this suggests you are probably on the right track.

5. Experimentation (for write-up)

The write-up questions ask you to design and run some experiments to determine which implementations are faster for various inputs. Answering these questions will require writing additional code to run the experiments and collect timing information. For example, writing code that runs word-counting for various values of k will be much easier than manually running your program with different values of k.
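
For example, a timing loop along these lines can sweep over values of k. This is a sketch: runTopK is a hypothetical helper wrapping your top-k code, and you should repeat each measurement several times, averaging or taking the median to smooth out JIT and garbage-collection noise.

    for (int k = 1; k <= maxK; k *= 2) {           // maxK: the largest k tested
        long start = System.nanoTime();
        runTopK(filename, k);                      // hypothetical helper
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(k + "\t" + elapsedMs + " ms");
    }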


Write-Up Questions (Due with Phase B)

Submit a README.pdf file, answering the questions in this template README file: (README.docx, README.pdf)

Note that the final three questions require writing code, collecting data, and producing results tables and/or graphs together with relatively long answers. So do not wait until the last minute. Insert tables/graphs in README.pdf as appropriate, and you can copy/paste your timing code at the end of your README.pdf file (Appendix) if you want. Formatting is not a component of your grade, as long as the report is clear and readable.


Above and Beyond (Due with Phase B)

You may do any or all of the following; pick ones you find interesting. Submit your 'above and beyond' files separately in a zip file.

  1. Alternate Hashing Strategies: Implement both closed and open addressing and perform experimentation to compare performance. Also design additional hashing functions and determine which affects performance more: the cost of hashing, the collision-avoidance of hashing, or your addressing strategy.
  2. Excluded Words: Adjust WordCount and Correlator to take an additional file that contains words that should be ignored. For example, the file might contain very common words (like the, a, and she) or it might be another text (so WordCount would then report words not appearing in the other text). There are multiple ways to implement this feature: you could skip words-to-be-ignored as you process the input file, or when sorting the elements, or when printing them. Which is fastest and why do you think this is? What data structure did you use to determine efficiently if a word should be ignored and why is this efficient? Note that for top-k, you should still print k words (assuming there are k words that should not be ignored). How does this affect the ways you can implement this skip-some-words feature?
  3. Introspective Sort: Introspective sort is an unstable quicksort variant that switches to heapsort for inputs that would result in an O(n²) running time for normal quicksort. Thus, it has an average-case and worst-case runtime of O(n log n), but generally runs faster than heapsort even in the worst case. Implement IntrospectiveSort, and give a sample input that would result in a quadratic runtime for normal quicksort (using a median-of-3 partitioning scheme).
  4. Word Stemming: The introduction discussed how it is difficult to remove suffixes, plurals, verb tenses, etc., from words, but it is important when performing text analysis (e.g., for web search engines). One common algorithm is Porter Stemming. Implement this algorithm. Try to do so without looking at the source code posted at the link. If you do look at their source code, cite it properly.
  5. Word Co-Occurrence: A phenomenon even more interesting than word frequency is word co-occurrence. Create a new WordCount that counts the frequency of pairs of words. Your code should insert as a pair any two words which appear within c words of each other, where c is a command-line argument (try 10 for test runs).
  6. Better Correlation Models: Research better ways to quantify the correlation between two text documents and what their advantages/disadvantages are. Describe what you found and implement an algorithm more sophisticated than the simple one in Correlator. Explain how your algorithm affects your conclusions.

Turn-in Information

Phase A:

Phase B:


Interesting Tidbits

Acknowledgments

A variant of this project was the third and final project in CSE326 for many years. The first word counter appeared during Spring 2000 when 326 was taught by the famous Steve Wolfman. The snowy-bearded historian appeared in Autumn 2001, courtesy of Adrien Treuille. The assignment was subsequently tweaked over the years by Matt Cary, Raj Rao, Ashish Sabharwal, Ruth Anderson, David Bacon, Hal Perkins, and Laura Effinger-Dean. Dan Grossman adapted it to be the second project in CSE332. James Fogarty stayed up late during the snow-pocalypse of 2012 to clarify confusions experienced in previous offerings (which had been carefully detailed by Ruth Anderson).