Last Updated: April 17, 2013

Shakespeare CSE 332 Spring 2013: Project 2 - Shake-n-Bacon Bacon

Outline

Due Dates and Turn-In

Use the course dropbox to submit your assignment electronically.

Working with a Partner

You are strongly encouraged, but not required, to work with a partner of your own choosing for this project. You may work with somebody you already know or use the course message board to help find a partner. No more than two students total may form a team. You may divide the work however you wish, under three conditions:

Other logistics:

Test all of your code together to be sure it properly integrates. Start this early, and do not attempt to merge your code on the due date. You will very likely experience problems with integration. You may wish to set up an SVN repository to make collaboration easier. If so, contact the course staff regarding project space.

Team members will receive the same project grade unless there is an extreme circumstance and you notify us in advance of the deadline.

If you plan to work with a partner, one partner MUST send HyeIn an email using this link. In your email, provide the following information for BOTH partners: full name, UWNetID, CSE Email address.

Introduction and Word-Frequency Analysis

You have just been approached by a world famous UW history professor. He would like you to settle a very old debate on who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside your area of expertise. "Oh, no," chuckles the historian, stroking his snowy white beard. "I need a Computer Scientist! Someone who can efficiently compare any two documents in terms of their word frequency."

Authors tend to use some words more often than others. For example, Shakespeare used "thou" more often than Bacon. The professor believes a "signature" can be found for each author, based on frequencies of words found in the author's works, and that this signature should be consistent across the works of a particular author but vary greatly between authors. In Phase A, you will work on the simpler and helpful subproblem of computing the frequency of words in a text. In Phase B, you will compute a numeric value that quantifies the "similarity" of two texts.

Perhaps your work can be applied to the works attributed to Shakespeare and Bacon to settle the ancient debate. The professor has provided you with copies of Shakespeare's writing (Hamlet) and Bacon's writing (The New Atlantis), which he has painstakingly typed by hand from his antique, leather-bound first-editions. Being good scientists, however, you quickly realize that it is impossible to draw strong conclusions based on so little data, and asking him to type more books is out of the question! Thus, for Phase B you should download and analyze several more works, as many works as you feel is necessary to support your conclusion. Project Gutenberg is a good place to look.

When working with document correlations, it is often desirable to work only with the roots of words. In this way, "sleeps", "sleeping", and "sleep" are all considered to be the same word. This process is called word stemming, and is used in search engines and many other contexts. However, proper stemming is difficult. Dropping an 's' from the end of a word seems natural but turns "bus" into "bu". Dropping punctuation seems like a good idea but it equates "it's" and "its". Implementing a decent stemming algorithm (such as Porter Stemming) is above and beyond work for this project.

The code provided to you does not really do stemming, but it does normalize the input as follows:

Hence every string you process will contain just the 26 lowercase English letters (although this doesn't deeply impact the project).

Learning Objectives

For this project, you will:

The Assignment

This assignment consists of two phases. Each phase includes implementation of data structures, JUnit tests, and programs manipulating those structures. Phase B also involves collecting some experimental data and completing write-up questions.

Provided Code

You need several files provided in this single zip file:

project2files.zip

Create a new Eclipse project with these files. The files are described below, including how the code is organized. Note the provided code will not compile or run correctly until you complete the first part of Phase A.

We believe the code provided to you is correct (let us know of problems). You will need to add several additional files and make several changes to WordCount.java. Here are brief descriptions of the provided files in roughly the order we suggest understanding them.

DataCount.java
A simple container for an object and an associated count.
DataCounter.java
Like a dictionary, except it maintains a count for each data item (instead of storing an arbitrary value). We do not require the element type E to be "comparable". Instead, constructors for implementations will take function objects of type Comparator and use them to do comparisons. Also notice that a DataCounter provides an iterator SimpleIterator<DataCount<E>>.
SimpleIterator.java
The iterator that a DataCounter must provide. We do not use Java's iterator type because doing so obligates us to certain rules that are difficult to implement properly (if curious, read about "concurrent modification exceptions").
Comparator.java
The type for function objects that do comparisons between pairs of some element type. Constructors of DataCounter implementations should take a Comparator as an argument, as illustrated in BinarySearchTree.
BinarySearchTree.java
An implementation of DataCounter using a (you guessed it) binary search tree. You should not modify this file, but it is a good example of how to do function objects and iterators. The iterators you write will not be as difficult.
WordCount.java
The main file for Phase A. You will need to make several changes to this file. It processes a file and prints out the words from the file in frequency order. More detail is provided in the description below. In Phase B, you will write a different main class.
TestBinarySearchTree.java
A class containing JUnit tests for BinarySearchTree. This is just an example that may or may not help to inspire you while writing test files for the data structures you implement.
GStack.java
Same generic stack interface you saw in Project 1.
FileWordReader.java
Handles input, converting each word into lower-case and removing punctuation.
DataCountStringComparator.java
An implementation of Comparator that orders two DataCount<String> objects in the proper order for Phase A. It requires that you you correctly implement StringComparator, as detailed below.
PriorityQueue.java
Interface your heap will implement. Your heap should use a function object for comparisons, just like the DataCounter implementations.
Hasher.java
Interface for a function object your hashtable implementation will want to take as a parameter.

Phase A Programming

Running the WordCount program should output the frequency of each word in the document, starting with the most frequent words and resolving ties in alphabetical order. For example, output might look like this:

     970     the 
     708     and 
     666     of 
     632     to 
     521     at 
     521     i 
     521     into 
     466     a 
     444     my 
     391     in 
     383     buffalo 
     ...

Note the actual printing code, along with many other parts, are provided. Getting a correct version of the output will require only a few additions. You will then implement different data-counters and sorting routines to see different ways to solve this problem. We next explain all the command-line options you need to support, but we do not recommend that you start by implementing all of these. Instead, go one step at a time, as outlined below, and add command-line options when it is convenient.

Command-Line Arguments

Overall, the WordCount program you submit for Phase A takes arguments as follows:

java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os | -k <number>] <filename>

For example, at the end of phase A, your code should work with command lines like the following:

java WordCount -b -is filename

java WordCount -a -hs filename

You will need to modify Wordcount.java to process these parameters. As provided, it only processes a single parameter for the filename.

Getting Started

For the code to compile and generate the correct output, you need to do the following:

At this point your program should produce the correct output (implicitly using options -b and -is, since you do not yet process command-line parameters).

Adding Other Data-Counter Implementations

Provide two additional implementations of DataCounter<E> as described below. Provide appropriate JUnit tests for each data structure implementation. Use the appropriate implementation based on the first command-line argument.

Adding a Heap and Heapsort

Provide an implementation of PriorityQueue<E> in class FourHeap<E>.

Now provide a different sorting algorithm using a priority queue. This algorithm is called heapsort and works as follows:

Put your algorithm in a static method heapSort in WordCount.java. For full credit, your sorting routine should be generic like insertionSort (the two should have the same arguments). Note: You do not need to implement an in-place heapsort as was discussed in lecture.

Adjust WordCount to use your heapSort method when the second command-line argument is -hs.

Submission

Submit all your new files, named exactly as follows:

Make sure your code is properly documented (recall the Programming Guidelines). Note that the write-up questions are all submitted with Phase B, but some of them refer to work you did in Phase A. You may wish to start on them early.

Phase B Programming

In Phase B, you will provide an additional counter implementation, an additional sorting implementation, and support for printing only the k most-frequent words in the file. You will then write a different main program that computes the similarity of two files. Finally, you will perform some simple experimentation to determine which implementations are fastest. In this phase, we purposely give somewhat less guidance on how to organize your code and implement your algorithms. It is up to you to make good design decisions.

Another Counter and Sorter

Include the two implementations described below, together with appropriate JUnit tests.