Last Updated: April 17, 2010

CSE332 Spring 2010: Project 2 - Shake-n-Bacon


Due Dates and Turn-In

Turn in your assignment here

Preliminaries: Partners and Code

You are encouraged (although not required) to work with a partner of your own choosing for this project. You may also use the message board or the TAs to help find a partner. No more than two students total can be on a team. You may divide the work however you wish, under these conditions:

Test all your code together to make sure that your portions interact properly! Do not attempt to merge your code on the project due date; you will very likely have problems. Also, be aware that except in extreme cases where you notify us in advance of the deadline, team members will receive the same project grade.

You need several files provided to you in this single zip file:

  project2files.zip

Create a new Eclipse project with these files. The files are described in more detail below, including how the code is organized and how to find additional interesting tests. Note that the provided code will not compile or run correctly until you complete the first part of Phase A.

Introduction and Word-Frequency Analysis

You have just been approached by a world-famous UW history professor. He would like you to settle a very old debate: who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside your area of expertise. "Oh, no," chuckles the historian, stroking his snowy white beard. "I need a Computer Scientist! Someone who can efficiently compare any two documents in terms of their word frequency."

Authors tend to use some words more often than others. For example, Shakespeare used "thou" more often than Bacon. The professor believes a "signature" can be found for each author, based on the frequencies of words found in the author's works, and that this signature should be consistent across the works of a particular author but vary greatly between authors. In Phase B, he is tasking you with computing a numeric value that quantifies the "similarity" of two texts. In Phase A, you will work on the simpler and helpful subproblem of computing the frequency of words in one text.

Overall, perhaps your work can be applied to the works attributed to Shakespeare and Bacon to settle the ancient debate. The professor has provided you with copies of Shakespeare's writing (Hamlet) and Bacon's writing (The New Atlantis), which he has painstakingly typed by hand from his antique, leather-bound first editions. Being good scientists, however, you quickly realize that it is impossible to draw strong conclusions from so little data, and asking him to type more books is out of the question! Thus, for Phase B you should download and analyze several more works, as many as you feel are necessary to support your conclusion. Project Gutenberg is a good place to look.

When dealing with document correlations, it is often desirable to work only with the roots of words. That way, "sleeps", "sleeping", and "sleep" are all considered to be the same word. This process is called word stemming, and it is used in search engines. However, proper stemming is difficult. Dropping an 's' from the end of a word seems natural but turns "bus" into "bu". Dropping punctuation seems like a good idea but it equates "it's" and "its". Implementing a decent stemming algorithm (like Porter Stemming) is left as above-and-beyond work.

The code provided to you does not really do stemming, but it does normalize the input by converting all letters to lowercase and removing every character that is not one of the 26 English letters. Hence every string you process will contain just the 26 lowercase English letters, though this doesn't really affect the project.

Learning Objectives

For this project, you will:

The Assignment

This assignment consists of two phases. Each phase involves programming, including writing JUnit tests. Phase B also involves collecting some experimental data and completing write-up questions.


Code Provided To You

The code provided to you is believed to be correct (let us know of any problems), but you will need to add several other files as well as make changes to WordCount.java. As Phase A explains, the code will not compile or run until you make a few small additions and changes. Here are brief descriptions of the provided files, in roughly the order we suggest understanding them.


Phase A Programming

Running the WordCount program should output the frequency of each word in the document, starting with the most frequent words and resolving ties in alphabetical order. For example, output might look like this:

     970     the 
     708     and 
     666     of 
     632     to 
     521     at 
     521     i 
     521     into 
     466     a 
     444     my 
     391     in 
     383     buffalo 
     ...

Note that the actual printing code, along with many other parts, is written for you. Getting a correct version of the output will require only a few additions. You will then implement different data-counters and sorting routines to see different ways to solve this problem. We explain below all the command-line options you need to support, but we recommend you not start by implementing all of them. Instead, go one step at a time, as outlined below, adding command-line options when it is convenient.

Command-Line Arguments

Overall, the WordCount program you turn in for Phase A takes arguments as follows:

java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os ] <filename>
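For example, a run using the -b counter option and insertion sort (-is) on a hypothetical file named hamlet.txt would look like:

     java WordCount -b -is hamlet.txt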

Getting Started

For the code to compile and generate the correct output, you need to do the following:

At this point your program should produce the correct output (options -b and -is).

Adding other Data-Counter Implementations

Provide two additional implementations of DataCounter<E> as described below, and write appropriate JUnit tests for both. Use the appropriate implementation based on the first command-line argument.

Adding a Heap and Heapsort

Provide an implementation of PriorityQueue<E> in class FourHeap<E>. Your data structure should be like the binary heap we studied, except that nodes should have four children instead of two. (Only leaves and at most one other node will have fewer children.) You should use an efficient array-based implementation. Hint: for the node at index i, put the first child at index 4*i - 2. Develop appropriate unit tests.
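To see how the hint works out, here is a minimal sketch of the index arithmetic, assuming the root is stored at index 1 with index 0 unused (the helper names here are ours, not part of the provided code):

     // Index arithmetic for a four-heap with the root at index 1 (index 0 unused).
     // The children of node i occupy indices 4*i - 2 through 4*i + 1.
     private static int firstChild(int i) {
         return 4 * i - 2;
     }

     private static int parent(int i) {
         return (i + 2) / 4;   // integer division inverts firstChild for all four children
     }

As a sanity check, the root (i = 1) has children at indices 2 through 5, and each of those indices maps back to parent 1.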

Now provide a different sorting algorithm that uses a priority queue. This algorithm is called heapsort and works as follows (we will study it a little later in the course): insert each element to be sorted into the priority queue, then remove each element, storing the results in order in the array that should hold the sorted result. (That's it!) Put your algorithm in a static method heapSort in WordCount.java. For full credit, your sorting routine should be generic like insertionSort; in fact, the two methods should have the same arguments.
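A minimal sketch of the idea, assuming the provided PriorityQueue<E> interface offers insert and deleteMin operations and that FourHeap<E> can be constructed from a java.util.Comparator (the actual signatures in the provided code may differ):

     // Sketch only: the PriorityQueue/FourHeap signatures are assumptions.
     public static <E> void heapSort(E[] array, Comparator<? super E> comparator) {
         PriorityQueue<E> heap = new FourHeap<E>(comparator);
         for (E element : array)
             heap.insert(element);          // n insertions, O(log n) each
         for (int i = 0; i < array.length; i++)
             array[i] = heap.deleteMin();   // elements emerge in sorted order
     }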

Adjust WordCount to use your heapSort method when the second command-line argument is -hs.

Turn-In

Turn in all your new files, including any additional Java files you created for testing, and WordCount.java (the only provided file you should modify). Make sure your code is properly documented. Note that the write-up questions can all be turned in with Phase B, but some of them refer to work you did in Phase A, so you may wish to start on them early.


Phase B Programming

In Phase B, you will provide an additional counter implementation, an additional sorting implementation, and support for printing only the k most-frequent words in the file. (See also "Above and Beyond" for excluding words that appear in a second file.) You will then write a different main program that computes how similar two files are. Finally, you will perform some simple experimentation to determine which implementations are fastest. In this phase, we purposely give somewhat less guidance on how to organize your code and implement your algorithms, meaning it is up to you to make good design decisions.

This phase adds an additional optional command-line argument to WordCount as follows:

java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os ] <filename> optional-k

Another Counter and Sorter

Include the two implementations described below and appropriate unit tests.

To use your hashtable for WordCount, you will need to be able to hash strings. Implement your own hashing strategy using charAt and length; do not use Java's hashCode method.
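For instance, one reasonable strategy is a polynomial hash built from only charAt and length, sketched below (the multiplier 31 is a common choice, not a requirement):

     // Sketch of one possible string-hashing strategy, not the required one.
     public static int hashString(String s) {
         int h = 0;
         for (int i = 0; i < s.length(); i++)
             h = 31 * h + s.charAt(i);   // integer overflow just wraps, which is fine
         return h;
     }

Remember to map the result into your table's index range, and handle the case where the value is negative.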

Top k Words

Extend WordCount to take an optional additional integer parameter. When this k is provided, WordCount should print out only the k most frequent words. (If the text has fewer than k distinct words, then print all words' frequencies. If there are ties for the k most frequent words, still resolve them alphabetically so that you print exactly k words.) You should still print the words in the same order: most-frequent first with ties resolved alphabetically.

An easy way to implement this feature would be to sort all the words by frequency and then just print k of them. This approach finds the k most frequent words in time O(n log n), where n is the number of distinct words. You will need to implement this simple approach for comparison purposes in your experimentation (see the write-up questions), but WordCount should be configured to use a more efficient approach that runs in time O(n log k) (assuming k is less than n). Hint: use a priority queue, but never put more than k elements into it. You will need a different comparator than the one you used when implementing heapsort, so that you can efficiently track the k most frequently occurring words seen so far.
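Here is one way the hint might play out, as a hedged sketch: assume a DataCount object pairing a word with its count (the provided code may organize this differently) and a comparator that orders lowest count first, breaking ties so that the alphabetically-last word comes first:

     // Sketch only: keep at most k candidates; the minimum is always the weakest.
     PriorityQueue<DataCount> topK = new FourHeap<DataCount>(weakestFirst);
     int size = 0;
     for (DataCount candidate : counts) {
         topK.insert(candidate);
         if (++size > k) {
             topK.deleteMin();   // evict the weakest of the k+1 candidates
             size--;
         }
     }
     // topK now holds the k most frequent words; extract and reverse for printing.

Each of the n insertions and evictions works on a heap of at most k+1 elements, which gives the O(n log k) bound.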

Document Correlation

How to quantify the similarity between two texts is a large area of study. We will use a simple approach, leaving more advanced models such as Latent Semantic Indexing to extra credit. Implement this simple approach:

Write a class Correlator with a main method that computes the correlation between two documents. Its only output should be a single floating-point number: the correlation computed as described above. Command-line flags should specify which DataCounter to use, as in WordCount:

java Correlator [ -b | -a | -m | -h ] <filename1> <filename2>

Comparing documents can be fun. Feel free to post links to interesting text examples on the course message board.

Experimentation

The write-up questions ask you to design and run some experiments to determine which implementations are faster for various inputs. Answering these questions will require writing additional code to run the experiments and collect timing information. For example, writing code that runs word-counting for various values of k will save you time compared to manually running your program with different values of k.
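For instance, a minimal driver might look like the sketch below, where runTopK is a hypothetical stand-in for however you organize your code:

     // Sketch: time the top-k computation for a range of k values.
     int[] ks = {1, 10, 100, 1000, 10000};
     for (int k : ks) {
         long start = System.nanoTime();
         runTopK("hamlet.txt", k);                       // hypothetical helper
         long millis = (System.nanoTime() - start) / 1000000;
         System.out.println("k=" + k + "\t" + millis + " ms");
     }

Running each measurement several times and averaging will reduce noise from the JIT compiler and garbage collector.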

Turn-In

Turn in all your new files, including any files you modified after Phase A (probably just WordCount.java) and any files used for testing and experimentation.


Write-Up Questions (Part of Phase B)

Turn in a report answering the following questions. Note that the last three questions require writing code, collecting data, and producing results tables and/or graphs, along with relatively long answers, so do not wait until the last minute.

  1. Who is in your group?
  2. What assistance did you receive on this project? Include anyone or anything except the group members, the course staff, and the printed textbook.
  3. How long did the project take? Which parts were most difficult? How could the project be better?
  4. What "above and beyond" projects did you implement? What was interesting or difficult about them? Describe how you implemented them.
  5. How did you design your unit tests? What properties do they test? What do they not test? What boundary cases did you consider?
  6. Why does the iterator for a binary search tree need an explicit stack? If the iterator were re-written for use specifically with AVL trees, how could you avoid having to resize the stack (possibly changing the GStack interface)?
  7. If a DataCounter's iterator returned elements in the order needed for printing the most-frequent words first, then you would not need a PriorityQueue. For each DataCounter implementation, could such an iterator be implemented efficiently? Explain your answers.
  8. For your Hashtable to be correct (not necessarily efficient), what must be true about the arguments to the constructor?
  9. For WordCount, which DataCounter implementation and sorting implementation tend to produce the fastest results for large input texts? How did you determine this? Are there (perhaps contrived) texts that would produce a different answer, especially considering how MoveToFrontList works? Does changing your hashing function affect your results? Describe the experiments you ran to determine your answer and submit experimental results, probably in a table. Relevant experimental set-up should include at least the texts used and how you collected timing information.
  10. For WordCount, design and conduct experiments to determine whether it is faster to use your O(n log k) approach to finding the top k most-frequent words or the simple O(n log n) approach (using the fastest sort you have available). The relative difference in speed surely depends on k -- produce a graph showing the time for the two approaches for various values of k ranging from very small (i.e., 1) to very large (i.e., n). How could you modify your implementation to take advantage of your experimental conclusions?
  11. Using Correlator, does your experimentation suggest that Bacon wrote Shakespeare's plays? We do not need a fancy statistical analysis; this question is intended to be fun and simple. Give a 1-2 paragraph explanation.

Above and Beyond (Part of Phase B)

You may do any or all of the following; pick ones you find interesting. Number 3 is recommended.

  1. Deletion: Extend the DataCounter interface to support deleting items. Extend your implementations and unit tests appropriately. Do not use lazy deletion; that approach is too easy.
  2. Alternate Hashing Strategies: Implement both closed and open addressing and perform experimentation to compare performance. Also design additional hashing functions and determine which affects performance more: the cost of hashing, the collision-avoidance of hashing, or your addressing strategy.
  3. Excluded Words: Adjust WordCount and Correlator to take an additional file that contains words that should be ignored. For example, the file might contain very common words (like the, a, and she) or it might be another text (so WordCount would then report words not appearing in the other text). There are multiple ways to implement this feature: you could skip words-to-be-ignored as you process the input file, or when sorting the elements, or when printing them. Which is fastest and why do you think this is? What data structure did you use to determine efficiently if a word should be ignored and why is this efficient? Note that for top-k, you should still print k words (assuming there are k words that should not be ignored). How does this affect the ways you can implement this skip-some-words feature?
  4. Profiling: Profile your program using hprof. A profiler is a tool that enables a programmer to obtain detailed information about how a program performs, such as the number of times a function is called, or how much of the program's runtime was spent in a particular function call. Compare two tree or two hash table implementations using a profiler. Discuss what performance bottlenecks you found, whether they surprised you, and how you might be able to improve performance by focusing on the bottlenecks.
  5. Introspective Sort: Introspective sort is an unstable quicksort variant that switches to heapsort for inputs that would result in an O(n²) running time for normal quicksort. Thus, it has an average-case and worst-case runtime of O(n log n), but generally runs faster than heapsort, even in the worst case. Implement IntroSort, and give a sample input that would result in a quadratic runtime for normal quicksort (using a median-of-3 partitioning scheme).
  6. Word Stemming: The introduction discussed how it is difficult to remove suffixes, plurals, verb tenses, etc., from words, but it is important when performing text analysis (e.g., for web search engines). One common algorithm is Porter Stemming. Implement this algorithm. Try to do so without looking at the source code posted at the link. If you do look at their source code, cite it properly.
  7. Word Co-Occurrence: A phenomenon even more interesting than word frequency is word co-occurrence. Create a new WordCount that counts the frequency of pairs of words. Your code should insert as a pair any two words which appear within c words of each other, where c is a command-line argument (try 10 for test runs).
  8. Better Correlation Models: Research better ways to quantify the correlation between two text documents and what their advantages/disadvantages are. Describe what you found and implement an algorithm more sophisticated than the simple one in Correlator. Explain how your algorithm affects your conclusions.

Interesting Tidbits

Acknowledgments

A variant of this project was the third and final project in CSE326 for many years. The first word counter appeared during Spring 2000 when 326 was taught by the famous Steve Wolfman. The snowy-bearded historian appeared in Autumn 2001, courtesy of Adrien Treuille. The assignment was subsequently tweaked over the years by Matt Cary, Raj Rao, Ashish Sabharwal, Ruth Anderson, David Bacon, Hal Perkins, and Laura Effinger-Dean. Dan Grossman adapted it to be the second project in CSE332.
