You can turn the project in using the Catalyst Dropbox linked on our course web page.
You are encouraged (although not required) to work with a partner of your own choosing for this project. You may use the course message board to find potential partners. No more than two students total can be on a team. You may divide the work however you wish, under three conditions:
Other logistics:
Test all your code together to make sure that your portions interact properly! Do not attempt to merge your code on the project due date. You will very likely have problems. Also, be aware that except in extreme cases when you notify us in advance of the deadline, team members will receive the same project grade. You may wish to set up an SVN repository to make collaboration easier; if so, contact one of the course staff about getting access.
If you plan on working in pairs, one partner MUST send Sandra (1) the full names of both partners and (2) the CSE email addresses of both partners, by 11pm Tuesday, January 25th at the very latest. (It is strongly recommended that you select a partner much earlier!) Please use this form to send your pair's information.
You need several files provided to you in this single zip file:
Create a new Eclipse project with these files. The files are described in more detail below, including how the code is organized and how to find additional interesting tests. Note that the provided code will not compile or run correctly until you complete the first part of Phase A.
You have just been approached by a world-famous UW history professor. He would like you to settle a very old debate: who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside your area of expertise. "Oh, no," chuckles the historian, stroking his snowy white beard. "I need a Computer Scientist! Someone who can efficiently compare any two documents in terms of their word frequency."
Authors tend to use some words more often than others. For example, Shakespeare used "thou" more often than Bacon. The professor believes a "signature" can be found for each author, based on frequencies of words found in the author's works, and that this signature should be consistent across the works of a particular author but vary greatly between authors. He is tasking you with (in Phase B) computing a numeric value that quantifies the "similarity" of two texts. In Phase A, you will work on the simpler and helpful subproblem of computing the frequency of words in one text.
Overall, perhaps your work can be applied to the works attributed to Shakespeare and Bacon to settle the ancient debate. The professor has provided you with copies of Shakespeare's writing (Hamlet) and Bacon's writing (The New Atlantis), which he has painstakingly typed by hand from his antique, leather-bound first editions. Being good scientists, however, you quickly realize that it is impossible to draw strong conclusions based on so little data, and asking him to type more books is out of the question! Thus, for Phase B you should download and analyze several more works, as many as you feel are necessary to support your conclusion. Project Gutenberg is a good place to look.
When dealing with document correlations, it is often desirable to work only with the roots of words. That way, "sleeps", "sleeping", and "sleep" are all considered to be the same word. This process is called word stemming, and it is used in search engines. However, proper stemming is difficult. Dropping an 's' from the end of a word seems natural but turns "bus" into "bu". Dropping punctuation seems like a good idea, but it equates "it's" and "its". Implementing a decent stemming algorithm (like Porter stemming) is left as above-and-beyond work.
The code provided to you does not really do stemming, but it does normalize the input: each word is converted to lower case and stripped of punctuation and other non-letter characters (this is done by the provided FileWordReader). Hence every string you process will contain just the 26 lowercase English letters, though this does not really affect the project.
For this project, you will implement DataCounter, a variant of the Dictionary ADT, and come to understand the strengths of various DataCounter implementations.
This assignment consists of two phases. Each phase involves programming including JUnit tests. Phase B also involves collecting some experimental data and completing write-up questions.
The code provided to you is believed correct (let us know of any problems), but you will need to add several other files as well as make changes to WordCount.java. As Phase A explains, the code will not compile or run until you make a few small additions and changes. Here are brief descriptions of the provided files, in roughly the order we suggest understanding them.
DataCounter.java
The DataCounter interface (the second interface in the file) is like a dictionary except that it maintains a count for each data item. We do not require the element type E to be "comparable"; instead, constructors for implementations take function objects of type Comparator to do comparisons. Also notice that a DataCounter provides an iterator of type SimpleIterator<DataCount<E>>; the type DataCount is simple and defined in this file. (A rough sketch of how these types fit together appears after the file list.)
SimpleIterator.java
The interface type SimpleIterator is for the iterator that a DataCounter must provide. We do not use Java's iterator type because doing so obligates us to certain rules that are difficult to follow properly (if curious, read about "concurrent modification exceptions").
Comparator.java
The type for function objects that do comparisons over some element type. Have the constructors of your DataCounter implementations take a Comparator as an argument, just as in BinarySearchTree.
BinarySearchTree.java
An implementation of DataCounter using a (you guessed it) binary search tree. You should not modify this file, but it is a good example of how to implement function objects and iterators. (The iterators you will write will not be as difficult.)
WordCount.java
The main file for Phase A. You will need to make several changes to this file. It processes a file and prints out the words in the file in frequency order; more on this in the "Phase A programming" description below. In Phase B, you will write a different main class on your own.
TestBinarySearchTree.java
A class containing JUnit tests for BinarySearchTree. This is just an example that may or may not help inspire you while writing test files for the data structures you implement.
GStack.java
The same generic stack interface you saw in Project 1.
FileWordReader.java
Handles input, converting each word to lower case and stripping punctuation; no need to understand it.
DataCountStringComparator.java
An implementation of Comparator that orders two DataCount<String> objects in the proper order for Phase A, assuming you implement StringComparator correctly (see below).
PriorityQueue.java
The interface your heap will implement. Your heap will use a function object for comparisons, just like the DataCounter implementations.
Hasher.java
The interface for a second function object that your hashtable implementation will take as a parameter (Phase B).
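To make the relationships among these types concrete, here is a rough, non-authoritative sketch of how SimpleIterator, DataCount, and DataCounter might fit together. The authoritative definitions are in the provided files; incCount and getCount are named elsewhere in this handout, while the other member names below are guesses for illustration only.

    // Assumed shape of the iterator (check SimpleIterator.java for the real one).
    interface SimpleIterator<E> {
        boolean hasNext();
        E next();
    }

    // A (data, count) pair; the real DataCount is defined in DataCounter.java.
    class DataCount<E> {
        public final E data;
        public int count;
        public DataCount(E data, int count) { this.data = data; this.count = count; }
    }

    // A dictionary-like ADT that maintains a count per data item.
    interface DataCounter<E> {
        void incCount(E data);                      // insert data or increment its count
        int getCount(E data);                       // 0 if data has never been added
        SimpleIterator<DataCount<E>> getIterator(); // iterate over all (data, count) pairs
    }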
Running the WordCount program should output the frequency of each word in the document, starting with the most frequent words and resolving ties in alphabetical order. For example, output might look like this:
970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 buffalo
...
Note that the actual printing code, along with many other parts, is written for you. Getting a correct version of the output will require only a few additions. You will then implement different data-counters and sorting routines to see different ways to solve this problem. We next explain all the command-line options you need to support, but we recommend you not start by implementing all of these. Instead, go one step at a time, as outlined below, adding command-line options when it is convenient.
Overall, the WordCount program you turn in for Phase A takes arguments as follows:
java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os ] <filename>
The first argument specifies which DataCounter implementation to use: -b for (unbalanced) binary search tree, -a for AVL tree, -m for move-to-front list, or -h for hashtable. Because the hashtable is part of Phase B, for now print an appropriate error message when -h is chosen.
The second argument specifies the sorting routine: -is for insertion sort (given to you, once you provide string comparison), -hs for heapsort, or -os for another sort. Again, the last option is part of Phase B, so for now print an appropriate error message for this option.
For example, at the end of phase A, your code should work with command lines like the following:
java WordCount -b -is filename
java WordCount -a -hs filename
For the code to compile and generate the correct output, you need to do the following:

Provide a GArrayStack class. (The iterator for BinarySearchTree uses it.) You can just copy this file from your Project 1 solution.
Provide a StringComparator class that implements Comparator<String>. This comparator is used by the provided code both for data-counters and for sorting. Because of how the output must be sorted in the case of ties, your implementation should return a negative number when the first argument to compare comes first alphabetically. Do not use any String comparison provided in Java's standard library; the only String methods you should use are length and charAt. (The function-object pattern is illustrated just below.)
Complete getCountsArray in WordCount. (The provided code just fails with an error.) Your code should use the argument's iterator to retrieve all the elements and put them in a new array. That is, the code you write is using (not implementing) a SimpleIterator.
At this point your program should produce the correct output (options -b and -is).
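To illustrate the function-object pattern StringComparator follows, here is a minimal sketch. The single-method interface is an assumed shape of the provided Comparator.java (check the real file), and the comparator shown orders strings by length, which is deliberately not the alphabetical order the assignment requires:

    // Assumed shape of the provided Comparator.java.
    interface Comparator<E> { int compare(E e1, E e2); }

    // Example function object only: orders strings by length, NOT alphabetically.
    class ByLengthComparator implements Comparator<String> {
        public int compare(String s1, String s2) {
            // Negative when s1 is shorter, zero when equal length, positive otherwise.
            return s1.length() - s2.length();
        }
    }

Your StringComparator would follow the same pattern but compare characters via charAt to get alphabetical order.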
Provide two additional implementations of DataCounter<E> as described below. For both, provide appropriate JUnit tests and use the appropriate implementation based on the first command-line argument.
A class AVLTree<E> that implements DataCounter<E> using AVL trees. You only need to provide the methods defined in DataCounter. Your implementation must be a subclass of BinarySearchTree<E> and must use inheritance and calls to superclass methods to avoid unnecessary duplication or copying of functionality. Hints: create a subclass of BSTNode. Because all tree nodes will have to be instances of this new subclass, you will need to use overriding to replace the incCount method.
A class MoveToFrontList<E> that implements DataCounter<E> using a linked list, as follows: whenever an element is accessed (for example via getCount), move it to the front of the list. That means you delete the node from its current list position and make it the first node in the list. (A sketch of this splicing step appears after the next paragraph.)
While this data structure has O(n) worst-case time operations, it does well when some words are much more frequent than others because the frequent ones will end up near the beginning of the list. The write-up questions will explore this a bit further.
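Here is a minimal sketch of the move-to-front splicing step on a singly linked list. The node and field names are illustrative (not from any provided file), and this is not a full DataCounter implementation:

    class MoveToFrontSketch {
        static class Node {
            String data; int count; Node next;
            Node(String data, Node next) { this.data = data; this.count = 1; this.next = next; }
        }

        Node front; // first node of the list, or null when the list is empty

        // Find data; if present, splice its node out and re-link it at the front.
        // Returns the (now-front) node, or null if data is not in the list.
        Node findAndMoveToFront(String data) {
            if (front == null) return null;
            if (front.data.equals(data)) return front; // already at the front
            for (Node prev = front; prev.next != null; prev = prev.next) {
                if (prev.next.data.equals(data)) {
                    Node found = prev.next;
                    prev.next = found.next; // unlink from current position
                    found.next = front;     // re-link at the front
                    front = found;
                    return found;
                }
            }
            return null;
        }
    }

An incCount built on this would bump the count of the returned node, or insert a new node at the front when the search returns null.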
Provide an implementation of PriorityQueue<E> in a class FourHeap<E>. Your data structure should be like the binary heap we studied, except nodes should have four children instead of two. (Only leaves and at most one other node will have fewer children.) You should use an efficient array-based implementation. Hint: for the node at index i, put the first child at index 4*i - 2. Develop appropriate (JUnit) unit tests.
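The hint above implies the root lives at index 1 with array slot 0 unused: the four children of index i occupy consecutive indices 4*i-2 through 4*i+1, and the parent of index j is (j+2)/4 with integer division. A quick sanity check of that arithmetic, under this assumed layout:

    public class FourHeapIndexDemo {
        static int firstChild(int i) { return 4 * i - 2; }
        static int parent(int j)     { return (j + 2) / 4; }

        public static void main(String[] args) {
            System.out.println(firstChild(1));        // 2: children of the root are 2,3,4,5
            for (int j = 2; j <= 5; j++)
                System.out.println(parent(j));        // 1, four times
            System.out.println(firstChild(2) + " " + parent(9)); // "6 2"
        }
    }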
Now provide a different sorting algorithm using a priority queue. This algorithm is called heapsort and works as follows (we will study it a little later in the course): insert each element to be sorted into the priority queue, then remove each element, storing the results in order in the array that should hold the sorted result. (That's it!) Put your algorithm in a static method heapSort in WordCount.java. For full credit your sorting routine should be generic like insertionSort; in fact, the two methods should have the same arguments.
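As a sketch of how short heapsort is, assuming the provided PriorityQueue interface offers insert and deleteMin operations and that FourHeap's constructor takes a Comparator (check the real files for the actual names):

    class HeapSortSketch {
        static <E> void heapSort(E[] array, Comparator<E> cmp) {
            PriorityQueue<E> pq = new FourHeap<E>(cmp); // assumed constructor
            for (E element : array)
                pq.insert(element);          // n inserts, O(log n) each
            for (int i = 0; i < array.length; i++)
                array[i] = pq.deleteMin();   // elements come out smallest-first
        }
    }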
Adjust WordCount to use your heapSort method when the second command-line argument is -hs.
Turn in all your new files (please name them exactly as follows: GArrayStack.java, StringComparator.java, AVLTree.java, MoveToFrontList.java, FourHeap.java), including any additional Java files you created for testing, and WordCount.java (the only provided file you should modify). Make sure your code is properly documented. Note that the write-up questions can all be turned in for Phase B, but some of them refer to work you did in Phase A, so you may wish to start on them early.
(NEW: Files submitted for Phase A will be graded for functional correctness; this will contribute 10% to your overall grade for the project. For Phase B we will ask that you resubmit *all* files associated with the project, and all files will be evaluated for correctness and style at that time. You may modify files you submitted at the Phase A checkpoint and resubmit them at the Phase B final turn-in; they will be re-evaluated at that point.)
In Phase B, you will provide an additional counter implementation, an additional sorting implementation, and support for printing only the k most-frequent words in the file. (See also "Above and Beyond" for excluding words that appear in a second file.) You will then write a different main program that computes how similar two files are. Finally, you will perform some simple experimentation to determine which implementations are fastest. In this phase, we purposely give somewhat less guidance on how to organize your code and implement your algorithms, meaning it is up to you to make good design decisions.
This phase adds an additional optional command-line argument to WordCount as follows:
java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os ] <filename> optional-k
For example, your code should work with command lines like the following:
java WordCount -b -os filename
java WordCount -a -is filename 25
java WordCount -m -hs filename 10
Include the two implementations described below and appropriate JUnit tests.
A class Hashtable<E> that implements DataCounter<E> using a hashtable. You only need to provide the methods defined in DataCounter. You can implement any kind of hashtable discussed in class; the only restriction is that it should not restrict the size of the input domain or the number of inputs (i.e., your hash table must grow). Hint: to make your hashtable generic using the function-object approach, have the constructor take a Comparator<E> and a Hasher<E>. Use this implementation for the -h option.
A second sorting implementation of your choice, used for the -os option. As with insertion sort (provided) and heapsort, your code should be generic.
To use your hashtable for WordCount, you will need to be able to hash strings. Implement your own hashing strategy using charAt and length; that is, do not use Java's hashCode method.
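One common strategy that satisfies the charAt/length restriction is a polynomial hash with a small prime multiplier; this is just an example of an acceptable approach, not a required formula. You would wrap such a function in the Hasher function object mentioned earlier:

    class StringHashSketch {
        static int hash(String s) {
            int h = 0;
            for (int i = 0; i < s.length(); i++)
                h = 31 * h + s.charAt(i); // mix each character into the running hash
            return h;
        }
    }

Note that h can be negative, so compute a table index as ((h % tableSize) + tableSize) % tableSize rather than with Math.abs (which fails for Integer.MIN_VALUE).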
Extend WordCount to take an optional additional integer parameter. When this k is provided, WordCount should print out only the k most frequent words. (If the text has fewer than k distinct words, then print all words' frequencies. If there are ties for the k most frequent words, still resolve them alphabetically so that you print exactly k words.) You should still print the words in the same order: most frequent first, with ties resolved alphabetically.

An easy way to implement this feature would be to sort all the words by frequency and then just print k of them. This approach finds the k most frequent words in time O(n log n), where n is the number of distinct words. You will need to implement this simple approach for comparison purposes in your experimentation (see the write-up questions), but WordCount should be configured to use a more efficient approach that runs in time O(n log k) (assuming k is less than n). Hint: use a priority queue, but never put more than k elements into it. You will need a different comparator than you used when implementing heapsort so that you can efficiently track the k most frequently occurring words seen so far.
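Here is a sketch of that bounded-heap idea. It assumes the PriorityQueue operation names used earlier (insert, deleteMin) plus a size query, all of which should be checked against the provided interface; the comparator must treat the least deserving word (lowest count, alphabetical ties reversed) as the minimum so that deleteMin always evicts it:

    class TopKSketch {
        // result must have length min(k, number of distinct words).
        static <E> void fillTopK(SimpleIterator<DataCount<E>> iter, int k,
                                 Comparator<DataCount<E>> leastDeservingFirst,
                                 DataCount<E>[] result) {
            PriorityQueue<DataCount<E>> pq = new FourHeap<DataCount<E>>(leastDeservingFirst);
            while (iter.hasNext()) {
                pq.insert(iter.next());
                if (pq.size() > k)       // never hold more than k elements,
                    pq.deleteMin();      // so each operation costs only O(log k)
            }
            // deleteMin yields survivors from least to most deserving, so fill
            // the result array from the back to get most-frequent-first order.
            for (int i = result.length - 1; i >= 0; i--)
                result[i] = pq.deleteMin();
        }
    }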
How to quantify the similarity between two texts is a large area of study. We will use a simple approach, leaving more advanced models such as Latent Semantic Indexing to extra credit. Implement this simple approach in a class Correlator with a main method that computes the correlation between two documents. Its only output should be a single floating-point number: the correlation computed as described above. Command-line flags should specify which DataCounter to use, as in WordCount:

java Correlator [ -b | -a | -m | -h ] <filename1> <filename2>
For example, your correlator should work with command lines like the following:
java Correlator -b filename1 filename2
java Correlator -h filename1 filename2
Comparing documents can be fun. Feel free to post links to interesting text examples on the course message board.
The write-up questions ask you to design and run some experiments to determine which implementations are faster for various inputs. Answering these questions will require writing additional code to run the experiments and collect timing information. For example, writing code that runs word-counting for various values of k will save you time compared to manually running your program with different values of k.
(NEW: For Phase B please turn in ALL of the files associated with the project, including all files from Phase A; you are free to modify files you submitted in Phase A if you wish. *All* files will be graded for correctness and style at the Phase B turn-in.) Turn in all your new files (this includes at least Hashtable.java and Correlator.java), including any files you modified after Phase A (probably just WordCount.java) and any files used for testing and experimentation.
Turn in a report answering the following questions. Note that the last three questions require writing code, collecting data, and producing results tables and/or graphs, along with relatively long answers, so do not wait until the last minute!
1. GStack interface)?

2. If a DataCounter's iterator returned elements in the order needed for printing the most frequent words first, then you would not need a PriorityQueue. For each DataCounter implementation, could such an iterator be implemented efficiently? Explain your answers.
3. For your Hashtable to be correct (not necessarily efficient), what must be true about the arguments to the constructor?
4. For WordCount, which DataCounter implementation and sorting implementation tend to produce the fastest results for large input texts? How did you determine this? Are there (perhaps contrived) texts that would produce a different answer, especially considering how MoveToFrontList works? Does changing your hashing function affect your results? Describe the experiments you ran to determine your answers and submit experimental results, probably in a table. Relevant experimental set-up should include at least the texts used and how you collected timing information.
5. For WordCount, design and conduct experiments to determine whether it is faster to use your O(n log k) approach to finding the top k most frequent words or the simple O(n log n) approach (using the fastest sort you have available). The relative difference in speed surely depends on k; produce a graph showing the time for the two approaches for various values of k, ranging from very small (i.e., 1) to very large (i.e., n). How could you modify your implementation to take advantage of your experimental conclusions?
6. Using Correlator, does your experimentation suggest that Bacon wrote Shakespeare's plays? We do not need a fancy statistical analysis; this question is intended to be fun and simple. Give a 1-2 paragraph explanation.
You may do any or all of the following; pick the ones you find interesting. Number 3 is recommended. Submit your above-and-beyond files separately in a zip file.
1. Extend WordCount and Correlator to take an additional file that contains words that should be ignored. For example, the file might contain very common words (like "the", "a", and "she"), or it might be another text (so WordCount would then report words not appearing in the other text). There are multiple ways to implement this feature: you could skip words-to-be-ignored as you process the input file, when sorting the elements, or when printing them. Which is fastest, and why do you think this is? What data structure did you use to determine efficiently whether a word should be ignored, and why is this efficient? Note that for top-k, you should still print k words (assuming there are at least k words that should not be ignored). How does this affect the ways you can implement this skip-some-words feature?
2. Write a variant of WordCount that counts the frequency of pairs of words. Your code should insert as a pair any two words that appear within c words of each other, where c is a command-line argument (try 10 for test runs). (A sliding-window sketch for generating such pairs appears after this list.)
3. Implement a more sophisticated correlation model in Correlator, such as the Latent Semantic Indexing mentioned above. Explain how your algorithm affects your conclusions.
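For idea 2, one way to generate the pairs is a sliding window over the word stream: keep the last c words and pair each incoming word with everything in the window. This sketch uses the project's DataCounter (incCount is named in this handout); the pair representation, joining the two words in order of appearance, is a design choice, not something the assignment dictates:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class PairWindowSketch {
        static void countPairs(Iterable<String> words, int c, DataCounter<String> counter) {
            Deque<String> window = new ArrayDeque<String>(); // at most the last c words
            for (String word : words) {
                for (String earlier : window)
                    counter.incCount(earlier + " " + word); // pair within distance c
                window.addLast(word);
                if (window.size() > c)
                    window.removeFirst(); // slide: drop the oldest word
            }
        }
    }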
A variant of this project was the third and final project in CSE326 for many years. The first word counter appeared during Spring 2000 when 326 was taught by the famous Steve Wolfman. The snowy-bearded historian appeared in Autumn 2001, courtesy of Adrien Treuille. The assignment was subsequently tweaked over the years by Matt Cary, Raj Rao, Ashish Sabharwal, Ruth Anderson, David Bacon, Hal Perkins, and Laura Effinger-Dean. Dan Grossman adapted it to be the second project in CSE332.