Last Updated: October 10, 2013
Due Dates and Turn-In
- Find a partner: As soon as possible; you must
notify us before Wednesday, October 16 if you plan to work with a partner
- Phase A: Friday October 25, 11PM (extended from the original due date of Wednesday October 23)
- Phase B: Wednesday November 06, 11PM
Use the course dropbox
to submit your assignment electronically.
Working with a Partner
You are strongly encouraged, but not required, to work with a partner
of your own choosing for this project. You may work with somebody you
already know or use the course message board to find a
partner. No more than two students total may form a team. You may
divide the work however you wish, under three conditions:
- You document each team member's effort in your write-up.
- You work together and make sure you both understand the answers to all write-up questions.
- You understand how your partner's code is structured and how it works.
Other logistics:
- Turn-in only one submission per team (i.e., only one set of
Phase A code, only one set of Phase B code, and only one
Writeup). It would help the course staff if the same partner
submits all of these via their Catalyst account, but this is not
required.
- Both names should appear in all files, although it is fine to attribute
one partner as a primary author on a particular piece of code.
Test all of your code together to be sure it properly integrates.
Start this early, and do not attempt to merge your code at the last
minute as you will very likely experience problems with integration.
You may wish to set up an SVN repository to make collaboration easier.
If so, contact the course staff regarding project space.
Team members will receive the same project grade unless there is an extreme circumstance
and you notify us in advance of the deadline.
If you plan to work with a partner, one partner MUST send HyeIn an email using this link. In your email, provide the following information for BOTH partners: full name, UWNetID, CSE Email address.
Introduction and Word-Frequency Analysis
You have just been approached by a world famous UW history professor.
He would like you to settle a very old debate:
who wrote Shakespeare's plays, Shakespeare
or Sir Francis Bacon?
You protest that this question is surely outside your area of expertise. "Oh, no,"
chuckles the historian, stroking his snowy white beard.
"I need a Computer Scientist! Someone who can efficiently compare any two
documents in terms of their word frequency."
Authors tend to use some words more often than others. For example,
Shakespeare used "thou" more often than Bacon. The professor
believes a "signature" can be found for each author, based on
frequencies of words found in the author's works, and that this signature
should be consistent across the works of a particular author but vary
greatly between authors.
In Phase A, you will work on the simpler but helpful subproblem of computing the frequency of words in a text.
In Phase B, you will compute a numeric value that quantifies the "similarity" of two texts.
Perhaps your work can be applied to the works attributed to
Shakespeare and Bacon to settle the ancient debate. The professor has
provided you with copies of Shakespeare's writing
(Hamlet) and Bacon's writing
(The New Atlantis), which he has
painstakingly typed by hand from his antique, leather-bound
first-editions. Being good scientists, however, you quickly realize
that it is impossible to draw strong conclusions based on so little
data, and asking him to type more books is out of the question!
Thus, for Phase B you should download and analyze several more works,
as many as you feel are necessary to support your conclusion.
Project Gutenberg is a good place to look.
When working with document correlations, it is often desirable to
work only with the roots of words. In this way, "sleeps",
"sleeping", and "sleep" are all considered to be
the same word. This process is called word stemming, and is
used in search engines and many other contexts. However, proper stemming is
difficult. Dropping an 's' from the end of a word seems natural but
turns "bus" into "bu". Dropping punctuation
seems like a good idea but it equates "it's" and
"its". Implementing a decent stemming algorithm
(such as Porter Stemming)
is above and beyond work for this project.
The code provided to you does not really do stemming, but it does normalize the input as follows:
- It converts all letters to lower-case, so "An" and "an" are the same word.
- It removes all punctuation from words, despite this being occasionally wrong.
Hence every string you process will contain just the 26 lowercase English letters
(although this doesn't deeply impact the project).
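For concreteness, the normalization just described might be sketched as follows; the class and method names are illustrative only, not those of the provided FileWordReader:

```java
// Illustrative sketch of the normalization described above; the provided
// FileWordReader may be implemented differently.
public class NormalizeSketch {
    // Lower-cases a token and strips everything but the 26 English letters.
    public static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = Character.toLowerCase(token.charAt(i));
            if (c >= 'a' && c <= 'z') {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

For example, "An" becomes "an" and "it's" becomes "its", illustrating why this normalization is occasionally wrong.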
Learning Objectives
For this project, you will:
- Be introduced to DataCounter, a variant of the Dictionary
ADT, and understand the strengths of various DataCounter implementations
- Gain proficiency with heaps, AVL trees, and hashtables
- Implement and use two idioms related to writing generic and reusable code: function objects and iterators
- Implement sorting algorithms
- Gain experience with unit testing and the JUnit tool
- Gain preliminary experience with timing code and evaluating experimental data
The Assignment
This assignment consists of two phases.
Each phase includes implementation of data structures, JUnit tests, and programs manipulating those structures.
Phase B also involves collecting some experimental data and completing write-up questions.
A large part of your work will involve using the WordCount
program. Running the WordCount program should output the
frequency of each word in the document, starting with the most
frequent words and resolving ties in alphabetical order. For example,
output might look like this:
970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 buffalo
...
Note that the actual printing code, along with many other parts, is
provided. You will implement different data-counters and sorting
routines to see different ways to solve this problem. We will explain
all the command-line options you need to support, but we do not
recommend that you start by implementing all of these. Instead, go
one step at a time, and add command-line options when it is
convenient. Overall, the WordCount program takes arguments
as follows:
java WordCount [ -b | -a | -m | -h ] [ -is | -hs | -os | -k <number>] <filename>
- The first argument is which DataCounter implementation to use:
  -b for unbalanced binary search tree,
  -a for AVL tree,
  -m for move-to-front list, or
  -h for hashtable.
  Because the hashtable is part of Phase B, your Phase A submission can ignore this
  parameter or print an appropriate error message.
- The second argument is which sorting method to use:
  -is for insertion sort (given to you, once you provide string-comparison),
  -hs for heapsort,
  -os for other, or
  -k followed by a number for printing only the top-k words rather than sorting all of them.
  Because -os and -k are part of Phase B, your Phase A submission can ignore these
  parameters or print an appropriate error message.
- The third argument is the input file to process.
- Your WordCount should work as specified when given correct input
  parameters in the order listed above. If given incorrect or incomplete
  parameters, its behavior is unspecified. (So we will not expect your
  code to check for and handle all other combinations of parameters.)
For example, at the end of phase A, your code should work with command lines like the following:
java WordCount -b -is filename
java WordCount -a -hs filename
You will need to modify WordCount.java to process these parameters.
As provided, it only processes a single parameter for the filename.
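As a rough sketch, and assuming well-formed arguments, the flag handling might look like the following; the dispatch to actual data structures is elided, and all names here are illustrative:

```java
// Hypothetical sketch of flag handling for WordCount's main; the actual
// dispatch to data-counter and sorting implementations is elided.
public class ArgSketch {
    public static String describe(String[] args) {
        if (args.length < 3) {
            return "usage: WordCount [-b|-a|-m|-h] [-is|-hs|-os|-k <n>] <filename>";
        }
        String counter = args[0];  // -b, -a, -m, or -h
        String sorter  = args[1];  // -is, -hs, -os, or -k
        // -k consumes an extra argument, shifting the filename's position.
        int fileIndex = sorter.equals("-k") ? 3 : 2;
        if (args.length <= fileIndex) {
            return "missing filename";
        }
        return counter + " " + sorter + " " + args[fileIndex];
    }
}
```

Note how the -k option shifts the position of the filename, which is the one irregularity in the argument grammar.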
Provided Code
Download the project to get started:
project2.zip
Unzip and place the project2 folder in your workspace folder. From Eclipse, import project2
(right-click in the Package Explorer -> Import -> General -> Existing Projects into Workspace -> select project2 as the root directory).
The files are described below, including how the code is organized.
Note the provided code will not compile or run correctly until you complete the first part ("1. Getting Started") of Phase A.
We believe the code provided to you is correct (let us know of
problems). You will need to implement classes
in phaseA, phaseB, testA and testB
packages and make several changes to WordCount.java. Here
are brief descriptions of the main provided files in roughly
the order we suggest understanding them.
- DataCount.java
- A simple container for an object and an associated count.
- DataCounter.java
- Like a dictionary, except it maintains a count for each data item (instead of storing an arbitrary value).
We do not require the element type E to be "comparable".
Instead, constructors for implementations will take function objects of type Comparator
and use them to do comparisons. Also notice that a DataCounter provides an
iterator SimpleIterator<DataCount<E>>.
- SimpleIterator.java
- The iterator that a DataCounter must provide. We do not use Java's iterator
type because doing so obligates us to certain rules that are difficult
to implement properly (if curious, read about "concurrent modification
exceptions").
- Comparator.java
- The type for function objects that do comparisons between pairs of some element type.
Constructors of DataCounter implementations should take a Comparator
as an argument, as illustrated in BinarySearchTree.
- BinarySearchTree.java
- An implementation of DataCounter using a (you guessed it)
binary search tree. You should not modify this file, but it is a good
example of how to do function objects and iterators. The iterators
you write will not be as difficult.
- WordCount.java
- The primary file for Phase A. You will need to make several
changes to this file. It processes a text file and prints out the words
from the file in frequency order. More detail is provided in the
description below.
- TestBinarySearchTree.java
- A class containing JUnit tests for BinarySearchTree. This is
just an example that may or may not help to inspire you while writing
test files for the data structures you implement.
- GStack.java
- Same generic stack interface you saw in Project 1.
- FileWordReader.java
- Handles input, converting each word into lower-case and removing punctuation.
- DataCountStringComparator.java
- An implementation of Comparator that orders two DataCount<String>
objects in the proper order for Phase A.
It requires that you correctly implement StringComparator, as detailed below.
- PriorityQueue.java
- Interface your heap will implement. Your heap should use a function object for
comparisons, just like the DataCounter implementations.
- Hasher.java
- Interface for a function object your hashtable implementation will want to take as a parameter.
Note that you have been provided several other files in addition to
the ones listed above. These other files mostly contain stubs
that you must fill in, with instructions in the comments inside
those files. Please do NOT modify the structure that has been
provided in terms of packages, filenames, etc.
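To make the function-object and iterator idioms concrete, here is a minimal sketch using local stand-ins for the provided interfaces; the real Comparator and SimpleIterator in the starter code may declare slightly different signatures:

```java
// Minimal local approximations of the provided interfaces, for illustration
// only; see Comparator.java and SimpleIterator.java for the real versions.
interface Comparator<E> {
    int compare(E a, E b);
}

interface SimpleIterator<E> {
    boolean hasNext();
    E next();
}

public class FunctionObjectSketch {
    // A function object ordering integers ascending; a DataCounter
    // implementation would receive something like this in its constructor.
    // (Subtraction is fine for small ints; a robust version would compare
    // explicitly to avoid overflow.)
    public static final Comparator<Integer> ASCENDING = new Comparator<Integer>() {
        public int compare(Integer a, Integer b) {
            return a - b;
        }
    };

    // Typical client loop over a SimpleIterator: sum all elements.
    public static int sum(SimpleIterator<Integer> it) {
        int total = 0;
        while (it.hasNext()) {
            total += it.next();
        }
        return total;
    }
}
```

The key idea: comparison logic travels as an object passed to a constructor, so one data structure works for any element type.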
Phase A Programming
1. Getting Started
For the code to compile and generate the correct output, you need to
implement the following:
- GArrayStack: It is needed by the iterator for BinarySearchTree. You can probably just use your code from Project 1.
- StringComparator: It is used by the provided code for both data-counters and sorting.
  Because of how the output must be sorted in the case of ties, your
  implementation should return a negative number when the first argument
  to compare comes first alphabetically. Do not use any String
  comparison provided in Java's standard library; the only String
  methods you should use are length and charAt.
- Implement the static method getCountsArray in WordCount.
  The provided code returns with an error.
  Your code should use the argument's iterator to retrieve all the
  elements and put them in a new array. The code you write
  is using (not implementing) a SimpleIterator.
  If you have trouble with casting, take another look at these notes on generic arrays.
At this point your program should produce the correct output (implicitly using options -b
and -is, since you do not yet process command-line parameters).
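A sketch of the comparison logic StringComparator needs, using only charAt and length as required; your actual class must implement the provided Comparator interface, which this standalone sketch does not:

```java
// Sketch of string comparison using only charAt and length, per the
// restriction above; returns a negative number when a comes first
// alphabetically, zero when equal, positive otherwise.
public class StringComparatorSketch {
    public static int compare(String a, String b) {
        int shorter = a.length() < b.length() ? a.length() : b.length();
        for (int i = 0; i < shorter; i++) {
            int diff = a.charAt(i) - b.charAt(i);
            if (diff != 0) {
                return diff;   // first differing character decides the order
            }
        }
        // One string is a prefix of the other: the prefix comes first
        // ("art" before "artist").
        return a.length() - b.length();
    }
}
```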
2. Adding Other Data-Counter Implementations
Provide two additional implementations of DataCounter as
described below. You should provide the methods defined
in DataCounter. Follow the instructions & hints found inside
of each file listed below.
- AVLTree: A self-balancing search tree, a subclass
  of BinarySearchTree.
- MoveToFrontList: A type of linked list where new items
  are inserted at the front of the list, and an existing item gets
  moved to the front of the list whenever it is referenced. Although
  it has O(n) worst-case time operations, it does well when some
  words are much more frequent than others (because the frequent
  ones will end up near the beginning of the list).
Adjust WordCount to use your new DataCounter
implementations with the appropriate command-line
arguments: -a for AVL tree, and -m for
MoveToFrontList. Note that the write-up questions are all submitted
with Phase B, but some of them refer to work you did in Phase A. You
may wish to start on them early.
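The move-to-front idea can be sketched as follows over plain strings; the real MoveToFrontList must implement the generic DataCounter interface, which this simplified sketch omits:

```java
// Simplified move-to-front list over strings, to illustrate the idea only;
// the real MoveToFrontList must implement the generic DataCounter interface.
public class MtfSketch {
    private static class Node {
        String data; int count; Node next;
        Node(String d, Node n) { data = d; count = 1; next = n; }
    }
    private Node front;

    // Increment data's count, moving its node to the front; insert at the
    // front if absent. Worst case O(n), but frequent words stay near the front.
    public void incCount(String data) {
        if (front != null && front.data.equals(data)) {
            front.count++;
            return;
        }
        Node prev = front;
        while (prev != null && prev.next != null) {
            if (prev.next.data.equals(data)) {
                Node found = prev.next;
                prev.next = found.next;   // unlink from current position
                found.next = front;       // splice onto the front
                front = found;
                front.count++;
                return;
            }
            prev = prev.next;
        }
        front = new Node(data, front);    // new word: insert at the front
    }

    public int getCount(String data) {
        for (Node n = front; n != null; n = n.next) {
            if (n.data.equals(data)) return n.count;
        }
        return 0;
    }

    public String frontWord() { return front == null ? null : front.data; }
}
```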
3. Adding a sorting algorithm (heapSort)
Now provide a different sorting algorithm called heapSort.
heapSort needs to use a priority queue, so you will also need
to provide an implementation of one of those as well. In particular,
you should provide an implementation of PriorityQueue in
class FourHeap as follows:
- FourHeap: Implement FourHeap, a priority queue implemented as a
  heap in which each node has up to four children. (Hint: Complete
  written homework #2 before attempting this.) See the provided FourHeap
  and PriorityQueue files for more details.
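One common array layout for a four-heap places the root at index 0, giving the index arithmetic below; this layout is an assumption for illustration, not a requirement of the assignment:

```java
// Index arithmetic for an array-based four-heap with the root at index 0.
// Node i's children occupy indices 4i+1 through 4i+4, so its parent is
// at (i-1)/4 (integer division).
public class FourHeapIndexSketch {
    public static int parent(int i) { return (i - 1) / 4; }
    public static int firstChild(int i) { return 4 * i + 1; }
    public static int lastChild(int i) { return 4 * i + 4; }
}
```

Percolate-up and percolate-down then work exactly as in a binary heap, except percolate-down must scan up to four children to find the smallest.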
Once you have a working PriorityQueue, you can use it to
implement heapSort. heapSort works as follows:
- Insert each element to be sorted into the priority queue.
- Then remove each element, storing the results in order in the
array that should hold the sorted result.
Put your algorithm in a static method called heapSort
in WordCount.java. For full credit, your sorting routine
should be generic like insertionSort (the two should have the
same arguments). Note: You do not need to implement
an in-place heapsort as was discussed in lecture.
Adjust WordCount to use your heapSort method when
the second command-line argument is -hs.
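The two-step algorithm above can be sketched with java.util.PriorityQueue standing in for your heap; your real heapSort must instead use your FourHeap and be generic over a Comparator, matching insertionSort's signature:

```java
import java.util.PriorityQueue;

// Shape of the heapSort algorithm, with java.util.PriorityQueue as a
// stand-in for your FourHeap; the real version must be generic.
public class HeapSortSketch {
    public static void heapSort(int[] array) {
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>();
        for (int x : array) {
            pq.add(x);            // step 1: insert every element
        }
        for (int i = 0; i < array.length; i++) {
            array[i] = pq.poll(); // step 2: remove them in sorted order
        }
    }
}
```

Each insert and removal costs O(log n), so the whole sort is O(n log n).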
4. JUnit Tests
Implement test files in package testA: TestAVLTree, TestMoveToFrontList, TestFourHeap. TestBinarySearchTree is already implemented.
Phase B Programming
In Phase B, you will provide an additional DataCounter
implementation, an additional sorting implementation, and support for
printing only the k most-frequent words in the file. You will
then write a different main program that computes the similarity of
two files. Finally, you will perform some simple experimentation to
determine which implementations are fastest. In this phase, we
purposely give somewhat less guidance on how to organize your code and
implement your algorithms. It is up to you to make good design
decisions.
1. Another Counter and Sorter
You need to provide only the methods defined
in DataCounter. Follow the instructions & hints at the top of
each provided file.
- HashTable: You may implement any kind of hashtable
  discussed in class. For more details see the comments
  inside HashTable.java.
- StringHasher: To use your hashtable for WordCount,
  you will need to be able to hash strings. Implement your own
  hashing strategy using charAt and length.
  Do not use Java's hashCode method.
- Sorting: Implement either quicksort or mergesort and use
  your implementation with the -os option. As
  with insertionSort and heapSort, your sorting
  code should be generic.
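One plausible StringHasher strategy is a polynomial hash built only from charAt and length; the multiplier 31 is a conventional but arbitrary choice, not mandated by the assignment:

```java
// A possible string-hashing strategy using only charAt and length, as the
// instructions require. The multiplier 31 is a conventional choice.
public class StringHashSketch {
    public static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);  // may overflow; harmless for hashing
        }
        return h;
    }
}
```

Multiplying by 31 at each step makes character position matter, so anagrams like "ab" and "ba" hash differently.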
2. JUnit Tests
Implement test files in package
testB: TestHashTable, TestOtherSort.
3. Top k Words
Extend WordCount such that if the second argument
is -k, then the third argument is a number (and the fourth
argument is the filename). When the -k option is used,
WordCount should print out only the k most
frequent words. If the text has fewer than k distinct words,
then print all word frequencies. If there are ties for the k
most frequent words, resolve them alphabetically so that you
still print exactly k words. You should still print the words in
the same order: most-frequent first with ties resolved alphabetically.
An easy way to implement this feature would be to sort all the words
by frequency and then just print k of them. This approach
finds the k most frequent words in time O(n log n),
where n is the number of distinct words. You will need to
implement this simple approach for comparison purposes in your
experimentation (see the write-up questions), but WordCount
should be configured to use a
more efficient approach that runs in time O(n log k)
(assuming k is less than n).
Helpful Hints:
- Use a priority queue, but never put more than k elements into it.
- Efficiently tracking the k most-frequently occurring words will require
  a different comparator than the one you used when implementing heapSort.
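The bounded-heap idea can be sketched over plain counts; your real version works on DataCount objects and must handle the alphabetical tie-breaking, which this sketch omits:

```java
import java.util.PriorityQueue;

// Sketch of the O(n log k) idea: keep a min-heap of at most k elements,
// evicting the smallest whenever the heap grows past k. The survivors
// are the k largest counts.
public class TopKSketch {
    public static int[] topK(int[] counts, int k) {
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>();  // min-heap
        for (int c : counts) {
            pq.add(c);
            if (pq.size() > k) {
                pq.poll();  // drop the current smallest, keeping only k
            }
        }
        int[] result = new int[pq.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = pq.poll();  // emerges smallest-first; fill backwards
        }
        return result;
    }
}
```

Because the heap never exceeds k elements, each of the n insertions costs O(log k) rather than O(log n).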
4. Document Correlation
How to quantify the similarity between two texts is a large area of study.
We will use a simple approach, leaving more advanced models such as
Latent Semantic Indexing
to Above and Beyond. Implement this simple approach:
- Calculate word counts for the two documents and use each
  document's length to create normalized frequencies, so that
  frequencies can be meaningfully compared between documents of
  different lengths (i.e., use frequency percentages or fractions).
- Ignore words whose frequencies are too high or too low to be
  useful. A good starting point is to remove words with normalized
  frequencies above 0.01 (1%) or below 0.0001 (0.01%) in both documents.
  Feel free to play with these percentages when determining who wrote
  Shakespeare's plays, but for grading purposes report results and submit code using these percentages.
- For every word that occurs in both documents, take the
  difference between the normalized frequencies, square that difference,
  and add the result to a running sum. This should be done using a single iterator.
- The final value of this running sum will be your difference
  metric. This metric corresponds to the square of the Euclidean
  distance between the two vectors in the space of words shared by the
  documents. Note that this metric ignores any word that does not appear
  in both documents, which is probably the biggest weakness of this metric.
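The steps above can be sketched as follows, using java.util.HashMap for clarity even though your Correlator must use your own DataCounter implementations. Note this sketch filters a word when either document's frequency falls outside the band, which is one reading of the rule above:

```java
import java.util.Map;

// Sketch of the difference metric, with java.util.Map standing in for
// your DataCounter; wordCounts maps word -> raw count and totalWords is
// the document's length in words.
public class CorrelationSketch {
    public static double correlate(Map<String, Integer> countsA, int totalA,
                                   Map<String, Integer> countsB, int totalB) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : countsA.entrySet()) {
            Integer other = countsB.get(e.getKey());
            if (other == null) continue;               // ignore non-shared words
            double fa = (double) e.getValue() / totalA; // normalized frequency
            double fb = (double) other / totalB;
            // Skip words whose frequency is too high or too low to be useful.
            if (fa > 0.01 || fa < 0.0001) continue;
            if (fb > 0.01 || fb < 0.0001) continue;
            double diff = fa - fb;
            sum += diff * diff;                         // squared difference
        }
        return sum;
    }
}
```

The result is the squared Euclidean distance over the shared-word frequencies described above.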
Fill in the class Correlator with a main
method that computes the correlation between two documents. Its only
output should be a single floating-point number which is the
correlation computed as described above. Command-line flags should
specify which DataCounter to use, as in WordCount:
java Correlator [ -b | -a | -m | -h ] <filename1> <filename2>
For example, your correlator should work with command lines like the following:
java Correlator -b filename1 filename2
java Correlator -h filename1 filename2
Comparing documents can be fun. Feel free to post links to interesting text examples on the course message board.
One implementation comparing Hamlet to The New Atlantis computes a similarity of 5.657273669233966E-4. An answer close to this suggests you are probably on the right track.
5. Experimentation (for write-up)
The write-up questions ask you to design and run some experiments
to determine which implementations are faster for various inputs.
Answering these questions will require writing additional code to run
the experiments and collect timing information. For example, writing
code that runs word-counting for various values of k will be
much easier than manually running your program with different values of k.
Write-Up Questions (Due with Phase B)
Submit a README.pdf file, answering the questions in this template README file:
(README.docx,
README.pdf)
Note that the final three questions require writing code, collecting
data, and producing results tables and/or graphs together with
relatively long answers. So do not wait until the last minute.
Insert tables/graphs in README.pdf as appropriate, and you
can copy/paste your timing code at the end of your README.pdf
file (Appendix) if you want. Formatting is not a component of your
grade, as long as the report is clear and readable.
Above and Beyond (Due with Phase B)
You may do any or all of the following; pick ones you find interesting.
Submit your 'above and beyond' files separately in a zip file.
- Alternate Hashing Strategies: Implement both closed and open addressing and perform experiments to compare
  their performance. Also design additional hashing functions and determine which affects performance more: the cost of hashing,
  the collision-avoidance of the hash function, or your addressing strategy.
- Excluded Words: Adjust WordCount and Correlator to take an additional file that contains
words that should be ignored. For example, the file might contain very common words
(like the, a, and she) or it might be another text (so WordCount
would then report words not appearing in the other text). There are
multiple ways to implement this feature: you could skip words-to-be-ignored as you process the input file, or when
sorting the elements, or when printing them. Which is fastest and why do you think this is? What data structure did you use
to determine efficiently if a word should be ignored and why is this efficient?
Note that for top-k, you should still print k words (assuming there are k words that
should not be ignored). How does this affect the ways you can implement this skip-some-words feature?
- Introspective Sort: Introspective sort is an unstable quicksort variant that switches
  to heapsort for inputs that would result in an O(n^2) running time for
  normal quicksort. Thus, it has an average-case and worst-case
  runtime of O(n log n), but generally runs faster than
  heapsort even in the worst case. Implement IntrospectiveSort, and
  give a sample input that would result in a quadratic runtime for
  normal quicksort (using a median-of-3 partitioning scheme).
- Word Stemming: The introduction discussed how it is difficult to remove suffixes, plurals, verb tenses, etc.,
from words, but it is important when performing text analysis (e.g., for web search engines). One common algorithm is Porter Stemming. Implement this algorithm. Try to do so without
looking at the source code posted at the link. If you do look at their source code, cite it properly.
- Word Co-Occurrence: A phenomenon even more interesting than word frequency is word co-occurrence.
Create a new WordCount
that counts the frequency of pairs of words. Your code should insert as a pair any
two words which appear within c words of each other, where c is a command-line argument (try 10 for test runs).
- Better Correlation Models: Research better ways to quantify the correlation between two text documents and what
their advantages/disadvantages are. Describe what you found and implement an algorithm more sophisticated than the simple
one in Correlator. Explain how your algorithm affects your conclusions.
Turn-in Information
- You are allowed to add methods, classes and packages, but you
should not change the names & interfaces of the given packages,
classes and methods. Also do not move the given classes to other
packages. (Adding is ok, deleting & modifying not allowed)
- You should not edit the files in the providedCode
folder.
- Do not change the output (printing) format of WordCount and
Correlator (used for grading).
- Do not override the toString method
of DataCounter (used for grading).
- Make sure your code is properly documented (recall the Programming Guidelines).
Phase A:
- Zip the src folder and turn in src.zip
- Please make sure that this src.zip decompresses its contents into a folder called src, which contains the folders
main, phaseA, phaseB, providedCode, testA, and testB.
- It is fine to have edited Phase B files at this time; we are only
grading classes in the phaseA and testA packages. Just
be sure everything you turn in compiles.
Phase B:
- Zip the src folder and turn in src.zip
- Turn in README.pdf
- Turn in extra.zip (if you did any extra credit)
- Please make sure that this src.zip decompresses its
contents into a folder called src, which contains the
folders main, phaseA, phaseB, providedCode, testA, and testB.
- You should fix bugs and address any design weaknesses in your
Phase A code at this time as well; it will be more thoroughly
graded after the Phase B turn-in.
- All graphs/tables/code needed to answer the writeup questions
should be pasted into README.pdf.
- All extra credit material should be in extra.zip; do not
place any extra credit material in src.zip.
Interesting Tidbits
- Word-frequency analysis plays a central role in
  providing the input data for tag clouds.
  There are many uses for tag clouds, such as indicating words that are more common in some writing (e.g., someone's blog)
  than they are in general (e.g., on all web pages).
- Word-frequency analysis also plays an important role in
  cryptanalysis, the science of breaking secretly encoded messages. The
  first mention of using the frequency distribution of a language to
  break codes was in a 14-volume Arabic encyclopedia written by
  al-Qalqashandi in 1412. The idea is attributed to Ibn al-Duraihim, the
  brilliant 13th-century Arab cryptanalyst. If you are interested in
  cryptography, be sure to check out The Code Book
  by Simon Singh. It is a great introduction to cryptography and cryptanalysis.
- Think computer science is all fun and games?
  The Codebreakers,
  by David Kahn, is a fascinating look at many of the problems solved by cryptanalysis, from breaking WWII secret
  messages to the very question of who wrote Shakespeare's works!
Acknowledgments
A variant of this project was the third and final project in CSE326
for many years. The first word counter appeared during Spring 2000
when 326 was taught by the famous Steve Wolfman. The snowy-bearded
historian appeared in Autumn 2001, courtesy of Adrien Treuille. The
assignment was subsequently tweaked over the years by Matt Cary, Raj
Rao, Ashish Sabharwal, Ruth Anderson, David Bacon, Hal Perkins, and
Laura Effinger-Dean. Dan Grossman adapted it to be the second project
in CSE332. James Fogarty stayed up late during the snow-pocalypse of 2012
to clarify confusions experienced in previous offerings (which had
been carefully detailed by Ruth Anderson).