|
Project 3: Shake-n-Bacon
Data Mining with Trees and Tables
|
Important Deadlines
Project released |
Sat, Feb 13 |
|
Partner selection |
Mon, Feb 22 |
Send email to Tyler by 11pm |
Code Due:
code for requirements Va, Vb and Vc
(no writeup) |
Mon, March 1 Wed, March 3 |
Electronic turnin by 11pm |
Benchmarking, Writeup, Above & Beyond Code Due:
Submit README.txt and Above & Beyond code |
Thurs, March 4 Fri, March 5 |
Electronic turnin by 11pm |
Clarifications
Check out the clarifications to the project here.
Outline
I. Introduction
You have just been approached by a world famous UW history professor.
He would like you to settle
a centuries-old debate
on who wrote Shakespeare's plays,
Shakespeare or
Sir Francis
Bacon? You protest that this question is surely outside of your area
of expertise. "Oh, no," chuckles the historian, stroking his snowy white
beard. "I need a Computer Scientist!"
II. Learning Objectives
For this assignment, you will:
- Be introduced to DataCounter, a variant of
the DictionaryADT, and understand the
strengths of various DataCounter implementations
- Gain proficiency with Splay and AVL trees
- Become comfortable with implementing a HashTable
- Understand some of the complexities of word stemming
- Implement sorting algorithms
- Learn how to benchmark your code
- Further experience with testing (unit tests)
III. Word Frequency Analysis
Authors tend to use some words more often than others. For example,
Shakespeare used "thou" more often than Bacon. The professor
believes that a "signature" can be found for each author, based on
frequencies of words found in the author's works, and that this signature
should be consistent across the works of a particular author but vary
greatly between authors. He wants you to come up with a way of
quantifying the difference between two written works, and to use your
technique on several of Shakespeare's and Bacon's works to settle the
ancient debate!
The professor has provided you with copies of Shakespeare's
(Hamlet) and Bacon's writing
(The New Atlantis), which he has
painstakingly typed by hand, from his antique, leather-bound
first-editions. Being good scientists, however, you quickly realize that
it is impossible to draw strong conclusions based on so little data,
and asking him to type two more books is out of the question!
Thus, you should download and analyze several more works, as many works as
you feel is necessary to support your conclusion. (Project
Gutenberg is a good
place to start looking.)
IIIa. Word Stemming
When dealing with document correlations, it is often desirable to
work only with the roots of words. That way, "sleeps", "sleeping",
and "sleep" are all considered to be the same word. This process is
called word stemming, and is used in most real-world search
engines. For this assignment, you only need to follow two
simple rules:
- Convert all the words to lowercase ("An" and "an" are the same word)
- Remove all punctuation ("end." and "end" are the same word)
The supplied class FileWordReader includes code to do this
processing for you.
Word stemming is a fairly complex topic. What these rules do is
not so much word stemming as input normalization; you do not try to
undo conjugations or other morphology. Fancier word stemming such as
removing 's' from the end of a word can lead to erroneous results
(such as "bu" from "bus") and require special logic. Even our simple
rules cause problems; for instance, "it's" and "its" are now the same
word. Implementing a better stemming algorithm (like Porter
Stemming) is above and beyond work.
IIIb. Signature Generation
A fundamental part of your work lies in computing the "signature" of
a document. The professor has provided you with a sample
WordCount
program that reads in a document and counts the
number of times that a stemmed word appears, assuming that the
document's words are already stemmed.
The output of this program looks like this:
970 the
708 and
666 of
632 to
521 at
521 i
521 into
466 a
444 my
391 in
383 buffalo
...
where the number in column 1 is the frequency that the
corresponding string in column 2 occurs in the text. Note that the
WordCount
program sorts its output primarily in decreasing
order by frequency count, secondarily by alphabetical order. The
ordering would be identical if it had sorted by frequency
fraction first (i.e.
frequency_count/num_total_words
).
IIIc. Document Correlation
Document correlation is a reasonably large area of study. Perhaps
its most visible application is in search engines which rank the
correlation of webpages to a set of keywords that you provide. One
model often used for correlating documents is Latent
Semantic Indexing (LSI) where a collection of documents is considered
together (rather than independently) and a word's usefulness is
determined by how frequently it appears in the documents (for
instance, "the" isn't very useful because it appears in most
documents).
We will not be doing LSI (it is, however, an extra credit option). We will do a simpler
document comparison:
- Calculate word counts for the two documents and normalize the
frequencies so that they can be meaningfully compared between
different documents (hint: use frequency percentages
or fractions.)
- As in LSI, remove words whose relative frequencies are too high
or too low to be useful to your study. A good starting point
is to remove words with normalized frequencies above 0.01 (1%) and
below 0.0001 (0.01%), but feel free to play around with these
numbers. You should, however, report results with these frequencies so
we can compare outputs with a standard set of numbers.
- For every word that occurs in both documents, take the
difference between the normalized frequencies, square that
difference, and add the result to a running sum.
- The final value of this running sum will be your difference
metric. This metric corresponds to the square of the
Euclidean distance between the two vectors in the space of
shared words in the document. Note that this metric assumes that
words not appearing in both documents do not affect the
correlation.
IV. Partners
You are encouraged (although not required) to work with a partner of
your own choosing for this project. No more than two students total can be on a
team together. You may divide the work however you wish, under
three conditions:
- Document each team member's effort in the README.txt file
- Work together and make sure you both understand your
answers to the README.txt questions below
- Understand (at least) at a high level how your team
member's code is structured and how it works.
Remember to test your team's code as a whole
to make sure that your portions work together properly! Do not
attempt to merge your code on the project due date. You will very
likely have problems. Also, be aware that except in extreme cases
when you notify us in advance of the deadline, all team members
will receive the same grade for the project.
V. Requirements
There are five steps in this project:
- Write three DataCounter
dictionary implementations (AVL, Splay, Hash) and unit tests for each
implementation.
- Modify
WordCount
to be able
use your DataCounter implementations, and to select
the implementation at runtime. Your new program must include a
new sorting algorithm to replace the existing insertion sort.
- Write a document correlator that will print a
difference score between two documents
- Benchmark your data structures and
correlator
- Analyze and write up the results of your
performance tests
The analysis and writeup will be
significantly longer in this
project. Be sure to allocate time for it. It is worth 1/3 of your grade,
and you will not be able to do it in an hour or two.
Va. DataCounter Implementations
For this assignment, you must implement three data structures:
- AVL tree
- Splay tree (recursive bottom up as discussed in class)
- Hash table
All three of these data structures must implement the DataCounter
interface, which is a specialized version of DictionaryADT
. You do not need to implement remove
in any of these structures (doing so is Above and Beyond). You can implement any hash tables discussed in class; the only restriction is that it should not restrict the size of the input domain or the number of inputs (i.e. your hash table must grow). As with the last project, you are free to use code and skeleton code from the text book (but not the text book's web site) in implementing your structures as long as you implement the DataCounter
interface. Do not use code from the web or other sources.
The history professor has provided an unbalanced BinarySearchTree for you; feel free to use, modify, extend, or discard it. If you discard it, you should create a new version of BinarySearchTree with the same functionality. Your AVLTree and SplayTree should extend BinarySearchTree.
Also each of your data structures must include a test suite. The code for BinarySearchTree.java includes a test written as a main program to give you an example of the kinds of checks you might do. We strongly suggest you use JUnit 4.
Vb. WordCount
The WordCount program will read in a text file and tally up all the words that appear in it. The WordCount program given to you currently uses an unbalanced binary search tree as its backing DataCounter implementation and contains an insertion sort. You may base your WordCount program on it, or write your own. You need to add additional DataCounter implementations and a new sorting method.
The commandline form for WordCount will be as follows:
java WordCount [ -b | -a | -s | -h ] [ -frequency | -num_unique ] <filename>
-b
Use an Unbalanced BST to
implement the DataCounter
-a
Use an AVL Tree
-s
Use a Splay Tree
-h
Use a Hashtable
-frequency
Print all the word/frequency
pairs, ordered by frequency, and then by the words in lexicographic
order
-num_unique
Print the number of unique
words in the document. This is the total number of distinct (different) words
in the document. Words that appear more than once are only counted as a single
word
for this
statistic.
It is fine to require that one of -b, -a, -s,
or -h
must be specified for your program to run. Your program should not crash,
however, if given an invalid command line.
Note that for the -frequency option, you need to produce
words ordered primarily by frequency and secondarily by lexicographic
(i.e., alphabetical) order. For example:
43 these
42 upon
42 your
41 after
41 into
40 said
39 many
39 more
38 an
The sample WordCount program does this sorting using Insertion
Sort.
You
must write a better sorting algorithm for this
in your final WordCount program.
Note that you may not use any built-in Java sorting functions.
Vc. Document Correlator
The Document Correlator should take in 2 documents and return the
correlation (difference metric calculation) between them. You may want
to use the WordCount class given to you to implement the backend of the
Correlator, though doing so is not necessary. For
the basic requirements, you must design an algorithm that does the
comparison specified in section
IIIc Document
Correlation.
This program should also take command line flags to specify which
backing data structure to use. The commandline structure should be:
Usage: java Correlator [ -b | -a | -s | -h ] <filename1> <filename2>
-b
Use an Unbalanced BST in the
backend
-a
Use an AVL Tree in the backend
-s
Use a Splay Tree in the backend
-h
Use a Hashtable in the backend
Vd. Benchmarks
Since we are implementing so many DataCounter dictionaries in this
project, it is natural to ask "which is faster." Benchmarking and profiling
are really the only reliable ways to judge this since there are many
many hidden assumptions in the way you write your code that will add
unexpected constants to your program. Hopefully you will do some
exploration in this assignment and prove to yourself that you really
can't predict what will affect program runtime too much (go through
and try to optimize away little things like how many assignments you
do, how many if statements you execute, etc. and see how much or
little this affects your program).
When generating (or reading) benchmarks, you must ask yourself the following
questions:
- What am I measuring? Speed is too vague. Does it mean full program
runtime? What if my program waits for user input? Does it matter?
- Why am I measuring this and why should anyone be interested in it? Full
program runtime of an interactive user app where the users fall asleep
while running the code isn't really interesting data.
- What methodology will I use to measure my program? Does it actually
measure what I want?
- What are the sources of error? Is the error big enough to matter? Are my
results still reliable?
This set of questions actually applies to any analysis.
You are required to design benchmarks that measure the
attributes listed below. You may also include any other data that you
feel necessary to draw conclusions from the benchmarks in your
analysis.
- How fast is
java WordCount -frequency
compared to
count.sh and count.pl?
- How much difference in speed do your different DataCounters make in the
correlator and/or the wordcount?
There are a few tools available to you for benchmarking. The simplest
ones are:
- The Unix
time
command.
System.nanoTime()
or System.currentTimeMillis()
(record
the time at two different places in your program and subtract to get running
time)
Both methods have their strengths and weaknesses (for instance,
time
must measure your process creation times, and JVM startup times). These
strengths and weaknesses will exhibit themselves differently depending
on where and how these tools are used. In your analysis, you will need
to cite the known sources for errors in your benchmarks and justify why
they don't matter for your measurements, or somehow create a correction
for your measurement. Essentially, you must convince us that your
benchmark is measuring something that makes sense and that your analysis
can be based off the collected data.
For example, to time count.sh
or count.pl
,
you can do the following:
time ./count.sh your-file
The syntax is the same for
count.pl. Use the
User time
value that
time returns.
Ve. README.txt
Your README.txt file needs to answer the following questions:
- Who is in your group?
- Acknowledgment of any assistance you received from anyone or
anywhere but your team members, the 326 staff, or the Weiss book.
- How long did the project take?
- Which sorting algorithm did you implement for WordCount? Why?
Under what circumstances might other sorting algorithms work better?
- Before you started, which data structure did you expect would be the fastest?
- Which data structure is the fastest? Why were you right or wrong?
- In general, which DataCounter dictionary implementation was
"better": trees or hash tables? Note that you will need to define
"better" (ease of coding, ease of debugging, memory usage, disk
access patterns, runtime for average input, runtime
for all input, etc).
- Are there cases in which a particular data structure performs
really well or badly in the correlator? Enumerate the cases for
each data structure.
- Give a one to two paragraph explanation of whether or not you
think Bacon wrote Shakespeare's plays based on the data you
collected. No fancy statistical analysis here (formal analysis
comes later); keep it fun and simple.
- Give a detailed description of all Above and Beyond projects
which you implemented. What did you find most challenging about the
projects you implemented? What did you find most interesting?
- Writeup your benchmarks (this is long). Your mission is to
convince us that your benchmark makes sense and that we should be
interested in it if we are trying to ascertain which data structure
is better suited for your input. You will need to answer at least
the following (all formal analysis should answer something similar):
- What are you measuring?
- What is the definition of "better" given your measurement?
- Why is the measurement interesting in determining which is the
superior algorithm for this project?
- What was your method of benchmarking?
- What were the sources of errors?
- What were your results?
- How did you interpret your results?
- What were your conclusions?
- Are there any interesting directions for future study?
You may attach this in a separate, non-plain-text file.
If you do use a non-plaintext format please ensure that the file extension matches
the format. Also, please consider using PDFs in place of proprietary formats (i.e
.doc(x), .xls(x), .pages, .numbers, etc.) which will just cause your TAs extra
headaches.
- What did you enjoy about this assignment? What did you hate?
Could we have done anything better?
VI. Files and Sample Code
Sample texts and starter code are available on the course web site in file
project3files.zip. (Right-click on the link
to download the file if a single click doesn't work.) You can also
get texts
of
your own at Project Gutenberg, which
has thousands of books as plain text files. Their mission is to provide
electronic versions of many popular public domain texts. Check it out!
Note that these files contain some header text that is not part of the
actual play or book. For accurate results you should remove everything
that is not part of the work. Extra text occurs at the beginning and end
or each file, and sometimes at the beginning of each act of a play.
Try running your word-counting program on the King James Bible.
(Guess which word comes up more frequently in the Bible: "he" or "she?"...
and by a factor of what?). Also, if you have any special requests for
texts or other cool files you'd like to have added to the test files,
email
the course staff
.
In addition, your history professor has provided some code which he
wrote (these days, everybody knows how to program!). You may use it
if you wish, although your code must follow the provided DataCounter interface:
- DataCounter - Specification of an interface for a DataCounter. Your classes must implement this
interface. (Note that DataCounter is a
dictionary that is specialized for this particular task,
so it isn't as generalized as
some of the ADTs we've seen in the past. This is primarily for
performance reasons.)
- BinarySearchTree - Specification and implementation of
an unbalanced binary search tree class. Use of the provided BinarySearchTree implementation is optional (you may choose
to implement your own), but your AVL and Splay tree classes
must inherit from your BinarySearchTree
implementation.
- BinarySearchTree.BSTNode- Specification and implementation of a
binary search tree node (used in the BinarySearchTree
class).
- FileWordReader - A class that reads in a file and
does simple word stemming.
- WordCount - A simple program that reads words from a
FileWordReader and tallies their frequency in a DataCounter.
- count.sh - A Unix shell script to compute word counts.
Usage:
./count.sh your-file
- count.pl - Similar to count.sh, except in Perl.
VII. Turnin
Code Turn-in
For the code turn-in, you are expected to turn in all your code (AVLTree, SplayTree, HashTable, BinarySearchTree (if you modified it), WordCount, Correlator, your unit tests, and any other code of interest). You do not have to include any Above and Beyond code at this time.
Note that the code you submit for the Code turn-in should be functional, fully-tested code.
- Include your (and your partner's) real name AND CSE NetID in every file you turn in. If you don't you will lose points.
- Turn in individual files. Do not turn in a tar, zip, or other archive. If you do you will lose points.
- If you turn in a file, ensure that it has the appropriate extension. Renaming a .doc to .txt does not make it a plain text file, and causes problems. If you miss this you will lose points.
- Please remember to remove debugging output before submitting your code.
Writeup and Benchmarking turn-in
For the final checkin, include all of the above,
your README.txt, your benchmark analysis, and any extra credit stuff you
implement. You may turn in some format other than a plain-text file,
especially for the analysis part. However, please do not turn in a file
format that is not easily portable. Consider using PDFs.
If you're working with a partner, the same person should submit
for the in-progress checkin and the final checkin.
No paper turnin is required for this project.
VIII. Grading Breakdown
Each part of the assignment will be worth (approximately) the
following percentages. Please budget your time appropriately.
35%
| Program correctness (including boundary cases) and
error-free compilation
|
20%
| Architecture/design of program. Style, commenting, layout.
Other forms of program documentation.
|
35%
| README.txt and answers to writeup questions, including
benchmarking
|
10%
| Quality and comprehensiveness of turned-in unit tests
|
IX. Going Above and Beyond
- Algorithm Design Techniques -
If you wrote your tree algorithms iteratively, re-implement them to
be recursive (or vice versa), and answer the following questions in
your README.txt: Which algorithm design technique did you find
easier to code? Which was more elegant? Had a faster runtime?
How would you define "better", and which design technique do you
think is "better" in this program? Can you think of situations
where one design technique is not applicable?
- Alternative Trees -
Implement the program with Red-Black trees (see textbook). The advantage of
Red-Black trees is that you can write non-recursive
insertion/deletion algorithms (the trade-off is that Red-Black trees
have weaker balancing condition, though they do guarantee O(log(n))
depth). Don't cheat; write non-recursive algorithms here. In
your README.txt, comment on which tree implementation was easiest to
write/debug, which was the fastest, and, if you needed to write a
tree for general use (eg, a tree to be used by all the 326 students
for all their projects), which would it be: an unbalanced BST, an
AVL tree, a splay tree, or a red-black tree? Why?
- Keeping Performance Information -
Add code to your program so that you can track the number of
comparisons and the number of rotations performed by your tree.
For this project, you will need to have implemented both a splay
tree and an AVL tree. Predict how the two trees would
compare. How did they actually compare? Were you surprised?
- Alternative Hashing Strategies -
If you wrote a closed-hashing table, implement linear,
quadratic, and one other probing strategy (you may make up your
own, if you wish). The user should be able to select their
probing strategy with command line arguments. Does one probing
strategy always work better/worse than the others? Why do you
think this is the case? Are there types of input for which
one probing strategy works better than another? Which has a
greater impact on your hash table's performance: the hash
function, or the probing strategy? If you wrote an open-hashing
table, implement a secondary dictionary instead of a linked list
(perhaps you can reuse your tree implementation?). In your README.txt, answer the following questions: does a secondary
dictionary increase or decrease the runtime for your hash
table for all inputs? On some inputs? How difficult was
it to implement a secondary dictionary?
- Data Locality -
Add code to your binary search tree which keeps track of the
average depth of a node in the tree over the course of a run.
Compare the average depths of some very common and some very
uncommon words for unbalanced binary search trees, AVL trees,
and splay trees over the course of parsing a file.
- Profiling -
Profile your program using
hprof
A profiler is a tool which enables the
programmer to obtain detailed information about how their
program performs, such as the number of times a function is
called, or how much of the program's runtime was spent in a
particular function call. Compare two tree or two hash table
implementations using a profiler. Your README.txt should include a
paragraph with the following information:
- Your expected performance bottlenecks
- How your expectations differed (or were the same) as the profiler's results.
- Did you find anything unusual/unexpected?
- What is the biggest bottleneck for your program, and what do you think may be the cause?
- Deletion -
Currently, the DataCounter interface which we have provided
does not support deletion of elements. Add deletion to the
interface and to all data structures that you've written which
implement this interface. Document your additions to the
interface in your README.txt. You should also extend your JUnit tests to test your new code.
- Algorithmic Analysis -
Implement selection sort, an optimal sorting algorithm, and a
third algorithm that does not have O(n2) running time -- examples
are another optimal sorting algorithm, radix sort, or bucket
sort. These algorithms should have a nice command-line interface
for timing tests. In your README.txt, include a short paragraph
comparing the three algorithms, including a table or graphs.
For extended extra credit, vary the distribution of
the input so that the relationships between the running times of
your various sorting algorithms change - this will especially
make a difference if bucket sort is implemented. Your README.txt
should describe what distributions you used, and why they made a
difference.
- Introspective Sort -
Introspective
sort is an unstable QuickSort variant which switches
to HeapSort for inputs which would result in a O(n2) for
normal QuickSort. Thus, it has an average-case and a worst-case
runtime of O( n log2 n ), but generally runs faster than
HeapSort even in the worst case. Implement IntroSort, and
give a sample input which would result in a quadratic runtime for
normal QuickSort (using a median-of-3 partitioning scheme).
- Iterators -
Add iterators to the DataCounter interface and implement
them for the classes which
implement DataCounter. In your README.txt, comment on whether you
think the added complexity of writing an iterator outweighs the
simplification of your algorithms, and any difficulties you
found while writing your iterators.
- Visualization -
We have provided you with a primitive method for printing out trees.
Make a full-blown tree visualization tool to better test and debug
tree code. This option is worth two "Above and Beyond"
projects.
- Word Stemming -
Word stemming is a process in which:
- Common word suffixes like "ing", "ed", and "ly" are removed,
- Plural words are made singular,
- ... well, you get the picture. Other simplifications
include removing prefixes and changing verbs to
nouns if possible.
So, a word-stemming word-frequency-counter would count the word
"buffalo" twice in the following sentence: "The bald buffalo
charged at the herd of shaggy buffaloes".
Note that simply removing certain letters or strings at the end
of words will not work: "Swiss" is not the plural of "Swis", nor
is "bed" the past tense of "b". Simple word-stemming algorithms
can be found on the internet, so your best reference will probably
be a search engine like Google.
Please only use the
web as an algorithm reference; do not copy code directly.
As with document correlation, word stemming is another fairly complex
topic. A common algorithm for doing word stemming is the Porter Stemming
Algorithm. Implementing this algorithm is Above and Beyond (they
have source code posted. If you try to do this project, please try to
implement the algorithm from scratch only referring to their source if
absolutely necessary. If you end up looking at their source, be sure to
cite it.).
Stemming algorithms of interest include the Porter
Stemming Algorithm (a complicated but very widely-used 5-step
suffix-removal algorithm) and the
Paice/Husk Stemming Algorithm (a simpler iterative
suffix-remover). This option is worth two "Above and
Beyond" projects.
- Word co-occurance -
A phenomenon even more interesting than word frequency is word
co-occurrence. Create a new WordCount that counts the
frequency of pairs of words. Your code should insert as a pair
any two words which appear within k words of each other, where k
is a command line argument (try 10 for test runs). How do BST,
AVL, and splay trees compare now? This option is worth two
"Above and Beyond" projects.
- Latent Semantic Analysis -
The underlying theory behind word co-occurrence is what is known
as Latent Semantic Analysis. Check out the
LSA website at Colorado
University for more information, and modify the
word_count program to find possible
polysemies
or
synonymies.
This option is worth three "Above and Beyond" projects.
- Go crazy! -
Have a cool idea that's not on this list? Then go for it! If
you want to go drastically beyond the basic project
specifications, check with the instructor or the TAs before you start.
Of course, your code should still meet the basic requirements.
X. Interesting Tidbits
- Word frequency analysis plays an important role in
Cryptanalysis, the science of breaking secretly encoded
messages. The first mention of using the frequency distribution
of a language to break codes was in a 14-volume Arabic
encyclopedia written by al-Qalqashandi in 1412. The idea is
attributed to Ibn al-Duraihim, the brilliant 13th century Arab
Cryptanalyst. If you are interested in cryptography, be sure to
check out The
Code Book by Simon Singh. This is great introduction to
cryptography and cryptanalysis.
- Think computer science is all fun and games?
The Codebreakers, by David Kahn, is a fascinating look at
many of the problems solved by crypotanalysis, from breaking
WWII secret messages to the very question of who wrote
Shakespeare's works!
XI. Credits
This assignment is has become a fixture in 326. The first word
counter appeared during Spring 2000 when 326 was taught by the famous
Steve Wolfman. The snowy-bearded historian appeared in Autumn 2001,
courtesy of Adrien Treuille. This assignment has been otherwise tweaked
over the years by Matt Cary, Raj Rao, Ashish Sabharwal, Ruth Anderson,
David Bacon, Hal Perkins, and Laura Effinger.