CSE326 Frequency Assignment
Spring 2002
Due: Monday, May 20, 2002, at the beginning of class.
  Electronic turn-in due by 5:30 PM
  Group assignments due to Nick by Friday, May 10
Reading
- 8.3--8.5. We probably won't be discussing extendible hashing
much in this class, but read 8.4 if you like.
Programming
I. Introduction
You have just been approached by a world-famous UW history professor.
He would like you to settle a centuries-old debate: who wrote
Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest
that this question is surely outside your area of expertise. "Oh,
no," chuckles the historian, stroking his snowy white beard. "I need
a Computer Scientist!"
II. Word Frequency Analysis
The professor suspects that some authors use particular words more often
than others. He hopes that if we study the frequency with which authors
use words, we may be able to come up with a word usage "signature" for
that author. This signature should be quite consistent across a particular
author's works, but vary greatly between authors.
The professor wants you to write a program to compute the
frequencies of words in a text. The program should take a text file
on stdin and count the number of times each word occurs. It will be
invoked as
freq -abc < some-text-file
where -abc are possible flags controlling the program. The
output should be in the following format:
970 the
708 and
666 of
632 to
521 I
466 a
466 in
466 my
383 you
.
.
.
where the first number is the number of times the second string
occurs in the text. The professor will be grading... er, I mean,
analyzing, your data automatically, so if your output deviates
from this he will speak to his good friend N. Deibel and have him
penalize you in your 326 class.
Strangely enough, the professor wants you to hand in this project
using the turnin program to the CSE 326 database. He would like a copy
of your source code, your Makefile, and a 1-2 paragraph answer to his
question: based on the data you have accumulated, did Bacon write
Shakespeare's plays?
III. 326 Details
Okay, there is no historian with a snowy white beard. There is only
your Instructor, with flowing brown locks. Your program should be
based on a Dictionary ADT represented in several ways. You should
implement at least a binary search tree (BST), an AVL tree,
a separately chained hash table, and
an open-addressed hash table with linear probing (but see
Extra-Credit, below). A sample makefile and some code are available
in the course directory at projects/freq.tar.gz. You should
read these files, but you are not required to use them. What you must
do is as follows.
- You may work in groups of up to three people. All members
of a group will receive the same grade. E-mail your group to Nick by
Friday, May 10.
- We must be able to grade your assignment by unpacking your
turnin, and running make. That is, you must use a makefile,
and your default target must build your executable.
- Your executable must be named freq, and must produce the
output as described above. Note that the output
is sorted by frequency from highest to lowest, and if the frequencies
are the same, sorted by string. To sort by string, use the < and >
operators of the string class (see the sample code, and the sketch
after this list).
- Your program must take its input on stdin, and print the
frequency information to stdout. Nothing else is to be printed to
stdout. Any errors or messages should be printed to stderr. The
sample code has examples of how to do this.
- Your program must take the following flags. The sample code has
an example of an easy way to parse flags.
- -t, -a: use a BST or an AVL tree, respectively. How you
produce the sorted output is up to you (and an excellent topic for
your README).
- -c num, -o num: Use a separately chained hash table or an
open-addressed hash table with linear probing, respectively, of size
num. You can use the wc program to count how many words are
in a file to get an idea of how large to make the hash table (this
overestimates what you want, since the number of different words
will be less than the total number of words, but at least it lets
you ballpark it). Nick has also put word_count and
word_freq scripts in the course bin/ directory that
you might find useful. Note that the sorting of the output from
word_freq is slightly different from the requirements for this
project.
Your program should fail gracefully if an open-addressed
hash table overflows. You may use the hash function of your
choice; I suggest the string hash function described in class (one
common choice is sketched below, after this list).
- -l num: print only the num highest frequency
words. If num is less than one, or this flag is not
present, print the frequencies of all words sorted with most
frequent word first.
- -h: print, to stderr, a brief message explaining the flags
used by your program. This should include all of these flags as
well as any flags used for any extra-credit.
You may add any additional flags you desire--for example, to help with
debugging.
- Your program should not filter the input: case and punctuation
should matter. This actually makes your job easier, as your input
loop can be very simple. The input loop in the sample code does
exactly what is required; the sketch after this list uses the same
kind of loop.
- Your README should include the following.
- The 1-2 paragraph answer to the question raised by the history
professor. Use hamlet.txt and atlantis.txt found in
the course directory under share/texts. These files were
taken from Project Gutenberg;
full copyright information can be found in the full texts
in the same location.
- Any design decisions made while doing this assignment. This
includes both in the implementation of any of the data structures,
and in the overall organization of your program.
- How the different members of the group, if any, contributed to
the project. This will not be used to grade people differently, as
all members of the group will receive the same grade. Rather, this
is to make you think explicitly about how to divide up the workload.
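For concreteness, here is a minimal sketch of the flow described in
the list above: the whitespace-delimited input loop, the
frequency-then-string ordering, and the required output format. It
uses std::map as a stand-in for the Dictionary representations you
must implement yourself, so treat it as an illustration rather than a
starting point for your data structures.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Orders by frequency, highest first; ties are broken with the
// string's own < operator, as required above.
bool byFreqThenWord(const std::pair<std::string, int>& a,
                    const std::pair<std::string, int>& b) {
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

int main() {
    std::map<std::string, int> counts;  // stand-in for your Dictionary
    std::string word;
    // operator>> splits on whitespace only, so case and punctuation
    // are preserved, exactly as required.
    while (std::cin >> word)
        ++counts[word];

    std::vector<std::pair<std::string, int> > sorted(counts.begin(),
                                                     counts.end());
    std::sort(sorted.begin(), sorted.end(), byFreqThenWord);

    for (std::vector<std::pair<std::string, int> >::const_iterator
             it = sorted.begin(); it != sorted.end(); ++it)
        std::cout << it->second << " " << it->first << "\n";
    return 0;
}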
The sample code provided builds an executable that follows these
directions, except that it uses only a single, and very stupid,
Dictionary representation. Note that you need not test your program
on the full texts; any short text file (such as your source code) will
do.
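The handout does not reproduce the hash function from lecture, so
here is a sketch of one common polynomial string hash of the general
kind suggested above; the multiplier 37 and the name hashString are
illustrative choices, not requirements.

#include <string>

// A common polynomial string hash. tableSize is the current capacity
// of the hash table; the result is an index into it.
unsigned int hashString(const std::string& key, unsigned int tableSize) {
    unsigned int hashVal = 0;
    for (std::string::size_type i = 0; i < key.length(); ++i)
        hashVal = 37 * hashVal + key[i];
    return hashVal % tableSize;
}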
Any source code that you've seen or used in this class or in the
textbook may be used on the project. Except for the specific
exceptions for extra-credit noted below, you may not use any other
code.
IV. Extra-Credit
This assignment has a multitude of extra-credit options. To get any
extra credit, you must make a reasonable attempt for at least 6
points. For example, just adding a splay tree will get no extra
credit, but adding a splay tree and dynamic hash resizing based on
collisions would. If you make reasonable attempts for at least 6
points, you will still get extra credit even if your total score is
not six. Continuing the example, if your hash resizing had minor
problems, you might get 4 or 5 extra credit points. Even though this
is less than 6, because you tried for 6 you will get credit.
To get credit, you must clearly list at the beginning of your README
- which options you tried, and
- the total number of points you're going
for (to make it easier to resolve confusion about how much different
options are worth).
Extra-Credit flag summary:
- -s: Use a splay tree.
- -q, -d: Use quadratic or double-hash probing, respectively.
- -w: Use word stemming to filter the input.
Extra-Credit Options
- Use a splay tree in addition to the representations above. This
representation should be invoked with the -s flag. (3 points)
- Dynamically change the size of the hash table for better performance.
A simple implementation of this would resize an open-addressed table
to prevent overflow. (2 points)
A more involved option would grow the table when
too many collisions start occurring (this means you'll have to think
up a reasonable way to measure collisions, the sophistication of which
will affect your score). (4 points)
- Add quadratic or double-hash probing for your open-addressed hash
table implementation in addition to linear probing. For double
hashing, the second hash function may be of your choice. For more
credit, try several and report on how well they work (using either
hash statistics or profiling, as described below). The standard probe
sequences are sketched at the end of this section. (1 point per
extra probe method, 2 points per extra probe method if statistics
generated as below)
- Report hash table statistics when using a hash table
representation.
- How do the number of probes for a successful or unsuccessful
search compare with the theoretical estimates presented in class
or the book?
- How does the number of probes change for different load
factors? It may be the case that the theoretical estimates are
valid for middling load factors, but break down when the table
gets very full or very empty.
- How does the choice of hash function affect the
performance? Which hash functions work best? You may want to try
some of the simple hash functions I presented in class to see if
they are really as bad as I said they were.
This extra-credit option is especially interesting when paired with
the previous or next options. Remember that any statistics should
be printed to stderr so as not to interfere with any frequency
output. Even better, add a flag to turn statistic reporting on or
off. (4 points)
- Profiling: profile all three trees using gprof. (The basic
gprof workflow is noted at the end of this section.)
You should run your trees on at least two texts:
share/texts/a-words.txt in the course directory, and a text of
your choice from Project Gutenberg. Your README should include a
short paragraph with the following: (6 points)
- The input files you profiled on.
- Your expected performance bottlenecks.
- How your expectations differed from (or agreed with) the
gprof results. Did you find anything unusual/unexpected?
- The biggest bottleneck for your program, and what you think
may be the cause.
For even more fun, test the following as well.
- What's the difference in performance when you compile with
optimization rather than debug? (Use -O3 instead of the -g flag in
your makefile.) (+1 point)
- Answer the questions raised by the previous option, using
profiling information instead of hash statistics. (+2 points)
- Calculate hash statistics as described in the previous
option. Are they an accurate predictor of actual performance?
(+1 point per hash type or probe method examined)
- Does frequency analysis produce a good author signature?
Find some other texts by Shakespeare or Bacon from Project Gutenberg, and write a
couple paragraphs in your README responding to this question. (1 point)
- Word stemming is a process in which:
- common word suffixes like "ing", "ed", and "ly" are removed,
- plural words are made singular,
- ... well, you get the picture. Other simplifications include
removing prefixes and changing verbs to nouns if possible.
So, a word-stemming word-frequency counter would count the word
"car" twice in the following sentence: "The red car hit the blue
cars".
Note that simply removing certain letters or strings
at the end of words will not work: "Swiss" is not the
plural of "Swis", nor is "bed" the past tense of "b".
Simple word-stemming algorithms can be found on the Internet,
so your best reference will probably be a search engine like
Google.
You should only do word stemming if the -w flag is passed
to your program, as we still want to be able to check the core
functionality of your program.
Word stemming algorithms are very complicated. Feel free to use
code you find on the Internet in your algorithm, as long as you
attribute any outside code used in your README.
(6 points)
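For reference, here is a sketch of the standard probe sequences
behind the linear, quadratic, and double-hashing options. h1 and h2
stand for your primary and secondary hash values, i is the probe
number, and m is the table size; all names here are illustrative.

// Index examined on probe number i = 0, 1, 2, ... for a key with
// primary hash h1 (and, for double hashing, secondary hash h2) in a
// table of size m.
unsigned int linearProbe(unsigned int h1, unsigned int i, unsigned int m) {
    return (h1 + i) % m;
}

unsigned int quadraticProbe(unsigned int h1, unsigned int i, unsigned int m) {
    // Guaranteed to find an empty slot if m is prime and the table
    // is less than half full.
    return (h1 + i * i) % m;
}

unsigned int doubleHashProbe(unsigned int h1, unsigned int h2,
                             unsigned int i, unsigned int m) {
    // h2 must never be 0 mod m; a common fix is to compute the
    // secondary hash as R - (hash % R) for a prime R smaller than m.
    return (h1 + i * h2) % m;
}

For the profiling option, the usual gprof workflow is to compile and
link with the -pg flag, run ./freq on some input so that it writes
gmon.out, and then run gprof freq gmon.out to see where the time
went.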
Credits
This assignment is starting to become a fixture in 326. The first
appearance of the snowy-bearded historian was in 326 winter 2000,
taught by the famous Steve Wolfman. It has been incrementally
improved over the years, most recently last quarter by Henry Kautz.