CSE326 Frequency Assignment
Spring 2002
Due: Monday, May 20, 2002, at the beginning of class.
  Electronic turn-in due by 5:30 PM
  Group assignments due to Nick by Friday, May 10
Reading
- 8.3--8.5. We probably won't be discussing extendible hashing
much in this class, but read 8.4 if you like.
Programming
I. Introduction
You have just been approached by a world-famous UW history professor.
He would like you to settle a centuries-old debate: who wrote
Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest
that this question is surely outside your area of expertise. "Oh,
no," chuckles the historian, stroking his snowy white beard. "I need
a Computer Scientist!"
II. Word Frequency Analysis
The professor suspects that some authors use particular words more often
than others. He hopes that if we study the frequency with which authors
use words, we may be able to come up with a word usage "signature" for
that author. This signature should be quite consistent across a particular
author's works, but vary greatly between authors.
The professor wants you to write a program to compute the
frequencies of words in a text. The program should take a text file
on stdin and count the number of times each word occurs. It will be
invoked as
freq -abc < some-text-file
where -abc are possible flags controlling the program. The
output should be in the following format:
970 the
708 and
666 of
632 to
521 I
466 a
466 in
466 my
383 you
.
.
.
where the first number is the number of times the second string
occurs in the text. The professor will be grading... er, I mean,
analyzing, your data automatically, so if your output deviates
from this he will speak to his good friend N. Deibel and have him
penalize you in your 326 class.
Strangely enough, the professor wants you to hand in this project
using the turnin program to the CSE 326 database. He would like a copy
of your source code, your Makefile, and a 1-2 paragraph answer to his
question: based on the data you have accumulated, did Bacon write
Shakespeare's plays?
III. 326 Details
Okay, there is no historian with a snowy white beard. There is only
your Instructor, with flowing brown locks. Your program should be
based on a Dictionary ADT represented in several ways. You should
implement at least a binary search tree (BST), an AVL tree,
a separately chained hash table, and
an open-addressed hash table with linear probing (but see
Extra-Credit, below). A sample makefile and some code are available
in the course directory at projects/freq.tar.gz. You should
read these files, but you are not required to use them. What you must
do is as follows.
- You may work in groups of up to three people. All members
of a group will receive the same grade. E-mail your group to Nick by
Friday, May 10.
- We must be able to grade your assignment by unpacking your
turnin, and running make. That is, you must use a makefile,
and your default target must build your executable.
- Your executable must be named freq, and must produce the
output as described above. Note that the output
is sorted by frequency from highest to lowest, and if the frequencies
are the same, sorted by string. To sort by string, use the < and >
operators of the string class (see the sample code, and the sketch
after this list).
- Your program must take its input on stdin, and print the
frequency information to stdout. Nothing else is to be printed to
stdout. Any errors or messages should be printed to stderr. The
sample code has examples of how to do this.
- Your program must take the following flags. The sample code has
an example of an easy way to parse flags.
- -t, -a: use a BST or an AVL tree, respectively. How you
produce the sorted output is up to you (and an excellent topic for
your README).
- -c num, -o num: Use a separately chained hash table or an
open-addressed hash table with linear probing, respectively, of size
num. You can use the wc program to count how many words are
in a file to get an idea of how large to make the hash table (this
overestimates what you want, since the number of different words
will be less than the total number of words, but at least it lets
you ballpark it). Nick has also put word_count and
word_freq scripts in the course bin/ directory that
you might find useful. Note that the sorting of the output from
word_freq is slightly different from the requirements for this
project.
Your program should fail gracefully if an open-addressed
hash table overflows. You may use the hash function of your
choice; I suggest the string hash function described in class (one
common choice is sketched below, after this list).
- -l num: print only the num highest frequency
words. If num is less than one, or this flag is not
present, print the frequencies of all words sorted with most
frequent word first.
- -h: print, to stderr, a brief message explaining the flags
used by your program. This should include all of these flags as
well as any flags used for any extra-credit.
You may add any additional flags you desire--for example, to help with
debugging.
- Your program should not filter the input: case and punctuation
should matter. This actually makes your job easier, as your input
loop can be very simple. The input loop in the sample code does
exactly what is required; the sketch after this list uses the same
kind of loop.
- Your README should include the following.
- The 1-2 paragraph answer to the question raised by the history
professor. Use hamlet.txt and atlantis.txt found in
the course directory under share/texts. These files were
taken from Project Gutenberg;
full copyright information can be found in the full texts
in the same location.
- Any design decisions made while doing this assignment. This
includes both in the implementation of any of the data structures,
and in the overall organization of your program.
- How the different members of the group, if any, contributed to
the project. This will not be used to grade people differently, as
all members of the group will receive the same grade. Rather, this
is to make you think explicitly about how to divide up the workload.
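For concreteness, here is a minimal sketch of the flow described in
the list above: the whitespace-delimited input loop, the
frequency-then-string ordering, and the required output format. It
uses std::map as a stand-in for the Dictionary representations you
must implement yourself, so treat it as an illustration rather than a
starting point for your data structures.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Orders by frequency, highest first; ties are broken with the
// string's own < operator, as required above.
bool byFreqThenWord(const std::pair<std::string, int>& a,
                    const std::pair<std::string, int>& b) {
    if (a.second != b.second)
        return a.second > b.second;
    return a.first < b.first;
}

int main() {
    std::map<std::string, int> counts;  // stand-in for your Dictionary
    std::string word;
    // operator>> splits on whitespace only, so case and punctuation
    // are preserved, exactly as required.
    while (std::cin >> word)
        ++counts[word];

    std::vector<std::pair<std::string, int> > sorted(counts.begin(),
                                                     counts.end());
    std::sort(sorted.begin(), sorted.end(), byFreqThenWord);

    for (std::vector<std::pair<std::string, int> >::const_iterator
             it = sorted.begin(); it != sorted.end(); ++it)
        std::cout << it->second << " " << it->first << "\n";
    return 0;
}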
The sample code provided builds an executable that follows these
directions, except that it uses only a single, and very stupid,
Dictionary representation. Note that you need not test your program
on the full texts; any short text file (such as your source code) will
do.
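The handout does not reproduce the hash function from lecture, so
here is a sketch of one common polynomial string hash of the general
kind suggested above; the multiplier 37 and the name hashString are
illustrative choices, not requirements.

#include <string>

// A common polynomial string hash. tableSize is the current capacity
// of the hash table; the result is an index into it.
unsigned int hashString(const std::string& key, unsigned int tableSize) {
    unsigned int hashVal = 0;
    for (std::string::size_type i = 0; i < key.length(); ++i)
        hashVal = 37 * hashVal + key[i];
    return hashVal % tableSize;
}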
Any source code that you've seen or used in this class or in the
textbook may be used on the project. Except for the specific
exceptions for extra-credit noted below, you may not use any other
code.
IV. Extra-Credit
This assignment has a multitude of extra-credit options. To get any
extra credit, you must make a reasonable attempt for at least 6
points. For example, just adding a splay tree will get no extra
credit, but adding a splay tree and dynamic hash resizing based on
collisions would. If you make reasonable attempts for at least 6
points, you will still get extra credit even if your total score is
not six. Continuing the example, if your hash resizing had minor
problems, you might get 4 or 5 extra credit points. Even though this
is less than 6, because you tried for 6 you will get credit.
To get credit, you must clearly list at the beginning of your README
- which options you tried, and
- the total number of points you're going
for (to make it easier to resolve confusion about how much different
options are worth).
Extra-Credit flag summary:
- -s: Use a splay tree.
- -q, -d: Use quadratic or double-hash probing, respectively.
- -w: Use word stemming to filter the input.
Extra-Credit Options
- Use a splay tree in addition to the representations above. This
representation should be invoked with the -s flag. (3 points)
- Dynamically change the size of the hash table for better performance.
A simple implementation of this would resize an open-addressed table
to prevent overflow. (2 points)
A more involved option would grow the table when
too many collisions start occurring (this means you'll have to think
up a reasonable way to measure collisions, the sophistication of which
will affect your score). (4 points)
- Add quadratic or double-hash probing for your open-addressed hash
table implementation in addition to linear probing. For double
hashing, the second hash function may be of your choice. For more
credit, try several and report on how well they work (using either
hash statistics or profiling, as described below). The standard probe
sequences are sketched at the end of this section. (1 point per
extra probe method, 2 points per extra probe method if statistics
generated as below)
- Report hash table statistics when using a hash table
representation.
- How do the number of probes for a successful or unsuccessful
search compare with the theoretical estimates presented in class
or the book?
- How does the number of probes change for different load
factors? It may be the case that the theoretical estimates are
valid for middling load factors, but break down when the table
gets very full or very empty.
- How does the choice of hash function affect the
performance? Which hash functions work best? You may want to try
some of the simple hash functions I presented in class to see if
they are really as bad as I said they were.
This extra-credit option is especially interesting when paired with
the previous or next options. Remember that any statistics should
be printed to stderr so as not to interfere with any frequency
output. Even better, add a flag to turn statistic reporting on or
off. (4 points)
- Profiling: profile all three trees using gprof. (The basic
gprof workflow is noted at the end of this section.)
You should run your trees on at least two texts:
share/texts/a-words.txt in the course directory, and a text of
your choice from Project Gutenberg. Your README should include a
short paragraph with the following: (6 points)
- The input files you profiled on.
- Your expected performance bottlenecks.
- How your expectations differed from (or agreed with) the
gprof results. Did you find anything unusual/unexpected?
- The biggest bottleneck for your program, and what you think
may be the cause.
For even more fun, test the following as well.
- What's the difference in performance when you compile with
optimization rather than debug? (Use -O3 instead of the -g flag in
your makefile.) (+1 point)
- Answer the questions raised by the previous option, using
profiling information instead of hash statistics. (+2 points)
- Calculate hash statistics as described in the previous
option. Are they an accurate predictor of actual performance?
(+1 point per hash type or probe method examined)
- Does frequency analysis produce a good author signature?
Find some other texts by Shakespeare or Bacon from Project Gutenberg, and write a
couple paragraphs in your README responding to this question. (1 point)
- Word stemming is a process in which:
- common word suffixes like "ing", "ed", and "ly" are removed,
- plural words are made singular,
- ... well, you get the picture. Other simplifications include
removing prefixes and changing verbs to nouns if possible.
So, a word-stemming word-frequency counter would count the word
"car" twice in the following sentence: "The red car hit the blue
cars".
Note that simply removing certain letters or strings
at the end of words will not work: "Swiss" is not the
plural of "Swis", nor is "bed" the past tense of "b".
Simple word-stemming algorithms can be found on the Internet,
so your best reference will probably be a search engine like
Google.
You should only do word stemming if the -w flag is passed
to your program, as we still want to be able to check the core
functionality of your program.
Word stemming algorithms are very complicated. Feel free to use
code you find on the Internet in your algorithm, as long as you
attribute any outside code used in your README.
(6 points)
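For reference, here is a sketch of the standard probe sequences
behind the linear, quadratic, and double-hashing options. h1 and h2
stand for your primary and secondary hash values, i is the probe
number, and m is the table size; all names here are illustrative.

// Index examined on probe number i = 0, 1, 2, ... for a key with
// primary hash h1 (and, for double hashing, secondary hash h2) in a
// table of size m.
unsigned int linearProbe(unsigned int h1, unsigned int i, unsigned int m) {
    return (h1 + i) % m;
}

unsigned int quadraticProbe(unsigned int h1, unsigned int i, unsigned int m) {
    // Guaranteed to find an empty slot if m is prime and the table
    // is less than half full.
    return (h1 + i * i) % m;
}

unsigned int doubleHashProbe(unsigned int h1, unsigned int h2,
                             unsigned int i, unsigned int m) {
    // h2 must never be 0 mod m; a common fix is to compute the
    // secondary hash as R - (hash % R) for a prime R smaller than m.
    return (h1 + i * h2) % m;
}

For the profiling option, the usual gprof workflow is to compile and
link with the -pg flag, run ./freq on some input so that it writes
gmon.out, and then run gprof freq gmon.out to see where the time
went.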
Credits
This assignment is starting to become a fixture in 326. The first
appearance of the snowy-bearded historian was in 326 winter 2000,
taught by the famous Steve Wolfman. It has been incrementally
improved over the years, most recently last quarter by Henry Kautz.