CSE326 Frequency Assignment
Spring 2002

Due: Monday, May 20, 2002, at the beginning of class.
  Electronic turn-in due by 5:30 PM
  Group assignments due to Nick by Friday, May 10

Reading

Programming

I. Introduction

You have just been approached by a world famous UW history professor. He would like you to settle a centuries-old debate on who wrote Shakespeare's plays, Shakespeare or Sir Francis Bacon? You protest that this question is surely outside of your area expertise. "Oh, no," chuckled the historian, stroking his snowy white beard. "I need a Computer Scientist!"

II. Word Frequency Analysis

The professor suspects that some authors use particular words more often than others. He hopes that if we study the frequency with which authors use words, we may be able to come up with a word usage "signature" for that author. This signature should be quite consistent across a particular author's works, but vary greatly between authors.

The professor wants you to write a program to compute the frequencies of words in a text. The program should take a text file on stdin, and count the number of times that each word occurs in each.

  freq -abc < some-text-file
where -abc are possible flags controlling the program. The output should be in the following format:
  970 the
  708 and
  666 of
  632 to
  521 I
  466 a
  466 in
  466 my
  383 you
    .
    .
    .

where the first number is the frequency that the second string occurs in the text. The professor will be grading... er, I mean, analyzing, your data automatically, so if your output deviates from this he will speak to his good friend N. Deibel and have him penalize you in your 326 class.

Strangely enough, the professor wants you to hand in this project using the turnin program to the CSE 326 database. He would like a copy of your source code, your Makefile, and a 1-2 paragraph answer to his question: based on the data you have accumulated, did Bacon write Shakespeare's plays?

III. 326 Details

Okay, there is no historian with a snowy white beard. There is only your Instructor, with flowing brown locks. Your program should be based on a Dictionary ADT represented in several ways. You should implement at least a BST tree, an AVL tree, a separately chained hash table, and an open addressed hash table with linear probing (but see Extra-Credit, below). A sample makefile and some code are available in the course directory at projects/freq.tar.gz. You should read these files, but you are not required to use them. What you must do is as follows.

  1. You may work in groups of up to three people. All members of a group will receive the same grade. E-mail your group to Nick by Friday, May 10.
  2. We must be able to grade your assignment by unpacking your turnin, and running make. That is, you must use a makefile, and your default target must build your executable.
  3. Your executable must be named freq, and must produce the output as described above. Note that the output is sorted by frequency from highest to lowest, and if the frequencies are the same, sorted by string. To sort by string, use the < and > operators of the string class (see the sample code).
  4. Your program must take its input on stdin, and print the frequency information to stdout. Nothing else is to be printed to stdout. Any errors or messages should be printed to stderr. The sample code has examples of how to do this.
  5. Your program must take the following flags. The sample code has an example of an easy way to parse flags. You may add any additional flags you desire--for example, to help with debugging.
  6. Your program should not filter the input: case and punctuation should matter. This actually makes your job easier, as your input loop can be very simple. The input loop in the sample code does exactly what is required.
  7. Your README should include the following.
The sample code provided build an executable that follows these directions, except that it uses only a single, and very stupid, Dictionary representation. Note that you should not test your program on the full texts; any short text file (such as your source code) will do. Any source code that you've seen or used in this class or in the textbook may be used on the project. Except for the specific exceptions for extra-credit noted below, you may not use any other code.

IV. Extra-Credit

This assignment has a multitude of extra-credit options. To get any extra credit, you must make a reasonable attempt for at least 6 points. For example, just adding a splay tree will get no extra credit, but adding a splay tree and dynamic hash resizing based on collisions would. If you make reasonable attempts for at least 6 points, you will still get extra credit even if your total score is not six. Continuing the example, if your hash resizing had minor problems, you might get 4 or 5 extra credit points. Even though this is less than 6, because you tried for 6 you will get credit.

To get credit, you must clearly list at the beginning of your README

Extra-Credit flag summary.

Extra-Credit Options

  1. Use a splay tree in addition to the representations above. This representation should be invoked with the -s flag. (3 points)

  2. Dynamically change the size of the hash table for better performance. A simple implementation of this would resize an open-addressed table to prevent overflow. (2 points)
    A more involved option would grow the table when too many collisions start occurring (this means you'll have to think up a reasonable way to measure collisons, the sophistication of which will affect your score). (4 points)

  3. Add quadratic or double hash probing for your open hash table implementation in addition to linear probing. For double hashing, the second hash function may be of your choice. For more credit, try several and report on how well they work (using either hash statistics or profiling, as described below). (1 point per extra probe method, 2 points per extra probe method if statistics generated as below)

  4. Report hash table statistics when using a hash table representation. This extra-credit options is especially interesting when paired with the previous or next options. Remember that any statistics should be printed to stderr so as not to interfere with any frequency output. Even better, add a flag to turn statistic reporting on or off. (4 points)

  5. Profiling - Profile all three trees using gprof. You should run your trees on at least two texts: share/texts/a-words.txt in the course directory, and a text of your choice from Project Gutenberg. Your README should include a short paragraph with the following: (6 points) For even more fun, test the following as well.
  6. Does frequency analysis produce a good author signature? Find some other texts by Shakespeare or Bacon from Project Gutenberg, and write a couple paragraphs in your README responding to this question. (1 point)

  7. Word stemming is a process in which: So, a word-stemming word-frequency-counter would count the word "car" twice in the following sentence: "The red car hit the blue cars".

    Note that simply removing certain letters or strings at the end of words will not work: "Swiss" is not the plural of "Swis", nor is "bed" the past tense of "b". Simple word-stemming algorithms can be found on the Internet, so your best reference will probably be a search engine like Google

    You should only do word stemming if the -w flag is passed to your program, as we still want to be able to check the core functionality of your program.

    Word stemming algorithms are very complicated. Feel free to use code you find in the network in your algorithm, as long as you attribute any outside code used, in your README. (6 points)

Credits

This assignment is starting to become a fixture in 326. The first appearance of the snowy-bearded historian was in 326 winter 2000, taught by the famous Steve Wolfman. It has been incrementally improved over the years, most recently last quarter by Henry Kautz.