CSE 143 Autumn 2001

Homework #7 - WordCount

This is an optional assignment.

Due: Electronic submission due by 10:00 PM, Monday, December 10. Written report due in sections on Tuesday, December 11.

Purpose

This assignment is a chance to gain additional experience with container classes, particularly HashMaps, and with file and string processing. This time you're on your own - you get to put the program together from scratch.

This assignment is entirely optional. If you do it, you will receive extra credit, which has a good chance of boosting your final course grade at least a little. If you don't do it, your final grade will not be lowered, regardless of how many other students do the assignment. This assignment is good preparation for part of the final exam so, even if you chose not to submit it for credit, you may find it useful to think about the key algorithms and data structures.

As always, your code will be evaluated both for how well it meets the requirements and how well it is written - structure and comments/layout/clarity. A short report is also required.

Word counting problem

One simple kind of statistical analysis performed on literary texts is counting the total number of words they contain. This number can be compared to the number of unique words in the text and make conclusions about the richness of vocabulary for example. A more sophisticated analysis would be to count the number of occurrences of each word, and rank words by their frequency of usage. Sophisticated forms of this analysis can be used for author identification, genre classification, stylistic comparison, etc.

The goal of this problem is to write a simple program, which reports the number of unique words appearing in a text file, and also lists the top N most frequently occurring words. The program should have two command line parameters (arguments passed to main()):

the name of the text file containing the text to be analyzed and
the number N,

and should produce output on the console in a format like the following: ({"c:\\hamlet.txt", "5"})

  Processing text file c:\hamlet.txt...
  Total unique words: 4636
  Sorting word list...
  1. the : 1150
  2. and : 980
  3. to : 755
  4. of : 672
  5. i : 637

You can find numerous interesting texts at the Project Gutenberg site, or you can use any other input files you wish.

Implementation Requirements

There are two basic problems that need to be solved. One is to read the words from the input file, one word at a time. The other is to count the words as they are read, then sort and display the final results. You should divide the problem up into two classes as follows:

Class WordReader. This class should act as a filter stream. It should read input from a BufferedReader stream and should provide a method readWord() that returns a string containing the next word from the input each time it is called. The constructor and method are specified as follows:
```
/** Initialize a WordReader that reads from the given source stream */
public WordReader(BufferedReader source) { ... }

/** Return the next word from this WordReader or null if end of file */
public String readWord( ) { ... }
```
Individual words are separated by whitespace characters (tabs, returns, newlines). Ideally, method readWord() would strip out leading and trailing punctuation, so the input line
```
    He said, "The blasted thing works!!"
```
would be returned as the following sequence of words, one word for each call to readWord()
```
    He  said  The  blasted  thing  works
```
However, you can start simple and not worry about leading or trailing punctuation until you have the rest of the program working.
Class WordCount. This class should contain the main method for the program. When it starts, it should open a WordReader stream to get words from the input file. It should then read the words, count them, sort by frequency, and print the results as illustrated earlier. (Details and hints below.) Of course, you should create additional methods besides main and call them as needed. (Feel free to use static methods if it doesn't make sense to create complete objects.) One detail: words that differ only by capitalization should be considered the same word. So if "The" and "the" appear in a file, they should count as two occurrences of the word "the".

WordReader Implementation Hints

The main issue here is how to break the input text into individual words. There are at least two possible approaches.

Read the input one line at a time using readLine(). Use methods in class String to extract individual words as substrings from the input line (methods like substring() and trim()). There are also lots of useful methods in class Character for categorizing characters - isLetter and isWhitespace, for example.
The other possibility is to look at classes StringTokenizer and StreamTokenizer. These are intended to make it "easy" to extract words (tokens) from strings or streams. "Easy" is a bit relative -- it takes some time to learn how to use the tokenizer classes -- but once learned, it's easier than fiddling with substrings and classifying characters.

Which one to use is a matter of personal preference for this assignment. There are useful things to learn and experiences to gain with either approach. You may want to explore both of them to get a sense for which one works best for you before deciding on one or the other.

WordCount Implementation Hints

Store the <word, frequency> pairs in a HashMap. Each word read from the input becomes a key in the map, and the associated value is the number of times the word has been seen in the input text so far.
You will need to use class Integer to turn the integer frequency counts into Objects that can be stored as values in a HashMap. Integer intObj = new Integer(17); creates an Object version of 17; intObj.intValue() extracts the stored int value from the Integer object.
You should convert all words to lowercase letters before using them as a key in the HashMap. That ensures that different capitalizations of the same word are treated the same. The toLowerCase method of class String is ideal for this.
Once you've read the entire input file, you need to extract the <word, frequency> pairs, sort by frequency (and, if you're fancy, alphabetically by word in case of a tie), and print the most frequent pairs. This is where the standard library classes really pay off. For example, HashMap includes a method entrySet() that returns a Set containing all the <key,value> pairs in the HashMap. Set contain a toArray method that returns an array containing its elements. There are various sort methods floating around (see the index of the Javadoc pages), which might be useful (or might not). There are also intriguing classes named SortedMap and SortedSet, which have some interesting properties.You won't need to use all of these things, but, if you think about it a bit, you should find it very easy to convert the list of <word, frequency> pairs into a list of words sorted by frequency.

Written Report

If you turn in the assignment, you must turn in a short report that discusses your program, describes the design, and issues you encountered while working on it. Your report should cover

Planning and operation: How did you organize your code? What design issues did you encounter?
Testing: How did you test your code? What sort of bugs did you encounter? What works and what doesn't? Are there any unresolved problems in the code?
Evaluate this project. What did you learn from it? Was it worth the effort? This could include things you learned about specifications and interfaces, design issues, Java language or library issues, debugging, etc.

A turnin form will be available online. Use it to turn in your files by Monday, December 10, at 10:00 PM. You do not need to hand in a printed copy. Hand in your written report in section on Tuesday, December 11.