CSE 143 Autumn 2001

Homework #7 - WordCount

This is an optional assignment.

Due: Electronic submission due by 10:00 PM, Monday, December 10.  Written report due in sections on Tuesday, December 11.


Purpose

This assignment is a chance to gain additional experience with container classes, particularly HashMaps, and with file and string processing.  This time you're on your own - you get to put the program together from scratch.

This assignment is entirely optional.  If you do it, you will receive extra credit, which has a good chance of boosting your final course grade at least a little.  If you don't do it, your final grade will not be lowered, regardless of how many other students do the assignment.  This assignment is good preparation for part of the final exam, so even if you choose not to submit it for credit, you may find it useful to think about the key algorithms and data structures.

As always, your code will be evaluated both for how well it meets the requirements and how well it is written - structure and comments/layout/clarity.  A short report is also required.

Word counting problem

One simple kind of statistical analysis performed on literary texts is counting the total number of words they contain. Comparing this number to the number of unique words in the text lets you draw conclusions about, for example, the richness of the vocabulary. A more sophisticated analysis is to count the number of occurrences of each word and rank the words by their frequency of usage. More elaborate forms of this analysis can be used for author identification, genre classification, stylistic comparison, etc.

The goal of this problem is to write a simple program that reports the number of unique words appearing in a text file and lists the top N most frequently occurring words. The program should take two command line parameters (arguments passed to main()): the name of the text file to process and the number N of top words to list. For example, given the arguments {"c:\\hamlet.txt", "5"}, it should produce output on the console in a format like the following:

  Processing text file c:\hamlet.txt...
  Total unique words: 4636
  Sorting word list...
  1. the : 1150
  2. and : 980
  3. to : 755
  4. of : 672
  5. i : 637
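
The counting and ranking behind output like this can be sketched with a HashMap.  This is only a minimal illustration, not a required design: the names WordCountSketch, countWords, and topN are hypothetical, words are lower-cased (the sample output suggests this, since "i" appears in lower case), and ties are broken alphabetically, which is an assumption the assignment does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class and method names are illustrative only.
class WordCountSketch {

    // Count occurrences of each word; lower-casing makes "The" and "the" the same word.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    // Return up to n entries, most frequent first.
    // Ties are broken alphabetically (an assumption, not a stated requirement).
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // A tiny in-memory word list stands in for the words read from a file.
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> counts = countWords(words);
        System.out.println("  Total unique words: " + counts.size());
        System.out.println("  Sorting word list...");
        int rank = 1;
        for (Map.Entry<String, Integer> e : topN(counts, 2)) {
            System.out.println("  " + rank++ + ". " + e.getKey() + " : " + e.getValue());
        }
    }
}
```

Copying the map entries into a list and sorting that list is one straightforward choice here, because a HashMap keeps its entries in no particular order.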

You can find numerous interesting texts at the Project Gutenberg site, or you can use any other input files you wish.

Implementation Requirements

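There are two basic problems that need to be solved.  One is to read the words from the input file, one word at a time.  The other is to count the words as they are read, then sort and display the final results.  You should divide the problem up into two classes, one for each task: WordReader, which reads the words from the input file, and WordCount, which counts the words and then sorts and displays the results.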

WordReader Implementation Hints

The main issue here is how to break the input text into individual words.  There are at least two possible approaches.

  1. Read the input one line at a time using readLine().  Use methods in class String to extract individual words as substrings from the input line (methods like substring() and trim()).  There are also lots of useful methods in class Character for categorizing characters - isLetter and isWhitespace, for example. 
  2. The other possibility is to look at classes StringTokenizer and StreamTokenizer.  These are intended to make it "easy" to extract words (tokens) from strings or streams.  "Easy" is a bit relative -- it takes some time to learn how to use the tokenizer classes -- but once learned, it's easier than fiddling with substrings and classifying characters.
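
As a rough illustration of the first approach, the sketch below scans each line for runs of letters using Character.isLetter() and substring().  The class name LineSplitter and its methods are hypothetical, and treating a word as a maximal run of letters is an assumption: under this definition "don't" splits into "don" and "t", so you may want a different rule.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name LineSplitter is illustrative only.
class LineSplitter {

    // Extract maximal runs of letters from one line, lower-cased.
    static List<String> wordsIn(String line) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            // Skip anything that is not a letter (spaces, digits, punctuation).
            while (i < line.length() && !Character.isLetter(line.charAt(i))) {
                i++;
            }
            int start = i;
            // Advance to the end of the current run of letters.
            while (i < line.length() && Character.isLetter(line.charAt(i))) {
                i++;
            }
            if (i > start) {
                words.add(line.substring(start, i).toLowerCase());
            }
        }
        return words;
    }

    // Read every line with readLine() and collect the words from each one.
    static List<String> readWords(BufferedReader in) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            words.addAll(wordsIn(line));
        }
        return words;
    }
}
```

A reader for a file could be created with new BufferedReader(new FileReader(fileName)) and passed to readWords.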

Which one to use is a matter of personal preference for this assignment.  There are useful things to learn and experiences to gain with either approach.  You may want to explore both of them to get a sense for which one works best for you before deciding on one or the other.
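
For comparison with scanning characters by hand, here is a minimal sketch of the tokenizer approach using StringTokenizer.  The class name TokenizerSplitter and the particular delimiter set are assumptions made for illustration; choosing which characters separate words is part of the design.  With this delimiter set an apostrophe is not a separator, so "don't" stays a single word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper; the name TokenizerSplitter is illustrative only.
class TokenizerSplitter {

    // Delimiters assumed for this sketch: whitespace plus some common punctuation.
    private static final String DELIMS = " \t\n\r\f.,;:!?\"()[]-";

    // Split one line into lower-cased tokens separated by any delimiter character.
    static List<String> wordsIn(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, DELIMS);
        while (st.hasMoreTokens()) {
            words.add(st.nextToken().toLowerCase());
        }
        return words;
    }
}
```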

WordCount Implementation Hints

Written Report

If you turn in the assignment, you must also turn in a short report that discusses your program, describes its design, and explains the issues you encountered while working on it.  Your report should cover

  1. Planning and operation: How did you organize your code?  What design issues did you encounter?
  2. Testing: How did you test your code?  What sort of bugs did you encounter?  What works and what doesn't?  Are there any unresolved problems in the code?
  3. Evaluate this project.  What did you learn from it?  Was it worth the effort?  This could include things you learned about specifications and interfaces, design issues, Java language or library issues, debugging, etc.

A turnin form will be available online.  Use it to turn in your files by Monday, December 10, at 10:00 PM.  You do not need to hand in a printed copy.  Hand in your written report in section on Tuesday, December 11.