![]() |
CSE 143 Autumn 2001Homework #7 - WordCountThis is an optional assignment.Due: Electronic submission due by 10:00 PM, Monday, December 10. Written report due in sections on Tuesday, December 11. |
This assignment is a chance to gain additional experience with container classes, particularly HashMaps, and with file and string processing. This time you're on your own - you get to put the program together from scratch.
This assignment is entirely optional. If you do it, you will receive extra credit, which has a good chance of boosting your final course grade at least a little. If you don't do it, your final grade will not be lowered, regardless of how many other students do the assignment. This assignment is good preparation for part of the final exam so, even if you chose not to submit it for credit, you may find it useful to think about the key algorithms and data structures.
As always, your code will be evaluated both for how well it meets the requirements and how well it is written - structure and comments/layout/clarity. A short report is also required.
One simple kind of statistical analysis performed on literary texts is counting the total number of words they contain. This number can be compared to the number of unique words in the text and make conclusions about the richness of vocabulary for example. A more sophisticated analysis would be to count the number of occurrences of each word, and rank words by their frequency of usage. Sophisticated forms of this analysis can be used for author identification, genre classification, stylistic comparison, etc.
The goal of this problem is to write a simple program, which reports the
number of unique words appearing in a text file, and also lists the top N
most frequently occurring words. The program should have two command line
parameters (arguments passed to main()
):
and should produce output on the console in a format like the following: ({"c:\\hamlet.txt", "5"}
)
Processing text file c:\hamlet.txt... Total unique words: 4636 Sorting word list... 1. the : 1150 2. and : 980 3. to : 755 4. of : 672 5. i : 637
You can find numerous interesting texts at the Project Gutenberg site, or you can use any other input files you wish.
There are two basic problems that need to be solved. One is to read the words from the input file, one word at a time. The other is to count the words as they are read, then sort and display the final results. You should divide the problem up into two classes as follows:
WordReader
. This class should act as a filter stream. It
should read input from a BufferedReader
stream and should provide a method
readWord()
that returns a
string containing the next word from the input each time it is called.
The constructor and method are specified as follows:
/** Initialize a WordReader that reads from the given source stream */ public WordReader(BufferedReader source) { ... } /** Return the next word from this WordReader or null if end of file */ public String readWord( ) { ... }Individual words are separated by whitespace characters (tabs, returns, newlines). Ideally, method
readWord()
would strip out leading and
trailing punctuation, so the input line
He said, "The blasted thing works!!"would be returned as the following sequence of words, one word for each call to
readWord()
He said The blasted thing worksHowever, you can start simple and not worry about leading or trailing punctuation until you have the rest of the program working.
WordCount
. This class should contain the main method for the
program. When it starts, it should open a WordReader
stream to
get words from the input file. It should then read the words, count them,
sort by frequency, and print the results as illustrated earlier. (Details and hints below.) Of course, you should create
additional methods besides main and call them as needed. (Feel free to
use static methods if it doesn't make sense to create complete
objects.) One detail: words that differ only by capitalization should
be considered the same word. So if "The" and "the"
appear in a file, they should count as two occurrences of the word
"the".The main issue here is how to break the input text into individual words. There are at least two possible approaches.
readLine()
. Use methods in
class String
to extract individual words as substrings from the input line
(methods like substring()
and trim()
). There are also lots of useful
methods in class Character for categorizing characters - isLetter
and isWhitespace
, for example. StringTokenizer
and
StreamTokenizer
. These are intended to make it "easy" to
extract words (tokens) from strings or streams. "Easy" is a bit
relative -- it takes some time to learn how to use the tokenizer classes -- but once learned,
it's easier than fiddling with substrings and classifying characters.Which one to use is a matter of personal preference for this assignment. There are useful things to learn and experiences to gain with either approach. You may want to explore both of them to get a sense for which one works best for you before deciding on one or the other.
HashMap
. Each word read
from the input becomes a key in the map, and the associated value is the
number of times the word has been seen in the input text so far. Integer
to turn the integer frequency counts
into Objects
that can be stored as values in a HashMap
.
Integer intObj
= new Integer(17);
creates an Object
version of 17; intObj.intValue()
extracts the stored int
value from the Integer
object.HashMap
. That ensures that different capitalizations
of the same word are treated the same. The toLowerCase
method of
class String
is ideal for this.
HashMap
includes a method entrySet()
that returns a Set
containing all the
<key,value> pairs in the HashMap
. Set
contain a
toArray
method
that returns an array containing its elements. There are various sort
methods floating around (see the index of the Javadoc pages), which might be
useful (or might not). There
are also intriguing classes named SortedMap
and SortedSet
, which have some
interesting properties.You won't need to use all of these things, but, if
you think about it a bit, you should find it very easy to convert the list
of <word, frequency> pairs into a list of words sorted by frequency.If you turn in the assignment, you must turn in a short report that discusses your program, describes the design, and issues you encountered while working on it. Your report should cover
A turnin form will be available online. Use it to turn in your files by Monday, December 10, at 10:00 PM. You do not need to hand in a printed copy. Hand in your written report in section on Tuesday, December 11.