CSE 143 Autumn 2000

Homework 1

Due: Electronic submission by 10 pm, Wednesday 10/4. Paper receipt due in quiz section on Thursday 10/5.

Overview

For this assignment, write a program that reads a text file, counts the number of times each distinct word appears in the file, then prints a table of the most frequently occurring words, along with the number of times they appear. We will give you some advice and specify some of the data structures you are to use, but otherwise, you should create the program from scratch.

Concepts

The purpose of this assignment is to review some ideas from previous courses and explore basic parts of C++. Concepts:

Data structures with nested arrays and structs
Source file organization (.cpp and .h files)
C++ strings
C++ stream I/O and files
C++ reference parameters

Program Synopsis

The program should ask the user to enter the name of a text file, read and count the number of times each word appears in the file, ask the user how much word/frequency information to display, then display the requested number of entries, starting with the most common word(s), sorted in descending order of word frequency. If two or more words in the list appear the same number of times, they should be further sorted alphabetically. Example:

Suppose the file demo.txt contains the following text:

     It REALLY is #$!*&%? odd,
     this assignment, isn't it?

An execution of the program using that file as data should produce the following results (user input is in bold italics; everything else is generated by the program).

     Please enter file name: demo.txt
     How many word/frequency pairs do you want? 20
     2    it
     1    assignment
     1    is
     1    isnt
     1    odd
     1    really
     1    this

You can download this sample program and experiment on your own to see how it works.

Program Details

Capitalization is ignored. The words It, IT, it, and iT are considered to be the same.
Punctuation is ignored. This might not be exactly what a real word counting program should do (in English, its and it's mean quite different things), but for this assignment the program should ignore all punctuation marks (i.e., it's and its are both treated as being occurrences of its). "Punctuation" is defined for this assignment as any character c for which the <cctype> library function ispunct(c) returns true.
Any sequence of characters that consists entirely of punctuation is ignored (e.g., #$!*&%? in the above example).
You may assume that there are not more than 300 different words in the input. Your program does not need to handle the situation if the input file has more than 300 different words. (Of course, there may be many more than 300 total words in the input provided that at least some words appear more than once.) The sample program actually checks for too many different words in the input so it won't crash if there are too many, but you don't need to do this.
If the user requests more word/frequency pairs than actually exist, the program should print all the ones that are available (in the example, there are only 7 word frequency pairs printed, even though the user requested 20, because there are only 7 distinct words in the input).
The user needs to enter the full name of the input file (demo.txt in the example, not just demo). If a simple file name is entered, the file needs to be in the same folder as the Visual C++ project (or in the same folder as the demo program, if you're running the sample program). If the program is unable to open the file (most likely because it doesn't exist or its name was spelled incorrectly), it should report the error and terminate.

Implementation Requirements

One of the objectives of this assignment is to learn how to use multiple source files to separate parts of the program into logical groups. This program is fairly simple, but it should be partitioned into a couple of files.

The key data structure needed in this program is a list of word/frequency pairs. This list is an array of structs, along with a count of how many pairs are currently stored in the list. Here are C++ type definitions for this data structure. You must use this data structure to store the list of word/frequency pairs in your program.

     const int MAX_WORDS = 300;	// Max # distinct words in list

     struct WordInfo {           // information about a single word
        string word;             //    the word itself
        int    frequency;        //    # of times it has appeared so far
     };

     struct WordList {              // list of words and frequencies
        WordInfo entry[MAX_WORDS];  //    Word/frequency pairs are
        int	nWords;	           //    stored in list[0..nWords-1]
     };

Here's the key idea for splitting the program into logical pieces. You will need functions in your program to initialize the list to empty, record a new occurrence of a word in the list, sort the list, print part or all of it, and possibly other operations. All of these functions manipulate internal details of the WordList data structure.

The main program, on the other hand, should not depend on the exact layout or representation of a WordList. The main program should include a WordList variable to store the word/frequency information, then call appropriate functions as needed to add words, sort, print, and so forth.

All of the functions that access internal details of the WordList data structure should be placed in a separate source file. No functions in other files should depend on any of the internal implementation details. You'll also need to have an appropriate header file containing the above data definitions along with function prototypes for functions that manipulate WordLists. This header file will be #included in both the file containing the WordList functions, and in the file that contains the main program.

Other Implementation Requirements

Use C++ stream I/O (cin, cout, ifstream, etc.). Do not use printf/scanf/fscanf or other C library functions for I/O.
Use the C++ string library (#include <string>), not the C library functions strcpy, strcmp, and so forth.
Use C++ reference parameters as needed. Do not use explicit pointers.
Your code should be clear and readable. See the style guide for detailed suggestions.

Implementation Hints

You can use tab characters ('\t') to line up the output in columns. The sample program prints each frequency/word pair with a single tab character in between.
Remember to use #include "file.h" to include header files that are part of your code. The angle brackets in #include <library> are used to indicate libraries that are part of Visual C++ or other implementations.

Electronic Submission

When you've finished your program, turn it in using this turnin form. Print out the receipt that appears, staple it, and hand it in during quiz section.