CSE 143 Autumn 2000

Homework 2

Due: Electronic submission by 10 pm, Wednesday, Oct. 11. 
Paper receipt due in quiz section on Thursday, Oct. 12.

 

Overview

For this assignment, convert Homework 1 into a C++ program that includes classes and member functions for the main data structure (the WordList). In addition, change the main program so it reads an HTML file instead of a plain text file, but only counts ordinary words in the file, ignoring HTML tags, comments and special symbols.

Concepts

The purpose of this assignment is to gain experience with the following new concepts:

Program Synopsis

The program should ask the user to enter the name of an HTML file, read and count the number of times each plain word appears in the file, ask the user how much word/frequency information to display, then display the requested number of entries, starting with the most common word(s), sorted in descending order of word frequency.  If two or more words in the list appear the same number of times, they should be further sorted alphabetically. HTML tags should be ignored, as should HTML comments and special symbols.  The .html file should be read from the local disk - you are not expected to write a program that opens an http connection to a web server and downloads the file.  
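The display order described above (descending frequency, with ties broken alphabetically) can be expressed as a single comparison function. The following is only a minimal sketch: the struct name WordFreq, its field names, and the functions comesBefore and sortForDisplay are illustrative assumptions, not names required by the assignment, and using std::sort is one possible choice rather than a requirement.

```cpp
#include <algorithm>
#include <string>

// Hypothetical word/frequency pair; the names are illustrative only.
struct WordFreq {
    std::string word;
    int count;
};

// Returns true when a should be listed before b:
// the higher count comes first, and equal counts fall back to
// alphabetical order of the words.
bool comesBefore(const WordFreq& a, const WordFreq& b) {
    if (a.count != b.count) {
        return a.count > b.count;
    }
    return a.word < b.word;
}

// Sorts the first n entries of the array into display order.
void sortForDisplay(WordFreq entries[], int n) {
    std::sort(entries, entries + n, comesBefore);
}
```

Any sorting method carried over from Homework 1 would work equally well, as long as it compares entries the way comesBefore does.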

Example:  Suppose the file test.html contains the following text:

<html>
This is a sample HTML
file.  Isn't it similar
<!-- this is a comment -->
to a normal file with a
few &special; things that
are new. Like < this is not a horse >
tags.
</html>

An execution of the program using that file for input should produce the following results (the user's input is the text typed after each prompt, here test.html and 100; everything else is generated by the program).

Please enter file name: test.html
How many word/frequency pairs do you want? 100
Total Number of Words: 19
3       a
2       file
1       are
1       few
1       html
1       is
1       isnt
1       it
1       like
1       new
1       normal
1       sample
1       similar
1       tags
1       that
1       things
1       this
1       to
1       with

You can download this sample program and experiment on your own to see how it works.

Program Details

  1. This program is an extension of Homework 1. Information in the Program Details section of that assignment applies to this one (rules for capitalization, punctuation, maximum number of distinct words in the input file, etc.).
  2. HTML tags are ignored. An HTML tag is any amount of text that begins with '<' and ends with '>' (not counting comments as defined below). For example: <html>, < this is a tag >, and </title> are all tags and should be ignored.  There may be whitespace between the brackets and the text they bracket, or there may be no extra spaces: <html> and <  this is a tag  >   should both be ignored.
  3. HTML comments are ignored. An HTML comment begins with <!-- and ends with -->. Anything between those strings (including other tags) should be ignored. For example: <!--orange-->, <!-- this is ignored -->, and <!-- so is <all of this> here --> are three separate comments.  Notice that a comment may contain things that look like HTML tags and plain words.  Also, there may be spaces separating <!-- and --> from the text they surround, or there may be no extra spaces.
  4. HTML special characters are ignored. HTML special characters are strings that begin with '&' and end with ';'. For example: &gt;, &lt;, and &special; should all be ignored.
  5. You do not have to deal with tricky cases like tags within tags, mismatched < >, wo<rd or other exceptional cases. Assume the input is formatted so as to correspond to the above definitions of tags, comments and special characters (remembering that a tag can be in a comment, but not vice versa).
  6. In particular, assume that in any file you read in, there is a space ' ' before any '<' character and a space ' ' after any '>' character.
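Because rule 6 guarantees whitespace before every '<' and after every '>', one possible approach is to read the file one whitespace-separated token at a time and decide from the start and end of each token whether it belongs to a comment, a tag, or a special character. The sketch below illustrates that idea under stated assumptions: the function names startsWith, endsWith, and plainTokens are hypothetical, special characters are assumed to appear as tokens of their own (as in the example file), and the Homework 1 rules for capitalization and punctuation would still have to be applied to each token that is returned.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// True if s begins with prefix.
bool startsWith(const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// True if s ends with suffix.
bool endsWith(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Reads whitespace-separated tokens from in and returns only the plain
// ones, skipping comments, tags, and &...; special characters.
std::vector<std::string> plainTokens(std::istream& in) {
    std::vector<std::string> result;
    std::string token;
    while (in >> token) {
        if (startsWith(token, "<!--")) {
            // Comment: discard tokens until one ends with "-->".
            // Tags inside the comment end with '>' but not "-->",
            // so they are discarded along with everything else.
            while (!endsWith(token, "-->") && in >> token) {
            }
        } else if (token[0] == '<') {
            // Tag: discard tokens until one ends with '>'.
            while (!endsWith(token, ">") && in >> token) {
            }
        } else if (token[0] == '&' && endsWith(token, ";")) {
            // Special character such as &gt; is ignored.
        } else {
            result.push_back(token);
        }
    }
    return result;
}
```

The comment check has to come before the tag check, since a comment also starts with '<'. For testing, the function can be given a std::istringstream in place of a file stream.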

Implementation Requirements

A key objective of this assignment is to gain experience with C++ classes.  You are required to replace the WordList structure from Homework 1 with a proper C++ WordList class.  This class should include member functions that perform appropriate operations on WordLists.   Be sure that the representation of a WordList is private, and not accessible outside member functions of the WordList class.  Any member functions that are not part of the public interface should also be private.  Create an appropriate header file containing the class declaration and a companion C++ source file that contains the implementation of the WordList member functions.

Among other private data, the WordList class will contain an array of word/frequency pairs.  These pairs can be implemented with a struct, as in Homework 1, or, if you wish, converted to a class.  However, these pairs should remain a simple data structure, not a complicated class with lots of member functions.
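To make the requirements above concrete, here is a rough sketch of how the class declaration and its implementation might be laid out. Everything specific in it is an assumption made for illustration: the file names, the member function names (add, size, countOf, find), the field names, and the capacity MAX_WORDS, whose real value comes from Homework 1. The two parts are shown in one listing only so that the example is self-contained; in the assignment they would go in a header file and a companion source file.

```cpp
#include <string>

// ---- wordlist.h (hypothetical file name) ----

// Simple word/frequency pair, kept as a plain struct.
struct WordFreq {
    std::string word;
    int count;
};

class WordList {
public:
    WordList();
    // Records one occurrence of word. Returns false if the word is
    // new and the list is already full.
    bool add(const std::string& word);
    // Number of distinct words stored.
    int size() const;
    // Frequency of word, or 0 if it is not in the list.
    int countOf(const std::string& word) const;

private:
    // Assumed capacity; the actual limit is set by Homework 1.
    static const int MAX_WORDS = 1000;
    WordFreq entries[MAX_WORDS];
    int numWords;

    // Index of word in entries, or -1 if absent. Not part of the
    // public interface, so it is private.
    int find(const std::string& word) const;
};

// ---- wordlist.cpp (hypothetical file name) ----

WordList::WordList() : numWords(0) {}

int WordList::find(const std::string& word) const {
    for (int i = 0; i < numWords; i++) {
        if (entries[i].word == word) {
            return i;
        }
    }
    return -1;
}

bool WordList::add(const std::string& word) {
    int i = find(word);
    if (i >= 0) {
        entries[i].count++;
        return true;
    }
    if (numWords == MAX_WORDS) {
        return false;
    }
    entries[numWords].word = word;
    entries[numWords].count = 1;
    numWords++;
    return true;
}

int WordList::size() const {
    return numWords;
}

int WordList::countOf(const std::string& word) const {
    int i = find(word);
    return i >= 0 ? entries[i].count : 0;
}
```

Client code can only reach the array through the public member functions, which is the point of keeping the representation private. A complete solution would also need operations for sorting and printing the list.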

Other Implementation Requirements

Implementation Hints

Electronic Submission

When you've finished your program, turn it in using this turnin form.  Print out the receipt that appears, staple it, and hand it in during quiz section.