Practice Exercise #09

A FileReader object reads the tokens in a text file and counts how many times each occurs, except that it consults an exclusion list and ignores tokens on that list. Having done that, the FileReader remembers the top N most frequently occurring tokens (for some parameterizable N). FileReaders can take set intersections and differences with other FileReaders. You'll be implementing the main functionality of the FileReader class.

We use FileReaders to determine whether "dog" is in fact much more common in the names of country songs than, say, reggae songs. I've taken the freedb database and extracted song titles from it, placing the titles in files whose names are the freedb song genres. Those data files are in directory ex09_files. The command

$ ./ex09_files/ 10 country reggae

will extract the 10 most common words in the titles of country and reggae songs, then show the words that appear in only one genre's list (one set difference in each direction), followed by the words that appear in both lists:

country - reggae
  blues
  don't
  heart
  i'm
  your

reggae - country
  dub
  jah
  man
  no
  up

country intersect reggae
  i
  love
  me
  my
  you
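The three groups above are ordinary set differences and an intersection. As a rough illustration only, assuming each word set is held in a std::set<std::string> (the provided FileReader.h may store and combine them differently), they could be computed with the standard algorithms. The names WordSet, Difference, and Intersection are hypothetical and are not taken from the provided code.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Hypothetical alias: assumes word sets are kept as sorted std::set objects.
using WordSet = std::set<std::string>;

// Words that are in a but not in b (printed above as "a - b").
WordSet Difference(const WordSet& a, const WordSet& b) {
  WordSet out;
  // std::set iterates in sorted order, which set_difference requires.
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                      std::inserter(out, out.end()));
  return out;
}

// Words that are in both a and b (printed above as "a intersect b").
WordSet Intersection(const WordSet& a, const WordSet& b) {
  WordSet out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
  return out;
}
```

With the two top-10 sets from the sample run, Difference(country, reggae) would hold blues, don't, heart, i'm, and your, and Intersection(country, reggae) would hold i, love, me, my, and you.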

You're provided with a complete mainline, ex09.cc, and a FileReader.h that implements the set operations portion of the FileReader. Your job is to implement the two unimplemented FileReader constructors. One simply populates its word set with all words in a file. The other populates its word set with the top N most frequent words, excluding those on the exclusion list.

You should pay some attention to running time for this assignment. Compiled with switches -Wall -std=gnu++0x -g, the sample solution took just under 10 seconds on attu to run the command above, as measured by time. (The load on attu varies a lot, so use that only as a guideline.)