CSE 333 21wi Exercise 6

out: Monday, February 1, 2021
due: Friday, February 5, 2021 by 10:00 am

Exercise Goals

You will be modifying C++ code you won't really understand. You'll also be modifying a makefile without fully understanding it. Take the fact that you can do this as a sign of your rapidly evolving maturity as a programmer. What you've learned so far should prepare you to complete this exercise without undue pain.

Setup

Do a git pull and find directory ex06/. When you modify C++ code, it will be in file word-pair-frequency.cc, which starts as a copy of word-frequency.cc.

Part A: Filter App

An n-gram is a sequence of n consecutive symbols from text or speech. File word-frequency.cc is a short C++ program that reads "words" from stdin, counts the number of times that each occurs, and then prints a list of all words encountered and the percentage of all occurrences that each represents.
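The counting described above can be sketched roughly as follows. This is not the code in word-frequency.cc. The names CountWords, CountWordsInText, and Percent are hypothetical, and the sketch assumes that a "word" is simply a whitespace-delimited token.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not from word-frequency.cc): counts how often each
// whitespace-delimited token appears on the given input stream.
std::map<std::string, std::size_t> CountWords(std::istream& in) {
  std::map<std::string, std::size_t> counts;
  std::string word;
  while (in >> word) {  // operator>> skips whitespace between tokens
    ++counts[word];     // case-sensitive: "And" and "and" are separate keys
  }
  return counts;
}

// Convenience wrapper for trying the counter on an in-memory string.
std::map<std::string, std::size_t> CountWordsInText(const std::string& text) {
  std::istringstream in(text);
  return CountWords(in);
}

// Percentage of all occurrences that `count` represents out of `total`.
double Percent(std::size_t count, std::size_t total) {
  assert(count <= total);
  if (total == 0) {
    return 0.0;  // avoid dividing by zero on empty input
  }
  return 100.0 * static_cast<double>(count) / static_cast<double>(total);
}
```

Calling CountWords(std::cin) (with <iostream> included) would count tokens read from stdin. A std::map keeps its keys in alphabetical order, not in order of frequency.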

Issue the command make run in the ex06 directory to run it.

word-frequency.cc does not try very hard to follow any normal definition of "word." For instance, "And" and "and" are distinct words. DO NOT spend time modifying the program to do a better job. (Or, more accurately, do not do that as part of completing this exercise, but if you want to improve it outside the exercise that would be a reasonable and manageable additional experience.)

The application also does not try very hard to print its output in a reasonable order.

Starting with word-frequency.cc, build an application that reads text from stdin, counts the number of occurrences of each word, and then prints to stdout the 25 most frequently occurring words, in descending order of frequency, along with the percentage of all words each represents. DO NOT modify any C++ code to add this functionality. Instead, modify the supplied makefile.

We expect that the command make run will build and run your code, and that it will print the 25 most frequently occurring words read from stdin. For this exercise, make run is "your application."

Hints

Read the man pages for the following:

Part B: Word Pairs

Modify file word-pair-frequency.cc so that it reports on 2-grams -- pairs of consecutive words. (Supplied file word-pair-frequency-solution prints word-pair statistics, for reference. It does not order them or limit its output, though.)
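As an illustration of what counting 2-grams means, here is a minimal sketch. It is not the supplied solution and is not based on the structure of word-pair-frequency.cc. The names WordPair, CountWordPairs, and CountWordPairsInText are hypothetical, and words are again assumed to be whitespace-delimited tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical type for a 2-gram: two consecutive words, in order.
using WordPair = std::pair<std::string, std::string>;

// Hypothetical helper: counts each pair of consecutive tokens on the stream.
std::map<WordPair, std::size_t> CountWordPairs(std::istream& in) {
  std::map<WordPair, std::size_t> counts;
  std::string prev;
  if (!(in >> prev)) {
    return counts;  // no words at all, so no pairs
  }
  std::size_t words = 1;
  std::size_t pairs = 0;
  std::string curr;
  while (in >> curr) {
    ++counts[WordPair(prev, curr)];  // (previous word, current word)
    ++words;
    ++pairs;
    prev = curr;  // the current word starts the next pair
  }
  assert(pairs + 1 == words);  // n words always yield n - 1 pairs
  return counts;
}

// Convenience wrapper for trying the counter on an in-memory string.
std::map<WordPair, std::size_t> CountWordPairsInText(const std::string& text) {
  std::istringstream in(text);
  return CountWordPairs(in);
}
```

For the input "a b a b c" this counts the pair (a, b) twice and the pairs (b, a) and (b, c) once each. Order matters, so (a, b) and (b, a) are different 2-grams.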

Enhance the makefile so that command make run-pair builds and runs your word pair code in a way analogous to what make run does for word-frequency.cc (top 25 most occurring word pairs, in order).

Hints

Turn-In

You should tag your files ex06-final and push to your repository.

The Data Sets

The data sets we use as inputs come from anc.org, The American National Corpus. Our file all-written.txt is the concatenation of all the files from the MASC data set that come from written sources, and our file all-spoken.txt is the concatenation of all files that are transcriptions of speech.

Copies of our files are in attu:/cse/courses/cse333/21wi/public/anc-masc-data/. If you run on attu using the makefile, the input will be sourced from those files. If you run somewhere else, you can copy the input files from attu, or use your own input files.