out: Monday, February 1, 2021
due: Friday, February 5, 2021 by 10:00 am
You will be modifying C++ code you won't really understand. You'll also be modifying a makefile without fully understanding it. Take the fact that you can do this as a sign of your rapidly evolving maturity as a programmer. Everything you've learned so far prepares you to complete this exercise without undue pain.
Do a git pull and find directory ex06/.
When you modify C++ code, it will be in file word-pair-frequency.cc, which starts as a copy of word-frequency.cc.
An n-gram is a sequence of n consecutive symbols from text or speech.
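For example, the 2-grams of a word sequence are its pairs of adjacent words. The shell sketch below prints them for a short made-up sentence. The function name bigrams is hypothetical, and treating whitespace-separated tokens as the "symbols" is an assumption made only for this illustration.

```shell
# Hypothetical helper: print every pair of consecutive words read from stdin.
# tr puts each whitespace-separated word on its own line; awk then pairs each
# word with the one before it (NF skips any empty lines).
bigrams() {
  tr -s '[:space:]' '\n' | awk 'NF { if (seen) print prev, $0; prev = $0; seen = 1 }'
}

printf 'the quick brown fox\n' | bigrams
# prints:
#   the quick
#   quick brown
#   brown fox
```

A sequence of k words yields k - 1 such pairs.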
File word-frequency.cc is a short C++ program that reads "words" from stdin, counts the number of times
that each occurs, and then prints a list of all words encountered and the percentage of all occurrences
that each represents. Issue the command make run in the ex06 directory to run it.
word-frequency.cc does not try very hard to follow any normal definition of "word." For instance,
"And" and "and" are distinct words. DO NOT spend time modifying the
program to do a better job. (Or, more accurately, do not do that as part of completing this exercise,
but if you want to improve it outside the exercise that would be a reasonable and manageable additional
experience.)
The application also does not try very hard to print its output in a reasonable order.
Starting with word-frequency.cc, build an application that reads text from stdin,
counts the number of occurrences of each word, and then prints to stdout
the 25 most frequently occurring words, in descending order of frequency,
along with the percentage of all words each represents.
DO NOT modify any C++ code to add this functionality. Instead, modify the supplied makefile.
We expect that the command make run will build and run your code and print the
25 most frequently occurring words read from stdin.
For this exercise, make run is "your application."
Read the man pages for the following:
sort, and its -r, -g, and -k command line switches
head, and its -n command line switch
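To illustrate how those switches combine, here is a minimal sketch on made-up data. It assumes input lines of the form "word percentage"; that layout is an assumption for the example and not necessarily what word-frequency.cc prints, so the key passed to -k may need adjusting. The helper name top_n is hypothetical.

```shell
# Hypothetical helper: keep the N lines with the largest value in column 2.
#   -k2,2      sort on the second whitespace-separated field only
#   -g         compare that field as a general (floating-point) number
#   -r         reverse the order, so the largest values come first
#   head -n N  keep only the first N lines of the sorted output
top_n() {
  sort -k2,2 -g -r | head -n "$1"
}

# Made-up sample data in the assumed "word percentage" layout.
printf 'the 5.2\nand 3.1\nof 4.0\na 10.5\n' | top_n 2
# prints:
#   a 10.5
#   the 5.2
```

Without -g, sort would compare the field as text, and 10.5 would land after 3.1 in the reversed order.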
Modify file word-pair-frequency.cc so that it reports on 2-grams -- pairs of consecutive words.
(Supplied file word-pair-frequency-solution prints word-pair statistics, for reference. It does
not order them or limit its output, though.)
Enhance the makefile so that command make run-pair builds and runs your word pair
code in a way analogous to what make run does for word-frequency.cc (top 25 most
occurring word pairs, in order).
You should tag your files ex06-final and push to your repository.
The data sets we use as inputs come from anc.org, The American National Corpus.
Our file all-written.txt is the concatenation of all the files from the MASC data
set that come from written sources, and our file all-spoken.txt is the concatenation of all
files that are transcriptions of speech.
Copies of our files are in attu:/cse/courses/cse333/21wi/public/anc-masc-data/. If
you run on attu using the makefile, the input will be sourced from those
files. If you run somewhere else, you can copy the input files from attu, or use
your own input files.