Practice Exercise #10

Implement a C++ program that uses N-gram models to synthesize English language sentences. Your program must construct one or more (likely cyclic) graphs whose nodes are connected using shared_ptrs. We provide a sample executable whose functionality you should mimic, but other than that the design of the implementation is entirely up to you.

Your application works in two phases. In the first, it constructs an N-gram model of word sequences in English sentences. An N-gram is a sequence of (up to) N consecutive words. Our N-gram model can be represented as a directed graph. There is a distinguished start state, corresponding to the start of a sentence. Other nodes represent N-grams. An edge from one node to another indicates that the second N-gram may immediately follow the first.

You build an N-gram model by processing some sample text. For instance, suppose we process this text:

  My many dogs have many fleas I think.
  My many fleas have many dogs I think.

The 1-gram model for that text might look like this:

and the 2-gram model like this:

The edge labels include counts of the number of times that edge was traversed when processing the input text.

These are logical depictions. Your implementation needn't associate information with edges and nodes in exactly this way. But, you must produce a graph of this sort, and the edges must be shared_ptrs to node objects.

The second phase is producing synthetic sentences using the model. To do that, you start in the start state. While in a non-terminal state, you choose an outgoing edge at random, with probability proportional to the count of that edge. For instance, in the 2-gram model above, while in state manydogs you'd choose successor states dogsI and dogshave with equal probability. Each time you traverse an edge you output the token that labels that edge.

Important Details

You can run the sample executable on attu like this:

$ cd /cse/courses/cse333/13wi/ex12_files
$ ./ex12 3 ./datafiles
Constructing model austen
Constructing model carroll
Constructing model dickens
Constructing model doyle
Constructing model grimm
Constructing model howard
Constructing model hugo
Constructing model joyce
Constructing model kafka
Constructing model melville
Constructing model shelley
Constructing model twain
Constructing model whitman

Enter model name to generate sentence using that model,
list for a list of models, or exit to exit: hugo

   Caesar and Tacitus are two successive phenomena, a meeting between whom seems to be mysteriously avoided, by the One who, when He sets the centuries on the stage, to say revolt now and then, but merely to distinguish superficial facts, and always preserving the distinction between revolt, the form, and insurrection, the foundation. 

Enter model name to generate sentence using that model,
list for a list of models, or exit to exit: exit

The command caused the application to construct 3-gram models from all xxx.txt files located in directory ./datafiles. A separate 3-gram model was constructed for each individual text file. In this case, the files correspond to the works of the authors they're named after, as available at Project Gutenberg. Once the models are constructed, you can request a new sentence in the style of any of the authors.

While we're not too picky about the interface, there are some things your code must do:

it must create a separate model for each file in the directory whose name is of the form xxxx.txt
it must maintain those models, and allow the user to repeatedly create sample sentences from any of them
it must cleanly destroy the models when the user exits the application
it must not keep some master list of all the nodes in a model and delete the model using that master list. Once you enter the sentence construction phase, each model object should contain a pointer to its start node, but the only other pointers (or references) to nodes should be in the nodes themselves. (This forces you to do some sort of recursive deletion of the nodes.)

Notes:

We define a sentence to end with the first token that itself ends with a period.
There is a 0-gram model, consisting of only the start state. It produces terrible synthetic sentences. The sample solution chooses not to implement 0-gram models; your code can, if you'd like, but you don't need to either.
Reminder: you must implement a shared_ptr-based graph as the central data structure, even though there are other, arguably more sensible, structures that could be used instead.
Reminder: you must delete the graph before terminating, so that you get a completely clean valgrind report. Figuring out how to do that is a little tricky, but is purposefully left to you.
Time and space are both important. They're not the primary point, but I suspect you could have trouble with either or both if you are careless and the input data is large enough or the system you're using is small enough. Avoid terrible design decisions.