CSE 303 Homework 6 - The Application and Extensions

Application Overview

The application generates random lists of words that it hopes are English sentences. It does this by reading samples of sentences, written by some human author, and keeping statistics about what words follow next after each word. It then uses these statistics to generate lists of words probabilistically. With (a lot of) luck, what is generated is an English sentence. What's more, ideally it sounds like the author whose writings were used as "training" input.

The Data

There are two data files, both novels written by Knut Hamsum. The shorter one, Hunger, was written early in his career (and is one of my favorite novels of all time). The longer one, The Growth of the Soil, was written later in his career. There are two in the hope that comparing program output using each as input will produce sentences that mimic the stylistic changes that these two books, written 27 years apart, exhibit themselves.

Application Details

This application has much in common with the Google implementation of HW5. A lexer is used to strip words, one by one, from the input file. Each word W is put into a binary search tree. The node for W contains a list, each element of which points to a tree node (i.e., some word Y) that followed W in the training text. The linked list element also keeps a count of how often Y was observed immediately following W.

Application knut implements the above exactly. It generates random sentences by starting in a state that corresponds to the beginning of sentence. It then picks a next word based on the frequencies of the words that followed sentence start, and outputs it. It then picks a next word depending on the frequencies of what followed that first word, etc.

Application knut2 extends this idea slightly. Each node in the binary search tree now represents two consecutive words from the input text, say W Y. That tree node contains a list, each element of which points to some word Z that followed W Y in the input, as well as a count of how often that happened. When generating random sentences, if we've output W Y already, we pick a Z at random (according to input frequencies), then move to the node in the tree corrsponding to Y Z and repeat.