Application Overview
The application generates random lists of words that it hopes are English sentences. It does this by reading samples of sentences, written by some human author, and keeping statistics about what words follow next after each word. It then uses these statistics to generate lists of words probabilistically. With (a lot of) luck, what is generated is an English sentence. What's more, ideally it sounds like the author whose writings were used as "training" input.The Data
There are two data files, both novels written by Knut Hamsum. The shorter one, Hunger, was written early in his career (and is one of my favorite novels of all time). The longer one, The Growth of the Soil, was written later in his career. There are two in the hope that comparing program output using each as input will produce sentences that mimic the stylistic changes that these two books, written 27 years apart, exhibit themselves.Application Details
This application has much in common with the Google implementation of HW5. A lexer is used to strip words, one by one, from the input file. Each wordWis put into a binary search tree. The node forWcontains a list, each element of which points to a tree node (i.e., some wordY) that followedWin the training text. The linked list element also keeps a count of how oftenYwas observed immediately followingW.Application
knutimplements the above exactly. It generates random sentences by starting in a state that corresponds to the beginning of sentence. It then picks a next word based on the frequencies of the words that followed sentence start, and outputs it. It then picks a next word depending on the frequencies of what followed that first word, etc.Application
knut2extends this idea slightly. Each node in the binary search tree now represents two consecutive words from the input text, sayW Y. That tree node contains a list, each element of which points to some wordZthat followedW Yin the input, as well as a count of how often that happened. When generating random sentences, if we've outputW Yalready, we pick aZat random (according to input frequencies), then move to the node in the tree corrsponding toY Zand repeat.