handout #5

CSE341—Programming Languages

ML Programming Assignment #4

due 11 pm Thursday, 4/29/10

In this assignment you will write functions that analyze text to convert it into a series of n-grams and that use those n-grams to generate random text that has the same n-gram properties as the original.  As the name implies, an n-gram is something that depends on a certain choice of n.  So we can have 2-grams, 3-grams, 4-grams, and so on.  The idea is to count the occurrences of word sequences of length n in a text.  For example, if you were to look at the 2-grams of Hamlet, you’d find that the words “To be” occur 6 times in the text.  So we would say that one of the 2-grams of the text is (“To”, “be”, 6).  It turns out that none of these six occurrences is part of the famous soliloquy, because there the second word is followed by a comma: “be,”.  The decision about which characters to include as part of a word is rather arbitrary, but for our purposes we will include the comma, which leads to another 2-gram with an occurrence of 1: (“To”, “be,”, 1).  In other words, there is only one occurrence of that pair of tokens in the input file.

We will once again use the utility functions defined in utility.sml.  You will also need a file called ngram.sml that has supporting functions for this assignment.  You should include all of your answers in a file called hw4.sml.

It would seem natural to store an n-gram as a list of n words along with an integer count.  Unfortunately, that won’t be a useful data structure for us when we want to generate random text.  Instead, we’re going to use the first (n – 1) words as a key that will be used to look up an entry in a map structure.  In the remainder of this specification, the term “leading words” will refer to a set of (n – 1) words being used as a key and the term “completion word” will refer to one of the possible completions for those (n – 1) words.  Each combination of a set of leading words with a completion word will represent a single n-gram.

We will implement our map in such a way that we can quickly look up the entry for any set of leading words.  For each set of leading words, we’ll keep a list of word/count pairs (one for each completion word).  So each one of our entries can represent several different n-grams.  For example, suppose that we were computing the 3-grams of Hamlet.  Then the entries in our map involve keys of two words.  Consider the two-word list [“do”, “you”].  There are 7 different 3-grams that begin with those two words (in other words, there are 7 different completions).  Below is the list of the 7 completions with their frequencies:

[("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
 ("read,",1)]

Using this list, we know that the sequence [“do”, “you”, “think”] occurs 3 times in the text, the sequence [“do”, “you”, “speak”] occurs 1 time, the sequence [“do”, “you”, “look?”] occurs 1 time, and so on.

The file ngram.sml includes the following type definition for an ngram:

datatype ngram = Ngram of string list * int * (string * int) list;

Each ngram has three components: a list of leading words, a total count for this set of leading words, and a list of word/count frequencies for all completion words that follow these leading words in the original text.  For example, the ngram for [“do”, “you”] looks like this:

Ngram
    (["do","you"],9,
     [("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
      ("read,",1)]) : ngram

The number 9 is the sum of all of the frequencies in the word/count list.  Here is another example from Hamlet using the words [“It”, “is”]:

Ngram
    (["It","is"],19,
     [("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
      ("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
      ("most",1),("offended.",1)]) : ngram

The file ngram.sml includes a function called getWords that can be used to get the leading words from an ngram.

As mentioned earlier, we want to have fast lookup for any given set of leading words.  We will accomplish that by storing the ngrams in a binary search tree.  The file ngram.sml has the following definition for the tree:

datatype ngramTree = Empty | Node of ngram * ngramTree * ngramTree;

In all of the problems that follow we will assume that we are working with legal ngrams and legal ngram trees.  In particular, we will assume that:

- we are working with “2 or more”-grams;
- the overall frequency included in an ngram is equal to the sum of the individual frequencies of the various completion words;
- no completion word has a frequency of 0;
- the list of completion words is nonempty;
- the ngram tree is a proper binary search tree based on the leading word lists;
- the ngram tree is nonempty.

We will also assume that data passed to our functions are consistent in terms of their ngram length.  For example, we will assume that our functions won’t be asked to include both a 2-gram and a 3-gram in the same ngram tree.  We will make one additional assumption that the ngram tree contains at least one ngram that begins with a capitalized word.  The reason will become obvious when you work on problem 7.

You are not allowed to solve these problems using arrays, references or vectors but otherwise you are allowed to use the full range of ML constructs and you can call functions from the basis library.  You might find the libraries List and Char particularly helpful.

1.      (5 points) Define a function stringListCompare(list1, list2) that takes two string lists as arguments and that returns a value of type order indicating how the lists compare:

val stringListCompare = fn : string list * string list -> order

Type order is described on page 325 of the textbook.  Your function should compare the lists to find the first pair of words that differ.  If there is no such pair, your function should return EQUAL.  If there is a pair that differs, your function should return LESS if the word from list1 is less than the word from list2 and should return GREATER if the word from list1 is greater than the word from list2.  You may assume the lists are of equal length, although it’s not a bad idea to raise an exception if you find yourself running out of data in one list but not the other.
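One possible shape for this function, using String.compare from the basis library, is sketched below; treat it as an illustration rather than the required solution:

```sml
(* Compare two string lists word by word; the first differing pair
   decides the result.  Unequal lengths indicate bad input. *)
fun stringListCompare ([], []) = EQUAL
  | stringListCompare (x::xs, y::ys) =
      (case String.compare (x, y) of
           EQUAL => stringListCompare (xs, ys)
         | other => other)
  | stringListCompare _ = raise Fail "lists of unequal length";
```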

2.      (10 points) Define a function addTo(ng, w) that takes an ngram and a completion word (a string) as arguments and that returns a new ngram that includes the given word in the list of completions:

val addTo = fn : ngram * string -> ngram

If the word appears in the list of completions in the original ngram, then the new ngram should record that its frequency is 1 more than in the old ngram.  If the word does not appear in the original ngram, it should be added to the list with a frequency of 1.
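As a rough sketch of one approach, using the Ngram constructor defined above, addTo can bump the total and then update or extend the completion list:

```sml
(* Bump the total count, and either increment the matching completion's
   count or add a new (w, 1) pair at the end of the list. *)
fun addTo (Ngram (words, total, completions), w) =
    let
      fun bump [] = [(w, 1)]
        | bump ((w', c) :: rest) =
            if w' = w then (w', c + 1) :: rest
            else (w', c) :: bump rest
    in
      Ngram (words, total + 1, bump completions)
    end;
```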

3.      (10 points) Define a function insert(words, word, tree) that takes a list of leading words (strings), a completion word (string) and an ngram tree and that returns the tree obtained by inserting the leading words/completion word combination into the tree:

val insert = fn : string list * string * ngramTree -> ngramTree

If the set of leading words is already included in the tree, then the completion word should be added to that ngram.  If not, then a new ngram should be added to the tree with a frequency of 1.  The stringListCompare function should be used to order the tree.
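One way this could look, assuming solutions to problems 1 and 2 and the getWords accessor from ngram.sml:

```sml
(* Standard BST insertion, navigating with stringListCompare.  A hit
   delegates to addTo; a miss creates a fresh one-completion ngram. *)
fun insert (words, word, Empty) =
      Node (Ngram (words, 1, [(word, 1)]), Empty, Empty)
  | insert (words, word, Node (ng, left, right)) =
      (case stringListCompare (words, getWords ng) of
           EQUAL => Node (addTo (ng, word), left, right)
         | LESS => Node (ng, insert (words, word, left), right)
         | GREATER => Node (ng, left, insert (words, word, right)));
```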

4.      (5 points) Define a function insertList(list) that takes a list of (leading words, completion word) tuples (string list * string tuples) and that returns an ngram tree built by inserting each combination into an initially empty ngram tree (the combinations can be added in any order):

val insertList = fn : (string list * string) list -> ngramTree
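Because insert already does the per-combination work, this can be a one-line fold; a sketch assuming a solution to problem 3:

```sml
(* Fold each (leading words, completion word) pair into an initially
   empty tree using insert. *)
fun insertList pairs =
    List.foldl (fn ((words, word), tree) => insert (words, word, tree))
               Empty pairs;
```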

5.       (10 points) Define a function find(words, tree) that takes a list of leading words (strings) and an ngram tree as parameters and that returns an ngram option that is either NONE (words not found) or is the ngram associated with that set of leading words in the given search tree:

val find = fn : string list * ngramTree -> ngram option
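This mirrors the structure of insert; a sketch assuming stringListCompare and getWords:

```sml
(* Standard BST search returning an option. *)
fun find (words, Empty) = NONE
  | find (words, Node (ng, left, right)) =
      (case stringListCompare (words, getWords ng) of
           EQUAL => SOME ng
         | LESS => find (words, left)
         | GREATER => find (words, right));
```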

6.      (10 points) Define a function groupWords(n, list) that takes an integer n and a list of words (strings) and that returns a list of all (leading words, completion word) tuples from the list:

val groupWords = fn : int * 'a list -> ('a list * 'a) list

The function should use the first n words to form a tuple, then the n words starting with the second word, then the n words starting with the third word, and so on until it finds there aren’t n words left to form a tuple.  For example, groupWords(3, ["Twas", "brillig", "and", "the", "slithy", "toves"]) should return [(["Twas", "brillig"], "and"), (["brillig", "and"], "the"), (["and", "the"], "slithy"), (["the", "slithy"], "toves")].  Notice that if the list is of length m, then the function returns (m – n + 1) tuples.  You may assume that n is at least 2.  If the list does not have at least n words, your function should return an empty list.
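One sketch of this sliding-window behavior, using List.take, List.nth, and List.tl from the basis library:

```sml
(* Slide a window of n words down the list: the first (n - 1) words are
   the leading words and the nth is the completion.  Recomputing
   List.length each call is quadratic, which is fine for a sketch. *)
fun groupWords (n, list) =
    if List.length list < n then []
    else (List.take (list, n - 1), List.nth (list, n - 1))
         :: groupWords (n, List.tl list);
```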

7.      (15 points) Define a function pickRandom(tree) that takes a legal ngram tree as an argument and that randomly picks a set of leading words where the first word begins with a capital letter:

val pickRandom = fn : ngramTree -> string list

Only some of the ngrams in the tree will begin with capitalized words.  Your function must pick randomly among them, with each possibility having an equal probability of being chosen.  You might want to call the function randomInt() that is part of utility.sml and that returns a random int value.  Because this function has to pick from the ngrams that begin with capitalized words, it is allowed to run in O(n) time where n is the size of the ngram tree.  Remember that you can assume that the tree will contain at least one ngram that begins with a capitalized word.
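One O(n) approach is to collect every qualifying leading-word list and then index into the result with randomInt().  The sketch below assumes randomInt comes from utility.sml; note that SML’s mod yields a nonnegative result for a positive divisor, so a negative random value is harmless here:

```sml
(* Gather the leading-word lists whose first word is capitalized, then
   pick one uniformly at random. *)
fun pickRandom tree =
    let
      fun collect (Empty, acc) = acc
        | collect (Node (ng, left, right), acc) =
            let
              val words = getWords ng
              val acc' = if Char.isUpper (String.sub (hd words, 0))
                         then words :: acc
                         else acc
            in
              collect (left, collect (right, acc'))
            end
      val candidates = collect (tree, [])
    in
      List.nth (candidates, randomInt () mod List.length candidates)
    end;
```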

8.      (10 points) Define a function randomWord(ng) that takes a legal ngram as an argument and that picks a random completion word from the list of completion words, preserving the statistical frequency of the word:

val randomWord = fn : ngram -> string

For example, suppose that we pass the function the ngram for [“It”, “is”] shown earlier:

Ngram
    (["It","is"],19,
     [("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
      ("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
      ("most",1),("offended.",1)]) : ngram

This set of 2 leading words has a total frequency of 19 across its various completion words.  Your function should return the word “a” with a frequency of 4/19 (four nineteenths of the time, which is around 21% of the time).  The word “here,” should be returned 1/19 of the time.  The word “the” should be returned 2/19 of the time.  And so on.  By picking the words in this manner, we will ensure that the text that we build up will have approximately the same ngram statistics as the original text.
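A standard way to do a weighted pick is to draw a number in [0, total) and walk the list subtracting counts; a sketch, again assuming randomInt from utility.sml:

```sml
(* Pick r in [0, total), then walk the completion list subtracting
   counts until r falls within one word's share. *)
fun randomWord (Ngram (_, total, completions)) =
    let
      fun walk (r, (w, c) :: rest) =
            if r < c then w else walk (r - c, rest)
        | walk (_, []) = raise Fail "empty completion list"
    in
      walk (randomInt () mod total, completions)
    end;
```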

9.      (5 points) Define a function buildTree(n, fileName) that takes an integer n and file name (a string) as arguments and that returns an ngram tree constructed using the words from the given input file with ngrams of length n (i.e., having (n – 1) leading words and a set of completion words):

val buildTree = fn : int * string -> ngramTree

You can call the function read included in ngram.sml to convert the file into a list of tokens.  You may assume that the given file exists and is readable and that n is at least 2.
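With the earlier pieces in place, this is just composition; a sketch assuming read from ngram.sml returns the file’s tokens as a string list:

```sml
(* Tokenize the file, form (leading words, completion) pairs, and fold
   them into a search tree. *)
fun buildTree (n, fileName) =
    insertList (groupWords (n, read fileName));
```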

10.   (20 points) Define a function buildList(count, tree) that takes an integer count and an ngram tree as parameters and that returns a list of approximately count words that are randomly generated using the given tree:

val buildList = fn : int * ngramTree -> string list

Below is a pseudocode description of how the list should be constructed:

Use pickRandom to pick a random set of leading words that begins with a capitalized word.
Include the first word from the current set of leading words in the result.
while (not done) {
    “Shift” the leading words by dropping the first word and adding a completion word obtained by calling randomWord on the current set of leading words.
    Include the first word from the current set of leading words in the result.
}

The list of words that you generate should end with a word that could be considered the last word of a sentence (a word ending with a period, question mark or exclamation mark).  You will find the function sentenceEnd that is included in ngram.sml helpful to accomplish this task.  It’s possible that you will be able to generate exactly count words, but more often you will have to generate extra words so that you can also satisfy the constraint that you end with a word that can end a sentence.  Your function should stop generating words once the two constraints have been satisfied (at least count words, ends with a word that can end a sentence).

For example, suppose that you are working with 3-grams generated from Tom Sawyer and are asked to build a list of two words.  The sequence of choices might go like this:

Action                               Current window       Current result
call pickRandom, include first word  ["But", "all"]       ["But"]
shift, include first word            ["all", "things"]    ["But", "all"]
shift, include first word            ["things", "else."]  ["But", "all", "things"]
shift, include first word            ["else.", "At"]      ["But", "all", "things", "else."]
In this case, we had to pick four words before we could stop because it took that long to find a word that could end a sentence.  Notice that we always have a two-word window.  As we proceed, we move the front of this window to the result and pick a new word to go at the end of the window.

There is one special case you are required to handle.  It will occasionally happen that the most recent set of leading words selected is not in the ngram tree.  This happens when you end up using the last (n – 1) words of the original text.  When that occurs, you should add each of these words one at a time to the list until you either run out of words or satisfy the two ending constraints.  For our purposes, we will assume that the last word of the file can end a sentence even if it doesn’t end with a period, question mark or exclamation mark (if these words were good enough to end the original text, then they’re good enough to end our random text).  If picking those words does not meet your quota (i.e., leaves you with fewer than count words), then after including those words, pick another random starting point by choosing a set of leading words that begins with a capitalized word and continue the process from there.
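Putting the pseudocode and the special case together, the overall loop might be sketched as follows.  It assumes solutions to the earlier problems plus sentenceEnd from ngram.sml, and it is only one of several reasonable structures:

```sml
(* result is accumulated in reverse; finished checks both ending
   constraints (at least count words, last word can end a sentence). *)
fun buildList (count, tree) =
    let
      fun finished result =
          List.length result >= count andalso sentenceEnd (hd result)
      fun go (window, result) =
          if finished result then List.rev result
          else (case find (window, tree) of
                    SOME ng =>
                      (* shift the window and keep its new first word *)
                      let val window' = tl window @ [randomWord ng]
                      in go (window', hd window' :: result) end
                  | NONE => flush (window, result))
      (* the window held the last (n - 1) words of the file: add them
         one at a time, restarting from a fresh random window if the
         quota is still unmet when they run out *)
      and flush ([], result) =
            if List.length result >= count then List.rev result
            else let val w = pickRandom tree
                 in go (w, hd w :: result) end
        | flush (x :: xs, result) =
            if finished (x :: result) then List.rev (x :: result)
            else flush (xs, x :: result)
      val start = pickRandom tree
    in
      go (start, [hd start])
    end;
```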

Once you have these functions done, you will probably want to construct a tree of ngrams for a particular text, then generate a random list of words and then call the printList function from ngram.sml to display the results, as in:

- val t = buildTree(3, "hamlet.txt");
- val list = buildList(100, t);
- printList(list);
Horatio, and Marcellus. Fran. I think nothing, my lord. Ham. Arm'd, say
you? Both. Arm'd, my lord. Ham. I will be brief. Your noble son is mad. Mad
call I it; for, to define true madness, What is't but to one table. That's
the end. King. Alas, alas! Ham. A goodly one; in which there are no truant.
But what we do determine oft we break. Purpose is but a shadow. Ros. Truly,
and I behind an arras then. Mark the encounter. If he but blench, I know
not. Is it not monstrous that this folly douts it. Exit. King.
val it = () : unit
-