handout #9

CS341—Programming Languages

ML Programming Assignment #4

due 11 pm Wednesday, 1/31/07

To complete this problem set you will need a set of utility functions defined in a file called utility.sml, available from the class web page.  This file has been updated since assignment 3, so you should obtain the latest copy (your old programs should still work with the new version).  You will also need a file called ngram.sml that has supporting functions for this assignment.

In this assignment you will write functions that analyze text to convert it into a series of n-grams and that use those n-grams to generate random text with the same n-gram properties as the original.  As the name implies, an n-gram depends on a particular choice of n, so we can have 2-grams, 3-grams, 4-grams, and so on.  The idea is to count the occurrences of word sequences of length n in a text.  For example, if you were to look at the 2-grams of Hamlet, you’d find that the words “To be” occur 6 times in the text.  So we would say that one of the 2-grams of the text is (“To”, “be”, 6).  It turns out that none of these six occurrences comes from the famous soliloquy, because there the word “be” is followed by a comma.  The decision about which characters to include as part of a word is rather arbitrary, but for our purposes we will include the comma, which leads to another 2-gram with a count of 1: (“To”, “be,”, 1).  In other words, there is only one occurrence of that pair of tokens in the input file.

It would seem natural to store an n-gram as a list of n words along with an integer count.  Unfortunately, that won’t be a useful data structure for us when we want to generate random text.  Instead, we’re going to use (n – 1) words as a key that can be used to look up an entry in a map structure.  We will implement the map in such a way that we can quickly look up the entry for any set of (n – 1) words.  For each set of (n – 1) words, we’ll keep a list of word/count pairs, so each one of our entries can represent several different n-grams.  For example, suppose that we were computing the 3-grams of Hamlet.  Then the entries in our map would have keys of two words.  Consider the two-word list [“do”, “you”].  There are 7 different 3-grams that begin with those two words (in other words, there are 7 different completions).  Below is the list of the 7 completions with their frequencies:

[("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
 ("read,",1)]

Using this list, we know that the sequence [“do”, “you”, “think”] occurs 3 times in the text, the sequence [“do”, “you”, “speak”] occurs 1 time, the sequence [“do”, “you”, “look?”] occurs 1 time, and so on.

The file ngram.sml includes the following type definition for an ngram:

datatype ngram = Ngram of string list * int * (string * int) list;

Each ngram has three components: a list of (n – 1) words, a total count for this set of (n – 1) words, and a list of word/count frequencies for all words that follow those (n – 1) words in the original text.  For example, the n-gram for [“do”, “you”] looks like this:

Ngram
    (["do","you"],9,
     [("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
      ("read,",1)]) : ngram

The number 9 is the sum of all of the frequencies in the word/count list.  Here is another example from Hamlet using the words [“It”, “is”]:

Ngram
    (["It","is"],19,
     [("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
      ("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
      ("most",1),("offended.",1)]) : ngram

The file ngram.sml includes a function called getWords that can be used to get the list of (n – 1) leading words from an ngram.

As mentioned earlier, we want fast lookup for any given set of (n – 1) words.  We will accomplish that by storing the ngrams in a binary search tree.  The file ngram.sml has the following definition for the tree:

datatype ngramTree = Empty | Node of ngram * ngramTree * ngramTree;
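For concreteness, here is a small hand-built tree holding two entries (the data is made up for illustration).  Note that with the default string ordering, capitalized words sort before lowercase ones, so [“It”, “is”] belongs to the left of [“do”, “you”]:

val tiny =
    Node (Ngram (["do","you"], 1, [("think",1)]),
          Node (Ngram (["It","is"], 2, [("a",1),("not",1)]), Empty, Empty),
          Empty);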

In all of the problems that follow we will assume that we are working with legal ngrams and legal ngram trees.  In particular, we will assume that n is at least 2; that the overall frequency stored in an ngram equals the sum of the individual frequencies of the various completion words; that no completion word has a frequency of 0; that the list of completion words is nonempty; that the ngram tree is a proper binary search tree ordered on the (n – 1) word lists; and that the ngram tree is nonempty.  We will make one additional assumption: the ngram tree contains at least one ngram that begins with a capitalized word.  The reason will become obvious when you work on problem 7.

You are not allowed to solve these problems using arrays, references or vectors but otherwise you are allowed to use the full range of ML constructs and you can call functions from the basis library.  You might find the libraries List and Char particularly helpful.

1.      (5 points) Define a function stringListCompare(list1, list2) that takes two string lists as arguments and that returns a value of type order indicating how the lists compare:

val stringListCompare = fn : string list * string list -> order

Type order is described on page 325 of the textbook.  Your function should compare the lists to find the first pair of words that differ.  If there is no such pair, your function should return EQUAL.  If there is a pair that differs, your function should return LESS if the word from list1 is less than the word from list2 and GREATER if the word from list1 is greater than the word from list2.  You may assume the lists are of equal length, although it’s not a bad idea to raise an exception if you find yourself running out of data in one list but not the other.
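For example, assuming individual words are compared with the standard String.compare ordering, the function should behave like this (note that capital letters order before lowercase letters):

- stringListCompare(["do","you"], ["do","you"]);
val it = EQUAL : order
- stringListCompare(["do","we"], ["do","you"]);
val it = LESS : order
- stringListCompare(["It","is"], ["do","you"]);
val it = LESS : order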

2.      (10 points) Define a function addTo(ng, w) that takes an ngram and a word (a string) as arguments and that returns a new ngram that includes the given word in the list of completions:

val addTo = fn : ngram * string -> ngram

If the word appears in the list of completions in the original ngram, then the new ngram should record a frequency 1 more than in the old ngram.  If the word does not appear in the original ngram, it should be added to the list with a frequency of 1.  In either case, the total count should also increase by 1, so that it remains the sum of the individual frequencies.
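For example, using an abbreviated, made-up ngram rather than the real Hamlet data (the exact order of the completion list may depend on your implementation):

- addTo(Ngram (["do","you"], 9, [("think",3),("speak",1)]), "speak");
val it = Ngram (["do","you"],10,[("think",3),("speak",2)]) : ngram
- addTo(Ngram (["do","you"], 9, [("think",3),("speak",1)]), "go");
val it = Ngram (["do","you"],10,[("think",3),("speak",1),("go",1)]) : ngram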

3.      (10 points) Define a function insert(words, word, tree) that takes a list of (n – 1) words (strings), a completion word (string) and an ngram tree and that returns the tree obtained by inserting the words/word combination into the tree:

val insert = fn : string list * string * ngramTree -> ngramTree

If the set of words is already included in the tree, then the completion word should be added to that ngram.  If not, then a new ngram should be added to the tree.  The stringListCompare function should be used to order the tree.

4.       (5 Points) Define a function insertList(list) that takes a list of (n – 1 words, completion word) tuples (of type string list * string) and that returns the ngram tree obtained by inserting each combination into an empty ngram tree:

val insertList = fn : (string list * string) list -> ngramTree
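For example, if the tuples are inserted from left to right with a standard binary-search-tree insertion, we would expect something like the following (made-up data; the exact shape of your tree depends on your insertion order):

- insertList [(["do","you"], "think"), (["It","is"], "a"), (["do","you"], "think")];
val it =
    Node (Ngram (["do","you"],2,[("think",2)]),
          Node (Ngram (["It","is"],1,[("a",1)]), Empty, Empty), Empty)
  : ngramTree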

5.       (10 points) Define a function find(words, tree) that takes a list of (n – 1) words (strings) and an ngram tree as parameters and that returns an ngram option that is either empty (words not found) or is the ngram associated with that set of (n – 1) leading words in the given search tree:

val find = fn : string list * ngramTree -> ngram option
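For example, with a tiny hand-built tree (made-up data):

- val t = Node (Ngram (["do","you"], 1, [("think",1)]), Empty, Empty);
- find(["do","you"], t);
val it = SOME (Ngram (["do","you"],1,[("think",1)])) : ngram option
- find(["It","is"], t);
val it = NONE : ngram option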

6.      (10 Points) Define a function groupWords(n, list) that takes an integer n and a list of words (strings) and that returns a list of all (n – 1 words, completion word) tuples from the list:

val groupWords = fn : int * 'a list -> ('a list * 'a) list

The function should use the first n words to form a tuple, then the n words starting with the second word, then the n words starting with the third word, and so on until there aren’t n words left to form a tuple.  For example, groupWords(3, ["Twas", "brillig", "and", "the", "slithy", "toves"]) should return [(["Twas", "brillig"], "and"), (["brillig", "and"], "the"), (["and", "the"], "slithy"), (["the", "slithy"], "toves")].  Notice that if the list is of length m, then the function returns (m – n + 1) tuples.  You may assume that n is at least 2.  If the list does not have at least n words, your function should return an empty list.

7.      (15 Points) Define a function pickRandom(tree) that takes a legal ngram tree as an argument and that randomly picks a set of (n – 1) leading words where the first word begins with a capital letter:

val pickRandom = fn : ngramTree -> string list

Only some of the ngrams in the tree will begin with capitalized words.  Your function must pick randomly among them, with each possibility having an equal probability of being chosen.  You might want to call the function randomInt() that is now part of the utility file and that returns a random int value.  Because this function has to pick from the ngrams that begin with capitalized words, it is allowed to run in O(n) time, where n is the size of the ngram tree.
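One workable approach (just a sketch under our assumptions, not the only way to do it) is to traverse the tree collecting the leading-word lists whose first word is capitalized, and then use randomInt() to index into that list.  This sketch assumes getWords from ngram.sml and randomInt() from utility.sml:

fun candidates Empty = []
  | candidates (Node (ng, left, right)) =
        let val words = getWords ng
            val rest = candidates left @ candidates right
        in  if Char.isUpper (String.sub (hd words, 0))
            then words :: rest
            else rest
        end;

fun pickRandom tree =
    let val cs = candidates tree
    in  List.nth (cs, abs (randomInt ()) mod length cs) end;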

8.      (10 Points) Define a function randomWord(ng) that takes a legal ngram as an argument and that picks a random completion word from the list of completion words, preserving the statistical frequency of the word:

val randomWord = fn : ngram -> string

For example, suppose that we pass the function the ngram mentioned in the introduction:

Ngram
    (["It","is"],19,
     [("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
      ("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
      ("most",1),("offended.",1)]) : ngram

This set of 2 leading words has a total frequency of 19 across the various completion words.  Your function should return the word “a” with a frequency of 4/19 (four nineteenths of the time, which is around 21% of the time).  The word “here,” should be returned 1/19 of the time.  The word “the” should be returned 2/19 of the time.  And so on.  By picking words in this manner, we will ensure that the text we build up has approximately the same ngram statistics as the original text.
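One standard technique for this kind of weighted selection (a sketch, assuming randomInt() from utility.sml) is to draw a number r uniformly from [0, total), then walk down the completion list subtracting counts until r falls within some word’s share:

fun randomWord (Ngram (_, total, pairs)) =
    let fun walk (r, (w, c) :: rest) =
                if r < c then w else walk (r - c, rest)
          | walk (_, []) = raise Fail "bad ngram"
    in  walk (abs (randomInt ()) mod total, pairs) end;

With the [“It”, “is”] ngram above, r falls in [0, 4) with probability 4/19, so “a” is chosen 4/19 of the time, as required.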

9.       (5 Points) Define a function buildTree(n, fileName) that takes an integer n and a file name (a string) as arguments and that returns an ngram tree constructed using the words from the given input file:

val buildTree = fn : int * string -> ngramTree

You can call the function read included in ngram.sml to convert the file into a list of tokens.  You may assume that the given file exists and is readable and that n is at least 2.

10.   (20 Points) Define a function buildList(n, tree) that takes an integer n and an ngram tree as parameters and that returns a list of approximately n words that are randomly generated using the given tree:

val buildList = fn : int * ngramTree -> string list

The function will have to pick a random set of (n – 1) words to use as the starting words.  Using those words, it should call randomWord to pick the next word.  Then it should take the most recent (n – 1) words and again pick a random completion word based on those words, and so on.

The list of words that you generate should end with a word that could be considered the last word of a sentence (a word ending with a period, question mark or exclamation mark).  You will find the function sentenceEnd that is included in ngram.sml helpful to accomplish this task.  Your function should randomly generate at least n words, but should keep generating words until it has generated one that can end a sentence.  In other words, it should be as short as it can be so that it both ends with a word that can end a sentence and has at least n words.

There is one special case you are required to handle.  It will occasionally happen that the most recent (n – 1) words selected are not in the ngram tree.  This happens when you end up using the last (n – 1) words of the original text.  When that occurs, append the (n – 1) words to the end of your answer and if that meets your quota of words to produce, then you can end.  Don’t worry about whether or not the last word can end a sentence because if these words were good enough to end the original text, then they’re good enough to end our random text.  If picking those words does not meet your quota, then after including those words, pick another random starting point by choosing a set of (n – 1) words that begin with a capitalized word.

Once you have these functions done, you will probably want to construct a tree of ngrams for a particular text, then generate a random list of words and then call the printList function from ngram.sml to display the results, as in:

- val t = buildTree(3, "hamlet.txt");
- val list = buildList(100, t);
- printList(list);
Horatio, and Marcellus. Fran. I think nothing, my lord. Ham. Arm'd, say
you? Both. Arm'd, my lord. Ham. I will be brief. Your noble son is mad. Mad
call I it; for, to define true madness, What is't but to one table. That's
the end. King. Alas, alas! Ham. A goodly one; in which there are no truant.
But what we do determine oft we break. Purpose is but a shadow. Ros. Truly,
and I behind an arras then. Mark the encounter. If he but blench, I know
not. Is it not monstrous that this folly douts it. Exit. King.

val it = () : unit
-

As question 10 indicates, you have to deal with certain unusual cases where the generator gets stuck.  But in general, for these questions you can assume that we are dealing with a reasonably large input file with a variety of capitalized words and words that can end a sentence.  We’ll also assume that the various integers passed as parameters make sense.  If someone asks for the 0-grams of a file, they deserve whatever strange result they get.