handout #9
CS341—Programming Languages
ML Programming
Assignment #4
due 11 pm Wednesday, 1/31/07
In working on this
problem set you will need a set of utility functions defined in a file called
utility.sml, available from the class web page.
This file has been updated since assignment 3, so you should obtain the
latest copy (your old programs should still work with this new version). You will also need a file called ngram.sml
that has supporting functions for this assignment.
In this assignment you
will write functions that analyze text to convert it into a series of n-grams
and that use those n-grams to generate random text that has the same n-gram
properties as the original. As the name
implies, an n-gram is something that depends on a certain choice of n. So we can have 2-grams, 3-grams, 4-grams, and
so on. The idea is to count the
occurrences of word sequences of length n in a text. For example, if you were to look at the
2-grams of Hamlet, you’d find
that the words “To be” occur 6 times in the text. So we would say that one of the 2-grams of
the text is (“To”, “be”, 6). It turns
out that none of these six occurrences is part of the famous soliloquy, because
that one has a comma after “be”. The decision about what characters to include
as part of the word is rather
arbitrary, but for our purposes, we will include the comma, which
leads to another 2-gram with an occurrence of 1: (“To”, “be,”, 1). In other words, there is only one occurrence
of that pair of tokens in the input file.
It would seem natural
to store an n-gram as a list of n words along with an integer count. Unfortunately, that won’t be a useful data
structure for us when we want to generate random text. Instead, we’re going to use (n – 1) words as
a key that will be used to look up an entry in a map structure. We will implement it in such a way that we
can quickly look up the entry for any set of (n – 1) words. For each set of (n – 1) words, we’ll keep a
list of word/count pairs. So each one of our
entries can represent several different n-grams. For example, suppose that we were computing
the 3-grams of Hamlet. Then the entries
in our map involve keys of two words. Consider
the two-word list [“do”, “you”]. There
are 7 different 3-grams that begin with those two words (in other words, there
are 7 different completions). Below is
the list of the 7 completions with their frequencies:
[("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
("read,",1)]
Using this list, we
know that the sequence [“do”, “you”, “think”] occurs 3 times in the text, the
sequence [“do”, “you”, “speak”] occurs 1 time, the sequence [“do”, “you”,
“look?”] occurs 1 time, and so on.
The file ngram.sml
includes the following type definition for an ngram:
datatype
ngram = Ngram of string list * int * (string * int) list;
Each
ngram has three components: a list of (n – 1) words, a total count for this set
of (n – 1) words, and a list of word/count frequencies for all words that
follow those (n – 1) words in the original text. For example, the n-gram for [“do”, “you”]
looks like this:
Ngram
(["do","you"],9,
[("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
("read,",1)])
: ngram
The number 9 is the
sum of all of the frequencies in the word/count list. Here is another example from Hamlet using the
words [“It”, “is”]:
Ngram
(["It","is"],19,
[("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
("most",1),("offended.",1)])
: ngram
The file ngram.sml includes a
function called getWords that can be used to get the list of (n – 1) leading
words from an ngram.
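For example, getWords can be written with a single pattern match on the constructor (the real definition is already provided for you in ngram.sml, so this is just an illustration):

```sml
(* Extract the (n - 1) leading words from an ngram by pattern
   matching on the Ngram constructor. *)
fun getWords (Ngram(words, _, _)) = words;
```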
As mentioned earlier,
we want to have fast lookup for any given set of (n – 1) words. We will accomplish that by storing the ngrams
in a binary search tree. The file
ngram.sml has the following definition for the tree:
datatype
ngramTree = Empty | Node of ngram * ngramTree * ngramTree;
In all of the problems
that follow we will assume that we are working with legal ngrams and legal
ngram trees. In particular, we will
assume that we are working with “2 or more”-grams, that the overall frequency
included in an ngram is equal to the sum of the individual frequencies of the
various completion words, that no completion words have a frequency of 0, that
the list of completion words is nonempty, that the ngram tree is a proper
binary search tree based on the (n – 1) word lists, and that the ngram tree is
nonempty. We will make one additional
assumption that the ngram tree contains at least one ngram that begins with a
capitalized word. The reason will become
obvious when you work on problem 7.
You are not allowed to
solve these problems using arrays, references or vectors but otherwise you are
allowed to use the full range of ML constructs and you can call functions from
the basis library. You might find the
libraries List and Char particularly helpful.
1.
(5 points) Define a function
stringListCompare(list1, list2) that takes two string lists as arguments and
that returns a value of type order indicating how the lists compare:
val
stringListCompare = fn : string list * string list -> order
Type order is
described on page 325 of the textbook.
Your function should compare the lists to find the first pair of words
that differ. If there is no such pair,
your function should return EQUAL. If there
is a pair that differs, your function should return LESS if the word from list1
is less than the word from list2 and should return GREATER if the word from
list1 is greater than the word from list2.
You may assume the lists are of equal length, although it’s not a bad
idea to raise an exception if you find yourself running out of data in one list
but not the other.
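One possible shape for this function walks the two lists in parallel and defers to String.compare on each pair of words; the Mismatch exception here is our own invented name for the unequal-length case:

```sml
(* Sketch of stringListCompare: compare corresponding words until a
   difference is found.  Mismatch is a made-up exception for lists of
   unequal length, which legal inputs should never trigger. *)
exception Mismatch;

fun stringListCompare([], []) = EQUAL
  | stringListCompare(x::xs, y::ys) =
      (case String.compare(x, y) of
           EQUAL => stringListCompare(xs, ys)
         | order => order)
  | stringListCompare(_, _) = raise Mismatch;
```

For example, stringListCompare(["do","you"], ["do","your"]) evaluates to LESS.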
2.
(10 points) Define a function addTo(ng,
w) that takes an ngram and a word (a string) as arguments and that returns a
new ngram that includes the given word in the list of completions:
val addTo
= fn : ngram * string -> ngram
If the word appears in
the list of completions in the original ngram, then the new ngram should record
that its frequency is 1 more than in the old ngram. If the word does not appear in the original
ngram, it should be added to the list with a frequency of 1.
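A sketch of the idea, assuming the ngram datatype from ngram.sml: bump the matching completion’s count, or append a new (w, 1) pair, and add 1 to the total either way:

```sml
(* Sketch of addTo: increment w's count in the completion list (or add
   it with count 1), and increment the overall total. *)
fun addTo(Ngram(words, total, freqs), w) =
  let
    fun bump [] = [(w, 1)]
      | bump ((word, count)::rest) =
          if word = w then (word, count + 1) :: rest
          else (word, count) :: bump rest
  in
    Ngram(words, total + 1, bump freqs)
  end;
```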
3.
(10 points) Define a function
insert(words, word, tree) that takes a list of (n – 1) words (strings), a
completion word (string) and an ngram tree and that returns the tree obtained
by inserting the words/word combination into the tree:
val
insert = fn : string list * string * ngramTree -> ngramTree
If the set of words is
already included in the tree, then the completion word should be added to that
ngram. If not, then a new ngram should
be added to the tree. The
stringListCompare function should be used to order the tree.
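One way to structure the insertion, assuming working solutions to the earlier problems (stringListCompare, addTo) and the getWords function from ngram.sml:

```sml
(* Sketch of insert: standard BST insertion keyed on the (n - 1)
   leading words; on an exact match, fold the completion word into
   that node's ngram with addTo. *)
fun insert(words, word, Empty) =
      Node(Ngram(words, 1, [(word, 1)]), Empty, Empty)
  | insert(words, word, Node(ng, left, right)) =
      (case stringListCompare(words, getWords ng) of
           LESS    => Node(ng, insert(words, word, left), right)
         | GREATER => Node(ng, left, insert(words, word, right))
         | EQUAL   => Node(addTo(ng, word), left, right));
```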
4.
(5
Points) Define a function insertList(list) that takes a list of (n – 1 words,
completion word) tuples (string list * string tuples) and that returns the
ngram tree obtained by inserting each combination into an empty ngram tree:
val
insertList = fn : (string list * string) list -> ngramTree
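Given a working insert, this is a natural fit for a fold over the list, starting from the Empty tree:

```sml
(* Sketch of insertList: fold each (leading words, completion) pair
   into an initially empty ngram tree. *)
fun insertList list =
  foldl (fn ((words, word), tree) => insert(words, word, tree))
        Empty list;
```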
5.
(10 points) Define a function find(words,
tree) that takes a list of (n – 1) words (strings) and an ngram tree as
parameters and that returns an ngram option that is either empty (words not
found) or is the ngram associated with that set of (n – 1) leading words in the
given search tree:
val find
= fn : string list * ngramTree -> ngram option
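This is ordinary BST search, again keyed with stringListCompare; a sketch:

```sml
(* Sketch of find: descend left or right until the leading words
   match, returning NONE if we fall off the tree. *)
fun find(words, Empty) = NONE
  | find(words, Node(ng, left, right)) =
      (case stringListCompare(words, getWords ng) of
           LESS    => find(words, left)
         | GREATER => find(words, right)
         | EQUAL   => SOME ng);
```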
6.
(10 Points) Define a function
groupWords(n, list) that takes an integer n and a list of words (strings) and
that returns a list of all (n – 1 words, completion word) tuples from the list:
val
groupWords = fn : int * 'a list -> ('a list * 'a) list
The function should
use the first n words to form a tuple, then the n words starting with the
second word, then the n words starting with the third word, and so on until it
finds there aren’t n words left to form a tuple. For example, groupWords(3, ["Twas",
"brillig", "and", "the", "slithy",
"toves"]) should return [(["Twas", "brillig"],
"and"), (["brillig", "and"], "the"),
(["and", "the"], "slithy"), (["the",
"slithy"], "toves")].
Notice that if the list is of length m, then the function returns (m – n
+ 1) tuples. You may assume that n is at
least 2. If the list does not have at
least n words, your function should return an empty list.
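One natural recursion peels a tuple off the front of the list and then recurses on the tail; a sketch using List.take and List.nth:

```sml
(* Sketch of groupWords: the first (n - 1) words form the key and the
   nth word is the completion; then slide the window forward by one. *)
fun groupWords(n, list) =
  if length list < n then []
  else (List.take(list, n - 1), List.nth(list, n - 1))
       :: groupWords(n, tl list);
```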
7.
(15 Points) Define a function
pickRandom(tree) that takes a legal ngram tree as an argument and that randomly
picks a set of (n – 1) leading words where the first word begins with a capital
letter:
val
pickRandom = fn : ngramTree -> string list
Only some of the
ngrams in the tree will begin with capitalized words. Your function must pick randomly among them
with each possibility having equal probability of being chosen. You might want to call the function randomInt(), now part of utility.sml, which
returns a random int value. Because this
function has to pick from the ngrams that begin with capitalized words, it is
allowed to run in time proportional to the size of the
ngram tree.
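One straightforward approach within the allowed time bound: collect the leading-word lists of every ngram whose first word is capitalized, then use randomInt() from utility.sml to index into that list (SML’s mod yields a nonnegative result here even for a negative random int):

```sml
(* Sketch of pickRandom: gather all capitalized (n - 1)-word keys,
   then pick one uniformly at random with randomInt() from
   utility.sml. *)
fun pickRandom tree =
  let
    fun capitalized (w::_) = Char.isUpper(String.sub(w, 0))
      | capitalized []     = false
    fun collect Empty = []
      | collect (Node(ng, left, right)) =
          let val words = getWords ng
          in (if capitalized words then [words] else [])
             @ collect left @ collect right
          end
    val candidates = collect tree
  in
    List.nth(candidates, randomInt() mod length candidates)
  end;
```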
8.
(10 Points) Define a function
randomWord(ng) that takes a legal ngram as an argument and that picks a random
completion word from the list of completion words, preserving the statistical
frequency of the word:
val
randomWord = fn : ngram -> string
For example, suppose
that we pass the function the ngram mentioned in the introduction:
Ngram
(["It","is"],19,
[("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
("most",1),("offended.",1)])
: ngram
This set of 2 leading
words has a total frequency of 19 over its various completion words. Your function should return the word “a” with
a frequency of 4/19 (four nineteenths of the
time, which is around 21% of the time).
The word “here,” should be returned 1/19 of the time. The word “the” should be returned 2/19 of the
time. And so on. By picking the words in this manner, we will
ensure that the text that we build up will have approximately the same ngram
statistics as the original text.
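One way to implement the weighted choice: draw r uniformly from [0, total) and walk down the frequency list, subtracting counts until r falls inside some word’s range:

```sml
(* Sketch of randomWord: a weighted pick.  With r uniform in
   [0, total), each word is chosen with probability count/total. *)
fun randomWord (Ngram(_, total, freqs)) =
  let
    fun walk(r, (word, count)::rest) =
          if r < count then word else walk(r - count, rest)
      | walk(_, []) = raise Fail "empty completion list"
  in
    walk(randomInt() mod total, freqs)
  end;
```

For a legal ngram the total equals the sum of the counts, so walk never reaches the empty list.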
9.
(5
Points) Define a function buildTree(n, fileName) that takes an integer n and
file name (a string) as arguments and that returns an ngram tree constructed
using the words from the given input file:
val
buildTree = fn : int * string -> ngramTree
You can call the
function read included in ngram.sml to convert the file into a list of tokens. You may assume that the given file exists and
is readable and that n is at least 2.
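With the earlier pieces in place, buildTree is essentially a composition of read, groupWords, and insertList:

```sml
(* Sketch of buildTree: tokenize the file with read (from ngram.sml),
   group the tokens into (leading words, completion) pairs, and
   insert them all into a tree. *)
fun buildTree(n, fileName) =
  insertList(groupWords(n, read fileName));
```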
10. (20 Points) Define a function buildList(n,
tree) that takes an integer n and an ngram tree as parameters and that returns
a list of approximately n words that are randomly generated using the given
tree:
val
buildList = fn : int * ngramTree -> string list
The function will have
to pick a random set of (n – 1) words to use as the starting words. Using these words, it should use randomWord
to pick a random word for the next word.
Then it should take the most recent (n – 1) words and again pick a
random completion word based on those words.
Then it should again use the most recent (n – 1) words to pick a random
completion word. And so on.
The list of words that
you generate should end with a word that could be considered the last word of a
sentence (a word ending with a period, question mark or exclamation mark). You will find the function sentenceEnd that
is included in ngram.sml helpful to accomplish this task. Your function should randomly generate at
least n words, but should keep generating words until it has generated one that
can end a sentence. In other words, it
should be as short as it can be so that it both ends with a word that can end a
sentence and has at least n words.
There is one special
case you are required to handle. It will
occasionally happen that the most recent (n – 1) words selected are not in the
ngram tree. This happens when you end up
using the last (n – 1) words of the original text. When that occurs, append the (n – 1) words to
the end of your answer and if that meets your quota of words to produce, then
you can end. Don’t worry about whether
or not the last word can end a sentence because if these words were good enough
to end the original text, then they’re good enough to end our random text. If picking those words does not meet your
quota, then after including those words, pick another random starting point by
choosing a set of (n – 1) words that begin with a capitalized word.
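One possible skeleton, assuming working versions of the earlier functions and sentenceEnd from ngram.sml. In this sketch the answer is accumulated in reverse, and the current (n – 1)-word context is always already part of the answer, so the “append the (n – 1) words” step of the special case is implicit:

```sml
(* Sketch of buildList.  generated holds the answer so far in
   reverse; recent is the current (n - 1)-word context, which is
   already included in generated. *)
fun buildList(n, tree) =
  let
    val start = pickRandom tree
    fun grow(recent, generated) =
      if length generated >= n andalso sentenceEnd(hd generated)
      then rev generated
      else
        (case find(recent, tree) of
             SOME ng =>
               let val w = randomWord ng
               in grow(tl recent @ [w], w :: generated) end
           | NONE =>
               (* Stuck at the tail of the original text: the context
                  words are already at the end of the answer, so stop
                  if the quota is met, otherwise restart from a fresh
                  capitalized context (also added to the answer). *)
               if length generated >= n then rev generated
               else
                 let val fresh = pickRandom tree
                 in grow(fresh, List.revAppend(fresh, generated)) end)
  in
    grow(start, rev start)
  end;
```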
Once you have these
functions done, you will probably want to construct a tree of ngrams for a
particular text, then generate a random list of words and then call the
printList function from ngram.sml to display the results, as in:
- val t =
buildTree(3, "hamlet.txt");
- val list
= buildList(100, t);
- printList(list);
Horatio, and Marcellus. Fran. I
think nothing, my lord. Ham. Arm'd, say
you? Both.
Arm'd, my lord. Ham. I will be brief. Your noble son is mad. Mad
call I it;
for, to define true madness, What is't but to one table. That's
the end.
King. Alas, alas! Ham. A goodly one; in which there are no truant.
But what we do determine oft we
break. Purpose is but a shadow. Ros. Truly,
and I
behind an arras then. Mark the encounter. If he but blench,
I know
not. Is it
not monstrous that this folly douts it. Exit. King.
val it =
() : unit
-
As question 10
indicates, you have to deal with certain unusual cases where the generator gets
stuck. But in general, for these
questions you can assume that we are dealing with a reasonably large input file
with a variety of capitalized words and words that can end a sentence. We’ll also assume that the various integers
passed as parameters make sense. If
someone asks for the 0-grams of a file, they deserve whatever strange result
they get.