Handout #5
CSE341—Programming Languages
ML Programming
Assignment #4
due 11 pm Thursday, 4/29/10
In this
assignment you will write functions that analyze text to convert it into a
series of n-grams and that use those n-grams to generate random text that has
the same n-gram properties as the original.
As the name implies, an n-gram is something that depends on a certain
choice of n. So we can have 2-grams,
3-grams, 4-grams, and so on. The idea is
to count the occurrences of word sequences of length n in a text. For example, if you were to look at the
2-grams of Hamlet, you’d find that
the words “To be” occur 6 times in the text. So we would say that one of the 2-grams of
the text is (“To”, “be”, 6). It turns
out that none of these six occurrences is part of the famous soliloquy, because
that one has a comma after “be,”. The decision about what characters to include
as part of the word is rather
arbitrary, but for our purposes, we will include the comma, which
leads to another 2-gram with an occurrence of 1: (“To”, “be,”, 1). In other words, there is only one occurrence
of that pair of tokens in the input file.
We will once
again use the utility functions defined in utility.sml. You will also need a file called ngram.sml
that has supporting functions for this assignment. You should include all of your answers in a file called hw4.sml.
It would seem
natural to store an n-gram as a list of n words along with an integer
count. Unfortunately, that won’t be a
useful data structure for us when we want to generate random text. Instead, we’re going to use the first (n – 1)
words as a key that will be used to look up an entry in a map structure. In the remainder of this specification, the
term “leading words” will refer to a set of (n – 1) words being used as a key
and the term “completion word” will refer to one of the possible completions
for those (n – 1) words. Each
combination of a set of leading words with a completion word will represent a
single n-gram.
We will
implement it in such a way that we can quickly look up the entry for any set of
leading words. For each set of leading
words, we’ll keep a list of word/count pairs (one for each completion
word). So each one of
our entries can represent several different n-grams. For example, suppose that we were computing
the 3-grams of Hamlet. Then the entries
in our map involve keys of two words.
Consider the two-word list [“do”, “you”]. There are 7 different 3-grams that begin with
those two words (in other words, there are 7 different completions). Below is the list of the 7 completions with
their frequencies:
[("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
("read,",1)]
Using this
list, we know that the sequence [“do”, “you”, “think”] occurs 3 times in the
text, the sequence [“do”, “you”, “speak”] occurs 1 time, the sequence [“do”,
“you”, “look?”] occurs 1 time, and so on.
The file
ngram.sml includes the following type definition for an ngram:
datatype ngram = Ngram of string list * int * (string * int) list;
Each
ngram has three components: a list of leading words, a total count for this set
of leading words, and a list of word/count frequencies for all completion words
that follow these leading words in the original text. For example, the ngram for [“do”, “you”]
looks like this:
Ngram
(["do","you"],9,
[("think",3),("speak",1),("look?",1),("go",1),("call",1),("mark",1),
("read,",1)]) : ngram
The number 9
is the sum of all of the frequencies in the word/count list. Here is another example from Hamlet using the
words [“It”, “is”]:
Ngram
(["It","is"],19,
[("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
("most",1),("offended.",1)]) : ngram
The file
ngram.sml includes a
function called getWords that can be used to get the leading words from an
ngram.
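The handout doesn’t show the body of getWords; a plausible one-line definition (an assumption on our part, since the real version ships in ngram.sml) is a pattern match on the datatype:

```sml
(* Hypothetical sketch of getWords; the actual definition is in ngram.sml. *)
fun getWords (Ngram (words, _, _)) = words;
```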
As mentioned
earlier, we want to have fast lookup for any given set of leading words. We will accomplish that by storing the ngrams
in a binary search tree. The file
ngram.sml has the following definition for the tree:
datatype ngramTree = Empty | Node of ngram * ngramTree * ngramTree;
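As a concrete illustration (built from the two Hamlet ngrams shown above, and ordered by character-code string comparison, under which “It” precedes “do” because uppercase letters come before lowercase ones), a small legal tree could look like:

```sml
(* A two-node ngram tree; ["do","you"] sits in the right subtree of
   ["It","is"] because "It" < "do" under character-code comparison. *)
val exampleTree =
    Node (Ngram (["It", "is"], 19,
                 [("a", 4), ("here,", 1), ("the", 2), ("but", 1),
                  ("indifferent", 1), ("our", 1), ("not", 2), ("back'd", 1),
                  ("as", 1), ("an", 1), ("'Adieu,", 1), ("not,", 1),
                  ("most", 1), ("offended.", 1)]),
          Empty,
          Node (Ngram (["do", "you"], 9,
                       [("think", 3), ("speak", 1), ("look?", 1), ("go", 1),
                        ("call", 1), ("mark", 1), ("read,", 1)]),
                Empty, Empty));
```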
In all of the
problems that follow we will assume that we are working with legal ngrams and
legal ngram trees. In particular, we
will assume that we are working with “2 or more”-grams, that the overall
frequency included in an ngram is equal to the sum of the individual
frequencies of the various completion words, that no completion words have a
frequency of 0, that the list of completion words is nonempty, that the ngram
tree is a proper binary search tree based on the leading word lists, and that
the ngram tree is nonempty. We will also
assume that data passed to our functions are consistent in terms of their ngram
length. For example, we will assume that
our functions won’t be asked to include both a 2-gram and a 3-gram in the same
ngram tree. We will make one additional
assumption that the ngram tree contains at least one ngram that begins with a
capitalized word. The reason will become
obvious when you work on problem 7.
You are not
allowed to solve these problems using arrays, references or vectors but
otherwise you are allowed to use the full range of ML constructs and you can
call functions from the basis library.
You might find the libraries List and Char particularly helpful.
1.
(5 points) Define a function
stringListCompare(list1, list2) that takes two string lists as arguments and
that returns a value of type order indicating how the lists compare:
val stringListCompare = fn : string list * string list -> order
Type order is
described on page 325 of the textbook.
Your function should compare the lists to find the first pair of words
that differ. If there is no such pair,
your function should return EQUAL. If there
is a pair that differs, your function should return LESS if the word from list1
is less than the word from list2 and should return GREATER if the word from
list1 is greater than the word from list2.
You may assume the lists are of equal length, although it’s not a bad
idea to raise an exception if you find yourself running out of data in one list
but not the other.
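The comparison described above is lexicographic over the word lists. Some sample invocations with the results the spec implies (using character-code string ordering, where uppercase letters precede lowercase ones):

```sml
(* Expected behavior implied by the spec (sample calls, not a solution): *)
stringListCompare (["do", "you"], ["do", "you"]);   (* EQUAL *)
stringListCompare (["do", "you"], ["do", "your"]);  (* LESS: "you" < "your" *)
stringListCompare (["It", "is"], ["do", "you"]);    (* LESS: "It" < "do" *)
```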
2.
(10 points) Define a function
addTo(ng, w) that takes an ngram and a completion word (a string) as arguments
and that returns a new ngram that includes the given word in the list of
completions:
val addTo = fn : ngram * string -> ngram
If the word
appears in the list of completions in the original ngram, then the new ngram
should record that its frequency is 1 more than in the old ngram. If the word does not appear in the original
ngram, it should be added to the list with a frequency of 1.
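Given the invariant that the total count equals the sum of the completion frequencies, the spec implies results like the following (illustrative values only; the order of pairs in the resulting list is not specified):

```sml
(* Bumping an existing completion: count 3 -> 4, total 4 -> 5. *)
addTo (Ngram (["do", "you"], 4, [("think", 3), ("speak", 1)]), "think");
(* => Ngram (["do","you"], 5, [("think",4), ("speak",1)]) *)

(* Adding a new completion with frequency 1: total 4 -> 5. *)
addTo (Ngram (["do", "you"], 4, [("think", 3), ("speak", 1)]), "go");
(* => Ngram (["do","you"], 5, [("think",3), ("speak",1), ("go",1)]) *)
```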
3.
(10 points) Define a function
insert(words, word, tree) that takes a list of leading words (strings), a
completion word (string) and an ngram tree and that returns the tree obtained
by inserting the leading words/completion word combination into the tree:
val insert = fn : string list * string * ngramTree -> ngramTree
If the set of
leading words is already included in the tree, then the completion word should
be added to that ngram. If not, then a
new ngram should be added to the tree with a frequency of 1. The stringListCompare function should be used
to order the tree.
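For instance, the spec implies the following behavior for the base case (an illustration, not a solution):

```sml
(* Inserting into an empty tree creates a fresh ngram with count 1: *)
insert (["do", "you"], "think", Empty);
(* => Node (Ngram (["do","you"], 1, [("think",1)]), Empty, Empty) *)
```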
4.
(5 points) Define a function insertList(list)
that takes a list of (leading words, completion word) tuples (string list *
string tuples) and that returns an ngram tree built by inserting each
combination into an initially empty ngram tree (the combinations can be added
in any order):
val insertList = fn : (string list * string) list -> ngramTree
5.
(10 points) Define a function find(words,
tree) that takes a list of leading words (strings) and an ngram tree as
parameters and that returns an ngram option that is either NONE (words not
found) or is the ngram associated with that set of leading words in the given
search tree:
val find = fn : string list * ngramTree -> ngram option
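Sample invocations with the results the spec implies (illustrative only):

```sml
(* Present vs. absent leading words: *)
find (["do", "you"], insert (["do", "you"], "think", Empty));
(* => SOME (Ngram (["do","you"], 1, [("think",1)])) *)
find (["To", "be"], insert (["do", "you"], "think", Empty));
(* => NONE *)
```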
6.
(10 points) Define a function
groupWords(n, list) that takes an integer n and a list of words (strings) and
that returns a list of all (leading words, completion word) tuples from the
list:
val groupWords = fn : int * 'a list -> ('a list * 'a) list
The function
should use the first n words to form a tuple, then the n words starting with
the second word, then the n words starting with the third word, and so on until
it finds there aren’t n words left to form a tuple. For example, groupWords(3, ["Twas",
"brillig", "and", "the", "slithy",
"toves"]) should return [(["Twas", "brillig"],
"and"), (["brillig", "and"], "the"),
(["and", "the"], "slithy"), (["the",
"slithy"], "toves")].
Notice that if the list is of length m, then the function returns (m – n
+ 1) tuples. You may assume that n is at
least 2. If the list does not have at
least n words, your function should return an empty list.
7.
(15 points) Define a function
pickRandom(tree) that takes a legal ngram tree as an argument and that randomly
picks a set of leading words where the first word begins with a capital letter:
val pickRandom = fn : ngramTree -> string list
Only some of
the ngrams in the tree will begin with capitalized words. Your function must pick randomly among them
with each possibility having equal probability of being chosen. You might want to call the function randomInt() that is defined in utility.sml and that
returns a random int value. Because this
function has to pick from the ngrams that begin with capitalized words, it is
allowed to run in O(n) time where n is the size of the
ngram tree. Remember that you can assume
that the tree will contain at least one ngram that begins with a capitalized
word.
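One building block you will likely need is a test for whether an ngram’s first leading word is capitalized; a sketch using the Char library mentioned above (the helper name is ours, not part of ngram.sml):

```sml
(* Hypothetical helper: does this ngram's first leading word start
   with a capital letter? *)
fun startsCapitalized (Ngram (first :: _, _, _)) =
        Char.isUpper (String.sub (first, 0))
  | startsCapitalized _ = false;
```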
8.
(10 points) Define a function
randomWord(ng) that takes a legal ngram as an argument and that picks a random
completion word from the list of completion words, preserving the statistical
frequency of the word:
val randomWord = fn : ngram -> string
For example, suppose
that we pass the function the ngram mentioned in the introduction:
Ngram
(["It","is"],19,
[("a",4),("here,",1),("the",2),("but",1),("indifferent",1),("our",1),
("not",2),("back'd",1),("as",1),("an",1),("'Adieu,",1),("not,",1),
("most",1),("offended.",1)]) : ngram
This set of 2
leading words has a total frequency of 19 with various completion words. Your function should return the word “a” with
a frequency of 4/19 (four nineteenths of the time, which is around 21% of the
time). The word “here,” should be
returned 1/19 of the time. The word
“the” should be returned 2/19 of the time.
And so on. By picking the words
in this manner, we will ensure that the text that we build up will have
approximately the same ngram statistics as the original text.
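One common way to honor these frequencies (a sketch of the standard weighted-selection technique, not the required implementation; the helper name is ours) is to draw a number below the total count and walk the completion list, consuming each frequency until the number is exhausted:

```sml
(* Weighted pick: r is assumed to be in [0, total), e.g. obtained from
   randomInt() mod total.  Walk the list, subtracting counts. *)
fun weightedPick (r, (w, c) :: rest) =
        if r < c then w else weightedPick (r - c, rest)
  | weightedPick (_, []) = raise Fail "empty completion list";
(* weightedPick (0, [("a",4), ("the",2)]) returns "a";
   weightedPick (5, [("a",4), ("the",2)]) returns "the". *)
```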
9.
(5 points) Define a function buildTree(n,
fileName) that takes an integer n and file name (a string) as arguments and
that returns an ngram tree constructed using the words from the given input
file with ngrams of length n (i.e., having (n – 1) leading words and a set of
completion words):
val buildTree = fn : int * string -> ngramTree
You can call
the function read included in ngram.sml to convert the file into a list of
tokens. You may assume that the given
file exists and is readable and that n is at least 2.
10. (20 points) Define a function buildList(count,
tree) that takes an integer count and an ngram tree as parameters and that
returns a list of approximately count words that are randomly generated using
the given tree:
val buildList = fn : int * ngramTree -> string list
Below
is a pseudocode description of how the list should be constructed:
Use pickRandom to pick a random set of leading words that begins with a
capitalized word.
Include the first word from the current set of leading words in the result.
while (not done) {
    “Shift” the leading words by dropping the first word and adding a
    completion word obtained by calling randomWord on the current set of
    leading words.
    Include the first word from the current set of leading words in the result.
}
The list of
words that you generate should end with a word that could be considered the
last word of a sentence (a word ending with a period, question mark or
exclamation mark). You will find the
function sentenceEnd that is included in ngram.sml helpful to accomplish this
task. It’s possible that you will be
able to generate exactly count words, but more often you will have to generate
extra words so that you can also satisfy the constraint that you end with a
word that can end a sentence. Your
function should stop generating words once the two constraints have been
satisfied (at least count words, ends with a word that can end a sentence).
For example, suppose that you
were working with 3-grams generated from Tom
Sawyer and you were asked to build a list that has two words. The sequence of choices might go like this:
Action          | Current window       | Current result
call pickRandom | ["But", "all"]       | ["But"]
shift           | ["all", "things"]    | ["But", "all"]
shift           | ["things", "else."]  | ["But", "all", "things"]
shift           | ["else.", "At"]      | ["But", "all", "things", "else."]
In this case,
we had to pick four words before we could stop because it took that long to find
a word that can end a sentence. Notice
that we always have a two-word window.
As we proceed, we move the front of this list to the result and pick a
new word to go at the end of the window.
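The “shift” step itself is small; a sketch (the helper name is ours, and it assumes the ngram for the current window has already been looked up in the tree):

```sml
(* Hypothetical shift: drop the first word of the window and append a
   freshly chosen completion word for that window's ngram. *)
fun shift (window, ng) = tl window @ [randomWord ng];
```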
There is one
special case you are required to handle.
It will occasionally happen that the most recent set of leading words
selected is not in the ngram tree. This
happens when you end up using the last (n – 1) words of the original text. When that occurs, you should add each of
these words one at a time to the list until you either run out of words or
satisfy the two ending constraints. For
our purposes, we will assume that the last word of the file can end a sentence
even if it doesn’t end with a period, question mark or exclamation mark (if
these words were good enough to end the original text, then they’re good enough
to end our random text). If picking
those words does not meet your quota (i.e., leaves you with fewer than count
words), then after including those words, pick another random starting point by
choosing a set of leading words that begins with a capitalized word and continue
the process from there.
Once you have
these functions done, you will probably want to construct a tree of ngrams for
a particular text, then generate a random list of words and then call the
printList function from ngram.sml to display the results, as in:
- val t = buildTree(3, "hamlet.txt");
- val list = buildList(100, t);
- printList(list);
Horatio, and Marcellus. Fran. I think nothing, my lord. Ham. Arm'd, say
you? Both. Arm'd, my lord. Ham. I will be brief. Your noble son is mad. Mad
call I it; for, to define true madness, What is't but to one table. That's
the end. King. Alas, alas! Ham. A goodly one; in which there are no truant.
But what we do determine oft we break. Purpose is but a shadow. Ros. Truly,
and I behind an arras then. Mark the encounter. If he but blench, I know
not. Is it not monstrous that this folly douts it. Exit. King.
val it = () : unit
-