CSE143 Notes 4/28/06

Grammars, maps, and a little randomness

Many parts of computer science involve the study of languages of various kinds. Obvious examples are programs that attempt to understand or translate natural languages, but even more common in a computer user's day-to-day experience is dealing with artificial languages, including programming languages like Java. A goal in analyzing languages, either natural or artificial, is to understand their syntax and semantics. Syntax deals with the structure of the language while semantics is concerned with the meaning of syntactically correct phrases. A long-standing goal in computer science has been to automate language processing. It turns out that semantics is a hard problem, but syntax is fairly well understood, at least for artificial languages like programming languages.

A useful way of dealing with syntax is to describe grammars using "production systems". The idea is to give the rules of a grammar as a series of productions that describe how different kinds of phrases are formed. There are various versions of these systems in linguistics and computer science; the version we'll use is a standard one in computer science known as Backus-Naur form (BNF).

As an example, we could describe the grammar for very simple English sentences with the following rules:

   <sentence> ::= <noun> <verb>
   <noun> ::= he | she
   <verb> ::= runs | sleeps

The first rule can be read as "a <sentence> consists of a <noun> followed by a <verb>". Each production or rule consists of a name to the left of the ::= mark and a sequence of names and words to the right. The names to the left of ::= are grammar variables or placeholders, called nonterminals. Traditionally nonterminals are indicated by a special font or, as we've done here, by surrounding them with angle brackets. The sequence of names and words in the right side of a rule can contain nonterminals, i.e., grammar variables, or terminals, which are words that represent themselves and are not placeholders for other words or phrases. A rule specifies ways in which the nonterminal to the left of ::= can be replaced, or rewritten, by substituting the sequence of names and words on the right. The right-hand side of a rule can contain a single set of names and words, or it can contain alternative sets separated by vertical bars. For example, the second rule says that "a <noun> consists of the word 'he' or the word 'she' ".

We can use the rules in the example to generate a few sentences. We'll start with the first rule, which says that a <sentence> is a <noun> followed by a <verb>. This is a sequence of nonterminals, so we can expand them by using the rules for those nonterminals. For <noun> we can choose the terminal "he" or "she". Similarly, for <verb> we can choose "runs" or "sleeps". So this simple grammar can generate four possible sentences: "he runs", "she sleeps", "he sleeps", and "she runs".
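To check the count of four, here is a small Java sketch that enumerates every sentence of this grammar by pairing each <noun> choice with each <verb> choice. The class name TinyGrammar and its members are made up for illustration; they are not part of the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: the two arrays mirror the <noun> and <verb> rules above.
class TinyGrammar {
    static final String[] NOUNS = {"he", "she"};
    static final String[] VERBS = {"runs", "sleeps"};

    // Builds every <noun> <verb> combination the grammar can produce.
    static List<String> allSentences() {
        List<String> result = new ArrayList<String>();
        for (String noun : NOUNS) {
            for (String verb : VERBS) {
                result.add(noun + " " + verb);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two nouns times two verbs gives four sentences.
        for (String sentence : allSentences()) {
            System.out.println(sentence);
        }
    }
}
```

Enumerating like this only works because the grammar is tiny and has no recursion; a recursive grammar can generate an unlimited number of sentences.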

The real power of a BNF production system is that the rules can refer to other rules, including recursive references to themselves. For example, we might have a sequence of rules to specify a noun phrase like this:

   <noun phrase> ::= <article> <adjectives> <noun>
   <article> ::= the | a | an
   <adjectives> ::= <adjective> <adjectives> | <adjective>
   <adjective> ::= big | small | hungry | blue | smart
   <noun> ::= dog | cat

We can use these rules to generate any number of noun phrases. Here is one possible sequence:

   <noun phrase> ::= <article> <adjectives> <noun>
                 ::= the <adjectives> <noun>
                 ::= the <adjective> <adjectives> <noun>
                 ::= the hungry <adjectives> <noun>
                 ::= the hungry <adjective> <noun>
                 ::= the hungry big <noun>
                 ::= the hungry big dog

For the next project, we will give you a main program that reads a file consisting of a sequence of grammar productions. Given a nonterminal, your program's job is to generate a sentence from it by randomly choosing one of the possible rules to expand that nonterminal. If the expansion contains one or more nonterminals, pick a nonterminal and randomly choose one of its possible expansions. Repeat the process until there are no more nonterminals in the expansion, which means we have a sentence generated by the grammar.
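The expansion process can be sketched recursively in Java. This is only one possible approach under stated assumptions, not the required design for the project: the grammar is hard-coded rather than read from a file, each rule is stored as an array of alternatives, and the names RandomSentence, sampleGrammar, and generate are hypothetical. It relies on java.util.Map and java.util.Random, both of which are discussed later in these notes.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical sketch: expands a symbol by always rewriting nonterminals left to right.
class RandomSentence {
    // Assumed storage: nonterminal -> alternatives, each alternative an array of symbols.
    static Map<String, String[][]> sampleGrammar() {
        Map<String, String[][]> grammar = new TreeMap<String, String[][]>();
        grammar.put("<sentence>", new String[][] {{"<noun>", "<verb>"}});
        grammar.put("<noun>", new String[][] {{"he"}, {"she"}});
        grammar.put("<verb>", new String[][] {{"runs"}, {"sleeps"}});
        return grammar;
    }

    static String generate(Map<String, String[][]> grammar, String symbol, Random rand) {
        // A symbol with no rule is a terminal and stands for itself.
        if (!grammar.containsKey(symbol)) {
            return symbol;
        }
        // Pick one alternative at random, then expand each symbol in it.
        String[][] alternatives = grammar.get(symbol);
        String[] chosen = alternatives[rand.nextInt(alternatives.length)];
        StringBuilder sentence = new StringBuilder();
        for (String part : chosen) {
            if (sentence.length() > 0) {
                sentence.append(' ');
            }
            sentence.append(generate(grammar, part, rand));
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        Random rand = new Random();
        System.out.println(generate(sampleGrammar(), "<sentence>", rand));
    }
}
```

The recursion stops when every symbol reached is a terminal, which matches the "no more nonterminals" condition described above.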

The assignment writeup contains the full details, but there are a couple of things worth mentioning. First, we don't require that a nonterminal be surrounded by brackets (<>). Anything appearing on the left side of a rule should be treated as a nonterminal (grammar variable). Second, to simplify handling of rules, we'll use a single colon (:) instead of ::= to separate the left and right sides of rules. Finally, a rule may contain an arbitrary amount of whitespace (blanks and tabs), but this has no effect on the meaning of the rule.
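As a rough illustration of this file format, the Java sketch below splits one rule line into its left side and its alternatives. It assumes the first colon on the line separates the two sides and that alternatives are still separated by vertical bars; check the assignment writeup for the exact format. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for rule lines such as "adjectives : adjective adjectives | adjective".
class RuleParser {
    // Returns the nonterminal to the left of the first colon, without surrounding whitespace.
    static String leftSide(String line) {
        return line.split(":", 2)[0].trim();
    }

    // Returns the alternatives on the right side, each broken into its individual symbols.
    static List<String[]> alternatives(String line) {
        String rightSide = line.split(":", 2)[1];
        List<String[]> result = new ArrayList<String[]>();
        for (String alternative : rightSide.split("\\|")) {
            // "\\s+" treats any run of blanks and tabs as a single separator.
            result.add(alternative.trim().split("\\s+"));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "  adjectives :  adjective   adjectives |\tadjective ";
        System.out.println(leftSide(line));
        for (String[] symbols : alternatives(line)) {
            System.out.println(String.join(" ", symbols));
        }
    }
}
```

Trimming each piece and splitting on runs of whitespace is what makes the extra blanks and tabs irrelevant to the meaning of the rule.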

Maps

In order to generate sentences, we need some way to store the grammar rules so we can look up the production for a nonterminal that has not yet been expanded. We could, of course, use a list of strings or something like that, but there are better data structures for the job - maps.

The idea behind a map, sometimes called a dictionary, is that it is a data structure that stores <key, value> pairs of objects. These objects can be anything, but typically, the key is fairly simple and the value is information associated with the key. An example would be an address book - the keys might be people's names and the value would be information about them - their address, phone number, email and IM addresses, and so forth. For the grammar application, the keys will be the nonterminals and the values will be the right-hand sides of the rules associated with them.

The two basic operations provided by a map are:

put(key,value) - store value in the map with the associated key. If the key is already present in the map, replace the value associated with it.
get(key) - return the value currently associated with the key.

In addition, maps usually provide methods to determine if a particular key is present in the map, to return the number of <key,value> pairs in the map, and to return collections of all the keys or all the values in the map.
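In Java these operations are available through the java.util.Map interface. The sketch below uses the address book idea with made-up names and addresses; the class name AddressBookDemo is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo of the basic map operations on a tiny address book.
class AddressBookDemo {
    static Map<String, String> build() {
        Map<String, String> book = new TreeMap<String, String>();
        book.put("Ada", "ada@example.com");
        book.put("Grace", "grace@example.com");
        // Same key again: the old value for "Ada" is replaced.
        book.put("Ada", "ada@example.org");
        return book;
    }

    public static void main(String[] args) {
        Map<String, String> book = build();
        System.out.println(book.get("Ada"));           // ada@example.org
        System.out.println(book.get("Nobody"));        // null, since the key is absent
        System.out.println(book.containsKey("Grace")); // true
        System.out.println(book.size());               // 2
        System.out.println(book.keySet());             // all the keys
        System.out.println(book.values());             // all the values
    }
}
```

Note that get returns null for a missing key, so containsKey is the safer way to ask whether a key is present.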

Java provides several kinds of map interfaces and classes in the java.util package, and we'll want to take advantage of these. All of the Java map types, as of Java 5, use generics to specify the types of the keys and the types of the values. For our project, these will probably be Strings, but they could be anything.

For the project, you should use the SortedMap interface and TreeMap implementation. The advantage of this particular kind of map is that the keys are stored in the map in sorted order, which you'll find to be very useful in one part of the project.

We can create a new map to hold string keys and values as follows:

   SortedMap<String,String> map = new TreeMap<String,String>();

Look at the Java library docs for SortedMap and TreeMap to discover which methods are available and how to use them.
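As a starting point, the sketch below shows the sorted-key behavior of a TreeMap. The rule strings and the names SortedKeysDemo and sample are placeholders for illustration, not the storage format the project requires.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical demo: keys come back in sorted order regardless of insertion order.
class SortedKeysDemo {
    static SortedMap<String, String> sample() {
        SortedMap<String, String> rules = new TreeMap<String, String>();
        rules.put("noun", "dog | cat");
        rules.put("article", "the | a | an");
        rules.put("adjective", "big | small");
        return rules;
    }

    public static void main(String[] args) {
        SortedMap<String, String> rules = sample();
        // Iterating over keySet() visits the keys in ascending String order.
        for (String key : rules.keySet()) {
            System.out.println(key + " : " + rules.get(key));
        }
        System.out.println(rules.firstKey()); // adjective
        System.out.println(rules.lastKey());  // noun
    }
}
```

The keys were inserted as noun, article, adjective, but the loop visits them as adjective, article, noun.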

Random

Another piece of the puzzle is processing a grammar rule that has more than one possible choice on its right side (for instance, <noun> ::= cat | dog | ferret). You are supposed to randomly pick one of the choices, which means somehow generating a random number and using that number to pick an alternative. You can use the Java class Random for this. First, create a single instance:

   Random rand = new Random();

The rand object supplies methods that return a random value (integer, double, others) each time they are called. In particular, look up the details of the nextInt methods to find something useful for making random choices when expanding grammar rules.
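One way to use it, sketched under the assumption that the alternatives are already in an array: nextInt(n) returns an int from 0 up to but not including n, which is exactly the range of valid array indexes. The class and method names here are hypothetical.

```java
import java.util.Random;

// Hypothetical helper: picks one element of an array uniformly at random.
class RandomChoice {
    static String pick(String[] choices, Random rand) {
        // nextInt(n) yields a value i with 0 <= i < n, so it is always a valid index.
        int index = rand.nextInt(choices.length);
        return choices[index];
    }

    public static void main(String[] args) {
        // Create the Random object once and reuse it for every choice.
        Random rand = new Random();
        String[] nouns = {"cat", "dog", "ferret"};
        for (int i = 0; i < 5; i++) {
            System.out.println(pick(nouns, rand));
        }
    }
}
```

Passing the same Random object to every call, rather than creating a new one each time, follows the "single instance" advice above.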