CSE143 Notes for Friday, 2/3/06

I began by mentioning that with the next programming assignment, we are asking you to start making more use of the built-in collections classes. I reminded people that the collections classes are all in the java.util package. In particular, for this next programming assignment we are going to use a collection known as a SortedMap.

As an example, I asked people how we could write a program that would count all of the occurrences of various words in an input file. Someone mentioned having a Scanner object that is tied to a file and making a series of calls on the method called next to read individual words:

        set up a Scanner called input
        while (input.hasNext()) {
            String next = input.next();
            process next
        }

I pointed out that we'll have trouble with things like capitalization and punctuation. I said that we should at least turn the string to all lowercase letters so that we don't count Strings like "The" and "the" as different words:

        set up a Scanner called input
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            process next
        }

But I said that dealing with punctuation was more than I wanted to attempt in this program, so I decided that we'd live with the fact that Strings like "the" and "the," and "the." would be considered different words. We're looking for a fairly simple example here, so I didn't want to worry too much about punctuation.

The pseudocode above says to "process next", but what does that mean? Someone suggested that we'd need a counter. That's true, but we actually need a lot of counters. We're not just counting the occurrences of a single word like "the", we're trying to count the occurrences of all of the words of the file.

Then someone suggested that maybe we could have an array of counters, one for each word. Someone else suggested a linked list of counters. But how would we know which counter goes with which word? Someone else suggested an object that would group a word and counter together, which I said is a good idea, but it still requires you to search through the list to find something, which can be very slow. There is a better way to do this.

This is where the SortedMap comes in. The idea behind a map is that it keeps track of key/value pairs. In our case, we want to keep track of word/count pairs (what is the count for each different word). There is an interface called SortedMap that is a generic interface. That means that we have to supply type information. Its formal description is SortedMap<K, V>. This is different from the Queue interface in that it has two different types. That's because the map has to know what type of keys you have and what type of values you have. In our case, we have some words (Strings) that we want to associate with some counters (ints). We can't actually use type int because it is a primitive type, but we can use type Integer.

So our map would be of type SortedMap<String, Integer>. In other words, it's a map that keeps track of String/Integer pairs (this String goes to this Integer). SortedMap is the name of the interface, but it's not an actual implementation. The implementation we will use is TreeMap. So we can construct a map called "count" to keep track of our counts by saying:

        SortedMap<String, Integer> count = new TreeMap<String, Integer>();

There are only a few methods that we'll be using from the SortedMap interface. The most basic allow you to put something into the map (an operation called put) and to ask the map for the current value of something (an operation called get). The other method we'll use in the word counting task is containsKey, which can be used to ask the map whether it contains a certain key.

So combining this map declaration with our pseudocode we get:

        set up a Scanner called input
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            process next
        }

Now we can flesh out what it means to "process next". There are two possibilities. Either we're seeing next for the first time or we've already seen it before. We can use the containsKey method to distinguish between the two cases. If it's the first time we've seen this word, then we want to put it into our map with a count of 1:

        set up a Scanner called input
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);
            } else {
                do whatever is appropriate for a word we've seen before
            }
        }

So what do we do with a word we've seen before? We want to increase its count by 1. That means we have to get the old value of the count and add 1 to it:

        count.get(next) + 1

and make this the new value of the counter:

        count.put(next, count.get(next) + 1);

This is the line of code we want to use in our pseudocode:

        set up a Scanner called input
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);
            } else {
                count.put(next, count.get(next) + 1);
            }
        }
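For reference, here is one way the pseudocode above might be filled in as a complete program. This is only a sketch: the class name WordCountSketch and the method name countWords are made up for illustration, and a String stands in for the input file so that the example runs on its own.

```java
import java.util.Scanner;
import java.util.SortedMap;
import java.util.TreeMap;

class WordCountSketch {
    // Counts the occurrences of each whitespace-separated token in the input.
    // Tokens are lowercased; punctuation is left attached, as discussed above.
    static SortedMap<String, Integer> countWords(Scanner input) {
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);                    // first time we see this word
            } else {
                count.put(next, count.get(next) + 1);  // seen before: bump the count
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumption: a String stands in for the file the Scanner would read.
        Scanner input = new Scanner("The cat saw the dog. The dog ran.");
        System.out.println(countWords(input));
    }
}
```

Running this prints {cat=1, dog=1, dog.=1, ran.=1, saw=1, the=3}. Notice that "dog" and "dog." are counted separately, which is the punctuation problem we decided to live with, and that the TreeMap keeps the words in sorted order.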

I pointed out that it's awkward to use the calls on get and put, but basically this is a lot like what we do with arrays. For example, the if/else above is a lot like the following array-like code:

        if (!count.containsKey(next)) {
            count[next] = 1;
        } else {
            count[next] = count[next] + 1;
        }

In the programming language C# you can write it essentially this way because C# lets a class give its own definition for the square brackets (a feature C# calls an "indexer", which is similar in spirit to operator overloading). In fact, in C# this could be written as:

        if (!count.containsKey(next)) {
            count[next] = 1;
        } else {
            count[next]++;
        }

The folks at Sun don't like operator overloading, so we're stuck with the more tedious calls on get and put, but this isn't a bad way of thinking about what is going on here.

The first time we see a word, we call the put method and say that the map should associate the word with a count of 1. Later we call put again with a higher count. And we keep calling put every time the count goes up. What happens to the old values that we had put in the map previously? The way the map works, each key is associated with only one value. So when you call put a second or third time, you are wiping out the old association. The new key/value pair replaces the old key/value pair in the map.
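A tiny experiment makes the replacement behavior concrete. This is just an illustrative sketch (the class name PutReplacesSketch and the method name demo are made up); it calls put three times with the same key and then looks at what is left in the map.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class PutReplacesSketch {
    // Calls put three times with the same key and returns the resulting map.
    static SortedMap<String, Integer> demo() {
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        count.put("the", 1);
        count.put("the", 2);  // replaces the association the -> 1
        count.put("the", 3);  // replaces the association the -> 2
        return count;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> count = demo();
        System.out.println(count.size());      // prints 1: still only one key
        System.out.println(count.get("the"));  // prints 3: the most recent value
    }
}
```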

I only briefly mentioned other parts of the sample program. I mentioned that it is doing some fancy things with a SortedMap.Entry that represents an actual key/value pair. This later code also has an example of the "enhanced for loop", also known as the "foreach loop", which was added to Java starting with version 1.5. It's not important to understand these other manipulations of the map. The most important things to understand are:

how to construct an empty map with a variable of type SortedMap and an object of type TreeMap
how to call the put method to establish a key/value pair
how to call the get method to look up a value given its key
(this was not included in the sample program, but is described in the assignment writeup) how to use a method called keySet to get a reference to the set of keys in the map (see the assignment writeup for details)
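As a rough illustration of the last item, here are two ways one might walk over a map with a foreach loop: by asking for the keySet and calling get on each key, or by asking for the entries directly. This is not the sample program itself; the class and method names are invented, and the entry type is written as Map.Entry, which is where the Entry interface is declared.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class MapTraversalSketch {
    // Builds "word count" lines by looping over the keys and calling get.
    static List<String> viaKeySet(SortedMap<String, Integer> count) {
        List<String> lines = new ArrayList<String>();
        for (String word : count.keySet()) {
            lines.add(word + " " + count.get(word));
        }
        return lines;
    }

    // Builds the same lines by looping over the key/value pairs directly.
    static List<String> viaEntrySet(SortedMap<String, Integer> count) {
        List<String> lines = new ArrayList<String>();
        for (Map.Entry<String, Integer> entry : count.entrySet()) {
            lines.add(entry.getKey() + " " + entry.getValue());
        }
        return lines;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        count.put("the", 3);
        count.put("cat", 1);
        System.out.println(viaKeySet(count));    // prints [cat 1, the 3]
        System.out.println(viaEntrySet(count));  // prints [cat 1, the 3]
    }
}
```

Both loops visit the keys in sorted order because the underlying object is a TreeMap.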

One final point I made about the SortedMap interface is that you can associate just about anything with just about anything. In the word counting program, we associated strings with integers. You could also associate strings with strings. One thing you can't do is to have multiple associations for the same key in a single map. For example, if you decide to associate strings with strings, then any given key string can be associated with just a single value string. But there's no reason that you can't have the value be structured in some way. You can associate strings with arrays or strings with ArrayLists.
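Here is a small sketch of the "structured value" idea, associating each string with a list of strings. The names MultiValueSketch and add are hypothetical, and the words used are just sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class MultiValueSketch {
    // Adds value to the list associated with key, creating the list if needed.
    static void add(SortedMap<String, List<String>> map, String key, String value) {
        if (!map.containsKey(key)) {
            map.put(key, new ArrayList<String>());  // first value for this key
        }
        map.get(key).add(value);
    }

    public static void main(String[] args) {
        SortedMap<String, List<String>> synonyms = new TreeMap<String, List<String>>();
        add(synonyms, "big", "large");
        add(synonyms, "big", "huge");
        add(synonyms, "small", "tiny");
        System.out.println(synonyms);  // prints {big=[large, huge], small=[tiny]}
    }
}
```

Each key still has exactly one value, but that one value is a list that can hold as many strings as we like.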

Then I switched to talking about grammars. We are going to use an approach to describing grammar that is known as a "production system". It is well known to those who study formal linguistics. Computer scientists know a lot about production systems because we design our own languages like Java. This particular style of production is known as BNF (short for Backus-Naur Form). Each production describes the rules for a particular non-terminal symbol. The non-terminal appears first, followed by the symbol "::=", which is usually read as "is composed of". On the right-hand side of the "::=" we have a series of rules separated by the vertical bar character, which we read as "or". The idea is that the non-terminal symbol can be replaced by any of the sequences of symbols appearing between vertical bar characters.

We can describe the basic structure of an English sentence as follows:

        <s> ::= <np> <vp>

We would read this as, "A sentence (<s>) is composed of a noun phrase (<np>) followed by a verb phrase (<vp>)." The symbols <s>, <np> and <vp> are known as "non-terminals" in the grammar. That means that we don't expect them to appear in the actual sentences that we form from the grammar.

Then we "drilled down" a bit into what a noun phrase might look like. I suggested that the simplest form of noun phrase would be a proper noun, which I expressed this way:

        <np> ::= <pn>

So then I asked people for examples of proper nouns and we ended up with this rule:

        <pn> ::= Sam | River Tam | Paris | Stuart | Hannibal | Jean Luc Picard | Bob Barker | Chuck Norris

Notice that the vertical bar character is being used to separate different possibilities. In other words, we're saying that "a proper noun is either Sam or River Tam or Paris or Stuart or Hannibal or ..." These values on the right-hand side are examples of "terminals". In other words, we expect these to be part of the actual sentences that are formed. I pointed out that it is important to realize that the input is "tokenized" using white space. For example, the text "Jean Luc Picard" is broken up into three separate tokens. So it's not a single terminal, it's actually three different terminals.
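To see the tokenizing in action, here is a sketch that breaks the right-hand side of a rule first into choices (at the vertical bars) and then into tokens (at white space). It is only an illustration under my own assumptions, not the program used in lecture; the names TokenizeSketch and alternatives are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenizeSketch {
    // Splits the right-hand side of a rule into its choices, and each
    // choice into its white-space-separated tokens.
    static List<List<String>> alternatives(String rightHandSide) {
        List<List<String>> result = new ArrayList<List<String>>();
        for (String choice : rightHandSide.split("\\|")) {   // "|" must be escaped in a regex
            result.add(Arrays.asList(choice.trim().split("\\s+")));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(alternatives("Sam | River Tam | Jean Luc Picard"));
        // prints [[Sam], [River, Tam], [Jean, Luc, Picard]]
    }
}
```

The third choice comes out as three separate tokens, which is the point made above about "Jean Luc Picard".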

Then I said that there are other kinds of noun phrases than just proper nouns. For example, we might have the word "the" followed by a noun. So I added a new rule to the "noun phrase":

        <np> ::= <pn> | the <n>

Notice how the vertical bar character is used to indicate that a noun phrase is either a proper noun or it's the word "the" followed by a noun. This required the addition of a new rule for nouns and I again asked for suggestions from the audience:

        <n> ::= frog | shotgun | cowboy boots | person | pitchfork | numchucks | illegal aliens

At this point I mentioned the fact that we're going to use a slight variation of BNF notation. To keep things simple, we'll use just a simple colon in place of the "::=" in the rules above. So our four rules became:

        <s>: <np> <vp>
        <np>: <pn> | the <n>
        <pn>: Sam | River Tam | Paris | Stuart | Hannibal | Jean Luc Picard | Bob Barker | Chuck Norris
        <n>: frog | shotgun | cowboy boots | person | pitchfork | numchucks | illegal aliens

I saved this file and ran the program. It read the file and began by saying:

        Available symbols to generate are:
        [<n>, <np>, <pn>, <s>]
        What do you want generated (return to quit)?

I pointed out that we are defining a nonterminal to be any symbol that appears to the left of a colon in one of our productions. The input file has four productions and that is why the program is showing four nonterminals that can be generated by the grammar. I began by asking for it to generate 5 of the "<pn>" nonterminal symbol and got something like this:

        Sam
        Stuart
        Sam
        Sam
        Bob Barker

In this case, it is simply choosing at random among the various choices for a proper noun. Then I asked it for five of the "<np>" nonterminal symbol and got something like this:

        the cowboy boots
        Paris
        the person
        the shotgun
        Bob Barker

In this case, it is randomly choosing between the "proper noun" rule and the other rule that involves "the" and a noun. It is also then filling in the noun or proper noun to form a string of all terminal symbols. I also asked for five of the nonterminal symbol <s> and got something like this:

        River Tam <vp>
        Stuart <vp>
        the shotgun <vp>
        the cowboy boots <vp>
        Sam <vp>

These are not proper sentences because they contain the nonterminal symbol <vp>. That's because we never finished our grammar. We haven't yet defined what a verb phrase looks like. Notice that the program doesn't care about whether or not something is enclosed in the less-than and greater-than characters, as in "<vp>". That's a convention that is often followed in describing grammar, but that's not how our program is distinguishing between terminals and nonterminals. As mentioned earlier, anything that appears to the left of a colon is considered a nonterminal and every other token is considered a terminal.
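The "left of a colon" rule is easy to express with a map. The sketch below is one possible way to do it, not necessarily how the lecture program works: it stores each line under the text before the colon, so a token is a nonterminal exactly when it is a key in the map. The names SymbolSketch, readRules and isNonterminal are invented, the grammar lines are shortened, and every line is assumed to contain a colon.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class SymbolSketch {
    // Maps each nonterminal (the text left of the colon) to the unparsed
    // text of its rule. Assumes every line contains a colon.
    static SortedMap<String, String> readRules(String[] lines) {
        SortedMap<String, String> rules = new TreeMap<String, String>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            rules.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return rules;
    }

    // A token is a nonterminal only if some production defines it.
    static boolean isNonterminal(SortedMap<String, String> rules, String token) {
        return rules.containsKey(token);
    }

    public static void main(String[] args) {
        String[] lines = {
            "<s>: <np> <vp>",
            "<np>: <pn> | the <n>",
            "<pn>: Sam | Paris",
            "<n>: frog | shotgun"
        };
        SortedMap<String, String> rules = readRules(lines);
        System.out.println(rules.keySet());                // prints [<n>, <np>, <pn>, <s>]
        System.out.println(isNonterminal(rules, "<np>"));  // prints true
        System.out.println(isNonterminal(rules, "<vp>"));  // prints false
    }
}
```

Since <vp> never appears to the left of a colon, it is treated as a terminal and printed as-is, which is what happened in the output above.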

Then we added some more elements to our grammar by defining a verb phrase and adding some adjectives:

        <s>: <np> <vp>
        <np>: <pn> | the <adj> <n>
        <pn>: Sam | River Tam | Paris | Stuart | Hannibal | Jean Luc Picard | Bob Barker | Chuck Norris
        <adj>: red | fun | tiny | hungry | slippery | boring | keen | slothlike
        <n>: frog | shotgun | cowboy boots | person | pitchfork | numchucks | illegal aliens
        <vp>: <iv> <np> | <tv>
        <tv>: died | ran | fired | swooned | spooned | fled
        <iv>: birthed | round-house kicked | fled | courted

I was clearly confused as I put this together because I've reversed the meaning of transitive and intransitive verbs. My high school grammar teacher would mark me down if she heard about it. But we got some funny things from this grammar. When I asked for 20 sentences (20 of the nonterminal symbol "<s>"), we got something like the following:

        the hungry person round-house kicked Stuart
        the fun frog died
        Bob Barker spooned
        Stuart died
        Paris fled
        the slippery numchucks died
        Jean Luc Picard fled the keen cowboy boots
        the boring shotgun round-house kicked Jean Luc Picard
        Hannibal courted Bob Barker
        the tiny illegal aliens fired
        the hungry pitchfork spooned
        the fun pitchfork fired
        Paris spooned
        Sam round-house kicked Jean Luc Picard
        Paris fled
        Sam courted Bob Barker
        the fun pitchfork birthed Sam
        Hannibal round-house kicked the slippery frog
        Jean Luc Picard fled the hungry shotgun
        the red person ran

We made the grammar even more interesting by replacing the rule for a single adjective with a rule for an adjective phrase:

        <np>: <pn> | the <adjp> <n>

We wanted to allow multiple adjectives. I suggested saying:

        <adjp>: <adj> | <adj> <adj>

That would allow one or two adjectives. Someone quickly realized that by using recursion, we can allow any number of adjectives:

        <adjp>: <adj> | <adj> <adjp>

Because the second choice refers back to <adjp> itself, each expansion can add one more adjective, so we get one or more adjectives in a row. We made similar changes for adverbs and an adverb phrase to end up with this final version of the grammar:

        <s>: <np> <vp>
        <np>: <pn> | the <adjp> <n>
        <pn>: Sam | River Tam | Paris | Stuart | Hannibal | Jean Luc Picard | Bob Barker | Chuck Norris
        <adjp>: <adj> <adjp> | <adj>
        <adj>: red | fun | tiny | hungry | slippery | boring | keen | slothlike
        <n>: frog | shotgun | cowboy boots | person | pitchfork | numchucks | illegal aliens
        <vp>: <iv> <np> | <tv> | <advp> <iv> <np> | <advp> <tv>
        <tv>: died | ran | fired | swooned | spooned | fled
        <iv>: birthed | round-house kicked | fled | courted
        <adv>: really | very | adequately
        <advp>: <adv> | <adv> <advp>

When we ran it, we got sentences like the following:

        Bob Barker adequately fired
        Hannibal fled
        Hannibal ran
        the keen person round-house kicked the slippery tiny pitchfork
        Stuart very courted Chuck Norris
        Chuck Norris adequately swooned
        the hungry person fled the boring tiny slippery slothlike pitchfork
        Paris courted the keen boring keen slippery boring red person
        Bob Barker really courted Jean Luc Picard
        the hungry person died
        the keen person birthed River Tam
        the keen cowboy boots very really really spooned
        Paris adequately really died
        Stuart very very very very really swooned
        the slothlike pitchfork adequately fled Jean Luc Picard
        the boring pitchfork courted Paris
        the keen person really courted Chuck Norris
        Hannibal fired
        River Tam fled
        Sam really birthed Chuck Norris
        Bob Barker adequately really very adequately really very birthed River Tam
        the fun tiny boring shotgun adequately round-house kicked the boring hungry illegal aliens
        Paris spooned
        Hannibal courted the boring cowboy boots
        the red slothlike illegal aliens very very round-house kicked the keen cowboy boots
        the slippery illegal aliens very courted the tiny fun slothlike frog
        Hannibal very very birthed Paris
        the hungry tiny slippery keen fun fun slothlike cowboy boots very fired
        the slothlike tiny numchucks adequately really round-house kicked the hungry slothlike illegal aliens
        the tiny tiny cowboy boots really really swooned

It's important to realize that the grammar program can be used to generate any piece of this grammar, although in this case the most amusing ones to form are the complete sentences.
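To give a feel for how generating "any piece" of a grammar might work, here is a hedged sketch of recursive generation. It is not the program from lecture. It assumes the grammar is stored as a map from each nonterminal to an array of choices, and the names GenerateSketch, sampleGrammar and generate are made up. A symbol that is not a key in the map is a terminal and is returned unchanged; otherwise one choice is picked at random and each of its tokens is generated recursively.

```java
import java.util.Random;
import java.util.SortedMap;
import java.util.TreeMap;

class GenerateSketch {
    // A cut-down grammar holding just the adjective rules from above.
    static SortedMap<String, String[]> sampleGrammar() {
        SortedMap<String, String[]> grammar = new TreeMap<String, String[]>();
        grammar.put("<adjp>", new String[] {"<adj>", "<adj> <adjp>"});
        grammar.put("<adj>", new String[] {"red", "fun", "tiny"});
        return grammar;
    }

    // Returns a random expansion of the given symbol.
    static String generate(SortedMap<String, String[]> grammar, String symbol, Random rand) {
        if (!grammar.containsKey(symbol)) {
            return symbol;  // terminal: nothing to expand
        }
        String[] choices = grammar.get(symbol);
        String choice = choices[rand.nextInt(choices.length)];
        StringBuilder result = new StringBuilder();
        for (String token : choice.trim().split("\\s+")) {
            if (result.length() > 0) {
                result.append(" ");
            }
            result.append(generate(grammar, token, rand));  // recursive step
        }
        return result.toString();
    }

    public static void main(String[] args) {
        SortedMap<String, String[]> grammar = sampleGrammar();
        Random rand = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(generate(grammar, "<adjp>", rand));
        }
    }
}
```

The output differs from run to run, but every line is one or more adjectives. The recursion stops as soon as the one-adjective choice for <adjp> is picked.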

Stuart Reges

Last modified: Mon Feb 6 14:12:12 PST 2006