CSE143 Notes for Friday, 4/30/10

With the next programming assignment, we are asking you to start using more features from the Java class libraries. In particular, we are going to use a kind of collection known as a SortedMap.

As an example, I asked people how we could write a program that would count all of the occurrences of various words in an input file. I had a copy of the text of Moby Dick that we looked at to think about this. I showed some starter code that constructs a Scanner object tied to a file:

        import java.util.*;
        import java.io.*;
        
        public class WordCount {
            public static void main(String[] args) throws FileNotFoundException {
                Scanner console = new Scanner(System.in);
                System.out.print("What is the name of the text file? ");
                String fileName = console.nextLine();
                Scanner input = new Scanner(new File(fileName));

                while (input.hasNext()) {
                    String next = input.next();
                    // process next
                }
            }
        }
Notice that in the loop we use input.next() to read individual words and we have this in a while loop testing against input.hasNext(). I pointed out that we'll have trouble with things like capitalization and punctuation. I said that we should at least turn the string to all lowercase letters so that we don't count Strings like "The" and "the" as different words:

        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            // process next
        }
But I said that dealing with punctuation was more than I wanted to attempt in this program, so I decided that we'd live with the fact that Strings like "the" and "the," and "the." would be considered different words. We're looking for a fairly simple example here, so I didn't want to worry too much about punctuation.

To flesh out this code, we had to think about what kind of data structure to use to keep track of words and their frequencies. One person suggested that we might use arrays or ArrayLists. For example, we could have an ArrayList of words and an ArrayList of counts where element "i" in one corresponds to element "i" in the other. This approach is often described as "parallel arrays." It's not a very object-oriented approach because we really want to associate the word with its counts rather than have a structure that puts all the words together and another that puts all the counts together. Someone suggested that we could make a class for a word/count combination and then have an ArrayList of that. That's true, but Java gives us a better alternative. The collections framework provides a data abstraction known as a map.
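
To make the "parallel arrays" idea concrete, here is a minimal sketch of what counting with two ArrayLists might look like. The class name ParallelListsSketch and the method name record are made up for illustration; nothing like this was written in lecture. Notice how every update has to keep the two lists in step by index.

```java
import java.util.*;

class ParallelListsSketch {
    // Hypothetical helper: counts.get(i) is the count for words.get(i).
    static void record(List<String> words, List<Integer> counts, String word) {
        int i = words.indexOf(word);
        if (i < 0) {
            // first occurrence: add to both lists so the indexes stay aligned
            words.add(word);
            counts.add(1);
        } else {
            counts.set(i, counts.get(i) + 1);
        }
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<String>();
        List<Integer> counts = new ArrayList<Integer>();
        for (String w : "the whale and the sea".split(" ")) {
            record(words, counts, w);
        }
        System.out.println(words + " " + counts);
    }
}
```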

The idea behind a map is that it keeps track of key/value pairs. In our case, we want to keep track of word/count pairs (what is the count for each different word). We often store data this way. For example, in the US we often use a person's social security number as a key to get information about them. I would expect that if I talked to the university registrar, they probably have the ability to look up students based on social security number to find their transcript.

In a map, there is only one value for any given key. If you look up a social security number and get three different student transcripts, that would be a problem. With the Java map objects, if you already have an entry in your map for a particular key, then putting a new key/value pair with that same key into the map will overwrite the old mapping.
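
A small sketch of the overwriting behavior, using the put and get methods of java.util.Map. The class name, the method name buildRegistry, the key "id-001" and the transcript strings are all invented for illustration.

```java
import java.util.*;

class PutOverwriteSketch {
    // Hypothetical example: puts two values under the same made-up key.
    static Map<String, String> buildRegistry() {
        Map<String, String> registry = new TreeMap<String, String>();
        registry.put("id-001", "transcript A");
        // same key again: "transcript A" is replaced, not kept alongside
        registry.put("id-001", "transcript B");
        return registry;
    }

    public static void main(String[] args) {
        Map<String, String> registry = buildRegistry();
        System.out.println(registry.size());        // only one entry remains
        System.out.println(registry.get("id-001")); // the most recent value
    }
}
```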

We looked at an interface in the Java class libraries called Map that is a generic interface. That means that we have to supply type information. Its formal description is Map<K, V>. This is different from the Queue interface in that it has two different types. That's because the map has to know what type of keys you have and what type of values you have. In our case, we have some words (Strings) that we want to associate with some counters (ints). We can't actually use type int because it is a primitive type, but we can use type Integer.

We are going to use a slight variation of Map known as SortedMap. A SortedMap is one that keeps its keys in sorted order. For us, that would mean that the words from the file will be kept in sorted order, which is a nice feature to have. More importantly, you'll need to use a SortedMap for your homework assignment, so we want to practice using that one.

So our map would be of type SortedMap<String, Integer>. In other words, it's a map that keeps track of String/Integer pairs (this String goes to this Integer). SortedMap is the name of the interface, but it's not an actual implementation. The implementation we will use is TreeMap. So we can construct a map called "count" to keep track of our counts by saying:

        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
There are only a few methods that we'll be using from the SortedMap interface. The most basic allow you to put something into the map (an operation called put) and to ask the map for the current value of something (an operation called get).

I asked what code we need to record the word in our map. Someone suggested using the put method to assign it to a count of 1. So our loop becomes:

        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            count.put(next, 1);
        }
This doesn't quite work, but it's getting closer. Each time we encounter a word, it adds it to our map, associating it with a count of 1. This will figure out what the unique words are, but it won't have the right counts for them.

I asked people to think about what to do if a word has been seen before. In that case, we want to increase its count by 1. That means we have to get the old value of the count and add 1 to it:

        count.get(next) + 1
and make this the new value of the counter:

        count.put(next, count.get(next) + 1);
So we have two different calls on put. We want to call the first one when the word is first seen and call the second one if it's already been seen. Someone suggested using an if/else for this. The only question is what test to use. The SortedMap includes a method called containsKey that tests whether or not a certain value is a key stored in the map. Using this method, we modified our code to be:

        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);
            } else {
                count.put(next, count.get(next) + 1);
            }
        }
The first time we see a word, we call the put method and say that the map should associate the word with a count of 1. Later we call put again with a higher count. And we keep calling put every time the count goes up. What happens to the old values that we had put in the map previously? The way the map works, each key is associated with only one value. So when you call put a second or third time, you are wiping out the old association. The new key/value pair replaces the old key/value pair in the map.
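
To try the counting loop without a file, it can be wrapped in a method that takes any Scanner. This is only a sketch: the class name CountWordsSketch, the method name countWords and the sample sentence are assumptions, while the loop body is the one developed above.

```java
import java.util.*;

class CountWordsSketch {
    // Returns a map from each lowercased token to its number of occurrences.
    static SortedMap<String, Integer> countWords(Scanner input) {
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        while (input.hasNext()) {
            String next = input.next().toLowerCase();
            if (!count.containsKey(next)) {
                count.put(next, 1);                    // first time seen
            } else {
                count.put(next, count.get(next) + 1);  // replaces the old count
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A Scanner can read from a String, which is handy for a quick test.
        Scanner input = new Scanner("The whale the sea the, whale");
        System.out.println(countWords(input));
        // prints {sea=1, the=2, the,=1, whale=2}
    }
}
```

The containsKey test matters for more than picking the right count. For a word that is not yet in the map, count.get(next) returns null, and trying to compute null + 1 throws a NullPointerException.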

Then we talked about how to print the results. Clearly we need to iterate over the entries in the map. One way to do this is to request what is known as the "key set". The key set is the set of all keys contained in the map. The Java documentation says that it will be of type Set. We don't have to really worry about this if we use a for-each loop. Remember that a for-each loop iterates over all of the values in a given collection. So we can say:

        for (String word : count.keySet()) {
            // process word
        }
We would read this as, "for each String word that is in count.keySet()..." To process the word, we simply print it out along with its count. How do we get its count? By calling the get method of the map:

        for (String word : count.keySet()) {
            System.out.println(count.get(word) + "\t" + word);
        }
I didn't try to print all of the words in Moby Dick because it would have produced too much output. Instead, I had it show me the counts of words in the program itself. Obviously for large files we want some mechanism to limit the output. At that point I passed out the handout with my commented solution. In that version, I include some extra code that asks for a minimum frequency to use. We ran that on Moby Dick and saw this list of words that occur at least 500 times:

        What is the name of the text file? moby.txt
        Minimum number of occurrences for printing? 500
        4571    a
        1354    all
        587     an
        6182    and
        563     are
        1701    as
        1289    at
        973     be
        1691    but
        1133    by
        1522    for
        1067    from
        754     had
        741     have
        1686    he
        552     him
        2459    his
        1746    i
        3992    in
        512     into
        1555    is
        1754    it
        562     like
        578     my
        1073    not
        506     now
        6408    of
        933     on
        775     one
        675     or
        882     so
        599     some
        2729    that
        14092   the
        602     their
        506     there
        627     they
        1239    this
        4448    to
        551     upon
        1567    was
        644     were
        500     whale
        552     when
        547     which
        1672    with
        774     you
Although I show the output here as being lined up, it didn't look that way in jGRASP. For some reason jGRASP is handling tab characters badly in output.
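
The handout's code for the minimum frequency is not reproduced in these notes, but the idea can be sketched as a filter inside the for-each loop over the key set. The class name MinFrequencySketch, the method name report and the count for "narwhal" are made up for illustration; the counts for "the" and "whale" come from the output above.

```java
import java.util.*;

class MinFrequencySketch {
    // Returns "count<TAB>word" lines for words that occur at least min times.
    static List<String> report(SortedMap<String, Integer> count, int min) {
        List<String> lines = new ArrayList<String>();
        for (String word : count.keySet()) {
            int n = count.get(word);
            if (n >= min) {
                lines.add(n + "\t" + word);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        SortedMap<String, Integer> count = new TreeMap<String, Integer>();
        count.put("the", 14092);
        count.put("whale", 500);
        count.put("narwhal", 3);  // invented count, below the cutoff
        for (String line : report(count, 500)) {
            System.out.println(line);
        }
    }
}
```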

One final point I made about the SortedMap interface is that you can associate just about anything with just about anything. In the word counting program, we associated strings with integers. You could also associate strings with strings. One thing you can't do is to have multiple associations for the same key in a single map. For example, if you decide to associate strings with strings, then any given string can be associated with just a single string. But there's no reason that you can't have the second value be structured in some way. You can associate strings with arrays or strings with ArrayLists.
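
As a sketch of a structured value, here is one way a map from strings to lists of strings might be filled in. The class name MultiValueSketch, the method name add and the sample keys are hypothetical.

```java
import java.util.*;

class MultiValueSketch {
    // Adds value to the list stored under key, creating the list on first use.
    static void add(Map<String, List<String>> map, String key, String value) {
        if (!map.containsKey(key)) {
            map.put(key, new ArrayList<String>());
        }
        map.get(key).add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> examples = new TreeMap<String, List<String>>();
        add(examples, "noun", "kite");
        add(examples, "noun", "ball");
        add(examples, "verb", "hit");
        System.out.println(examples);
    }
}
```

Each key still has exactly one value; that value just happens to be a list that can grow.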

Then I switched to talking about grammars. We are going to use an approach to describing grammars that is known as a "production system". Production systems are well known to those who study formal linguistics, and computer scientists know a lot about them because we design our own languages like Java. This particular style of production is known as BNF (short for Backus-Naur Form). Each production describes the rules for a particular nonterminal symbol. The nonterminal appears first followed by the symbol "::=" which is usually read as "is composed of". On the right-hand side of the "::=" we have a series of rules separated by the vertical bar character which we read as "or". The idea is that the nonterminal symbol can be replaced by any of the sequences of symbols appearing between vertical bar characters.

We can describe the basic structure of an English sentence as follows:

        <s> ::= <np> <vp>
We would read this as, "A sentence (<s>) is composed of a noun phrase (<np>) followed by a verb phrase (<vp>)." The symbols <s>, <np> and <vp> are known as "nonterminals" in the grammar. That means that we don't expect them to appear in the actual sentences that we form from the grammar.

I pointed out that you can draw a diagram of how to derive a sentence from the BNF grammar. Wikipedia has an example of this under the entry for parse tree.

Then we "drilled down" a bit into what a noun phrase might look like. I suggested that the simplest form of noun phrase would be a proper noun, which I expressed this way:

        <np> ::= <pn>
So then I asked people for examples of proper nouns and we ended up with this rule:

        <pn> ::= Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
Notice that the vertical bar character is being used to separate different possibilities. In other words, we're saying that "a proper noun is either Matt or Han Solo or New York or Trogdor..." These values on the right-hand side are examples of "terminals". In other words, we expect these to be part of the actual sentences that are formed.

I pointed out that it is important to realize that the input is "tokenized" using white space. For example, the text "Han Solo" is broken up into two separate tokens. So it's not a single terminal, it's actually two different terminals.
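
One way to see the tokenizing is with String.split on runs of white space. This is only a sketch; the class name TokenizeSketch and the method name tokens are made up, and the method assumes the text contains at least one token.

```java
import java.util.*;

class TokenizeSketch {
    // Splits text on runs of white space after trimming the ends.
    static String[] tokens(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // "Han Solo" becomes two separate terminals
        System.out.println(Arrays.toString(tokens("Han Solo")));
    }
}
```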

At this point I mentioned the fact that we're going to use a slight variation of BNF notation. To keep things simple, we'll use just a simple colon in place of the "::=" in the rules above. So our three rules became:

        <s>: <np> <vp>
        <np>: <pn>
        <pn>: Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
I saved this file and ran the program. It read the file and began by saying:

        Available symbols to generate are:
        [<np>, <pn>, <s>]
        What do you want generated (return to quit)?
I pointed out that we are defining a nonterminal to be any symbol that appears to the left of a colon in one of our productions. The input file has three productions and that is why the program is showing three nonterminals that can be generated by the grammar. I began by asking for it to generate 5 of the "<pn>" nonterminal symbol and got something like this:

        Michael Jackson
        New York
        Trogdor
        Michael Jackson
        Matt
In this case, it is simply choosing at random among the various choices for a proper noun. Then I asked it for five of the "<s>" nonterminal symbol and got something like this:

        Michael Jackson <vp>
        Michael Jackson <vp>
        Trogdor <vp>
        Michael Jackson <vp>
        Pikachu <vp>
In this case, it is generating 5 random sentences that involve choosing 5 random proper nouns. So far the program isn't doing anything very interesting, but it's good to understand the basics of how it works.

I also pointed out that these are not proper sentences because they contain the nonterminal symbol <vp>. That's because we never finished our grammar. We haven't yet defined what a verb phrase looks like. Notice that the program doesn't care about whether or not something is enclosed in the less-than and greater-than characters, as in "<vp>". That's a convention that is often followed in describing grammars, but that's not how our program is distinguishing between terminals and nonterminals. As mentioned earlier, anything that appears to the left of a colon is considered a nonterminal and every other token is considered a terminal.
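
The "left of a colon" rule can be sketched by storing each production in a SortedMap keyed on the symbol before the colon. The program run in lecture is not shown in these notes, so this is not its actual code: the class name GrammarParseSketch, the methods parse and isNonterminal, and the choice of a list of token arrays as the value type are all assumptions. The sketch also assumes every line contains a colon.

```java
import java.util.*;

class GrammarParseSketch {
    // Parses lines of the form "symbol: rule | rule | ..." into a map from
    // each symbol to its list of rules, where each rule is an array of tokens.
    static SortedMap<String, List<String[]>> parse(List<String> lines) {
        SortedMap<String, List<String[]>> rules =
            new TreeMap<String, List<String[]>>();
        for (String line : lines) {
            String[] parts = line.split(":", 2);  // text before and after the colon
            List<String[]> alternatives = new ArrayList<String[]>();
            for (String alt : parts[1].split("\\|")) {
                alternatives.add(alt.trim().split("\\s+"));
            }
            rules.put(parts[0].trim(), alternatives);
        }
        return rules;
    }

    // A token is a nonterminal exactly when it is a key of the map.
    static boolean isNonterminal(SortedMap<String, List<String[]>> rules,
                                 String token) {
        return rules.containsKey(token);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<s>: <np> <vp>",
            "<np>: <pn>",
            "<pn>: Matt | Han Solo | New York");
        SortedMap<String, List<String[]>> rules = parse(lines);
        System.out.println(rules.keySet());               // [<np>, <pn>, <s>]
        System.out.println(isNonterminal(rules, "<vp>")); // false
    }
}
```

Under this scheme "<vp>" is treated as a terminal until a line defining it is added to the file, which matches the output shown above.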

Then I said that there are other kinds of noun phrases than just proper nouns. We might use a word like "the" or "a" followed by a noun. I asked what those words are called and someone said they are determiners. So we added a new rule to the grammar:

        <det>: the | a | an | some | this
Using this, we changed our rule for <np>:

        <np>: <pn> | <det> <n>
Notice how the vertical bar character is used to indicate that a noun phrase is either a proper noun or a determiner followed by a noun. This required the addition of a new rule for nouns and I again asked for suggestions from the audience:

        <n>: kite | ball | house | tornado | bulldozer | mat | narwhal | phaser | laser cat
At this point the overall grammar looked like this:

        <s>: <np> <vp>
        <np>: <pn> | <det> <n>
        <pn>: Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
        <det>: the | a | an | some | this
        <n>: kite | ball | house | tornado | bulldozer | mat | narwhal | phaser | laser cat
We saved the file and ran the program again. Because there are five rules in the grammar, it offered five nonterminals to choose from:

        Available symbols to generate are:
        [<det>, <n>, <np>, <pn>, <s>]
        What do you want generated (return to quit)?
Notice that the nonterminals are in alphabetical order, not in the order in which they appear in the file. That's because they are stored as the keys of a SortedMap that keeps the keys in sorted order.

We asked the program to generate 5 <np> and we got something like this:

        this bulldozer
        an house
        New York
        this house
        this narwhal
In this case, it is randomly choosing between the "proper noun" rule and the other rule that involves a determiner and a noun. It is also then filling in the noun or proper noun to form a string of all terminal symbols. I also asked for five of the nonterminal symbol <s> and got something like this:

        New York <vp>
        some kite <vp>
        the kite <vp>
        Han Solo <vp>
        some laser cat <vp>
This is getting better, but we obviously need to include something for verb phrases. We discussed the difference between transitive verbs that take an object (a noun phrase) and intransitive verbs that don't. This led us to add the following new rules:

        <vp>: <tv> <np> | <iv>
        <tv>: hit | hugged | defenstrated | grokked | laughed at | spooned | smoked
        <iv>: died | exploded | imploded | wept | leveled up | evolved
We saved the file and ran the program again and each of these three showed up as choices to generate:

        Available symbols to generate are:
        [<det>, <iv>, <n>, <np>, <pn>, <s>, <tv>, <vp>]
        What do you want generated (return to quit)?
Now when we asked for 10 sentences (10 of the nonterminal <s>), we got more interesting results like these:

        an tornado frolicked
        Pikachu kicked some bulldozer
        some phaser stole a tornado
        the phaser wept
        New York kicked this house
        this phaser frolicked
        Pikachu compiled Han Solo
        this house ignited
        New York touched this narwhal
        a kite ignited
Then we decided to spice up the grammar a bit by adding adjectives. We added a new rule for individual adjectives:

        <adj>: furry | moist | nauseous | shiny | warm | nautical | delicious | superfluous
Then we talked about how to modify our rule for noun phrases. We kept our old combination of a determiner and a noun, but added a new one for a determiner and a noun with an adjective in the middle:

        <np>: <pn> | <det> <n> | <det> <adj> <n>
But you might want to have more than one adjective. So we introduced a new nonterminal for an adjective phrase:

        <np>: <pn> | <det> <n> | <det> <adjp> <n>
Then we just had to write a production for <adjp>. We want to allow one adjective or two or three, so we could say:

        <adjp>: <adj> | <adj> <adj> | <adj> <adj> <adj>
This is tedious and it doesn't allow four adjectives or five or six. This is a good place to use recursion:

        <adjp>: <adj> | <adj> <adjp>
We are saying that in the simple case or base case, you have one adjective. Otherwise it is an adjective followed by an adjective phrase. This recursive definition is simple, but it allows you to include as many adjectives as you want.
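
The recursive rule suggests a recursive way of generating text: pick one rule at random, then expand each of its tokens the same way. The sketch below is an assumption about how that could be done, not the program used in lecture. The class name GenerateSketch, the method name generate and the array-of-arrays representation are invented for illustration.

```java
import java.util.*;

class GenerateSketch {
    // Expands symbol into a string of terminals. A symbol with no entry in
    // the map is a terminal and is returned unchanged (the base case).
    static String generate(Map<String, String[][]> rules, String symbol,
                           Random rand) {
        if (!rules.containsKey(symbol)) {
            return symbol;
        }
        String[][] alternatives = rules.get(symbol);
        String[] chosen = alternatives[rand.nextInt(alternatives.length)];
        StringBuilder result = new StringBuilder();
        for (String token : chosen) {
            if (result.length() > 0) {
                result.append(" ");
            }
            result.append(generate(rules, token, rand));  // recursive case
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String[][]> rules = new TreeMap<String, String[][]>();
        rules.put("<adj>", new String[][] { {"furry"}, {"moist"}, {"shiny"} });
        rules.put("<adjp>", new String[][] { {"<adj>"}, {"<adj>", "<adjp>"} });
        Random rand = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(generate(rules, "<adjp>", rand));
        }
    }
}
```

Because each expansion of <adjp> has an even chance of picking the one-adjective rule, long phrases are possible but become less and less likely.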

When we ran the program again, we started by asking for 5 adjective phrases and got a result like this:

        warm warm furry moist
        delicious
        superfluous
        nautical nauseous delicious shiny
        nautical
Notice that sometimes we get just one adjective ("delicious") and sometimes we get several because it chooses randomly between the two different rules we introduced for adjective phrase.

This produced even more interesting sentences, as in the following 10:

        New York evolved
        an superfluous mat spooned Trogdor
        the moist kite imploded
        some bulldozer hit an nauseous nauseous laser cat
        an kite exploded
        the shiny narwhal exploded
        the nauseous phaser died
        Michael Jackson evolved
        Pikachu grokked the ball
        the kite hit an phaser
We made one last set of changes to the grammar to include adverbs and ended up with this final version of the grammar:

        <s>: <np> <vp>
        <np>: <pn> | <det> <n> | <det> <adjp> <n>
        <pn>: Matt | Han Solo | New York | Trogdor | Pikachu | Michael Jackson
        <det>: the | a | an | some | this
        <n>: kite | ball | house | tornado | bulldozer | mat | narwhal | phaser | laser cat
        <adj>: furry | moist | nauseous | shiny | warm | nautical | delicious | superfluous
        <adjp>: <adj> | <adj> <adjp>
        <vp>: <tv> <np> | <iv> | <adv> <vp>
        <tv>: hit | hugged | defenstrated | grokked | laughed at | spooned | smoked
        <iv>: died | exploded | imploded | wept | leveled up | evolved
        <adv>: slowly | hungrily | viciously | dangerously
Below are 25 sample sentences generated by the grammar:

        an bulldozer grokked Pikachu
        an nauseous tornado hungrily spooned Han Solo
        an superfluous furry nauseous house died
        this mat defenstrated New York
        a mat slowly dangerously evolved
        the kite hungrily smoked this kite
        this narwhal died
        this mat leveled up
        Matt hungrily defenstrated the furry phaser
        the furry mat defenstrated this mat
        a superfluous nauseous narwhal imploded
        this superfluous ball grokked Pikachu
        an superfluous nauseous warm furry delicious furry phaser hit some bulldozer
        an nauseous nautical warm narwhal slowly slowly died
        an moist moist superfluous narwhal leveled up
        the nauseous superfluous delicious shiny kite wept
        Trogdor hit an house
        an superfluous superfluous phaser leveled up
        Matt spooned some delicious laser cat
        a warm delicious mat dangerously hungrily smoked the narwhal
        Han Solo exploded
        Michael Jackson evolved
        Michael Jackson laughed at the phaser
        this nauseous nauseous phaser smoked Matt
        some nauseous warm ball spooned a kite

Stuart Reges
Last modified: Sun May 2 19:01:53 PDT 2010