CSE143 Notes for Friday, 4/27/12

We began with a short discussion of the Sierpinski fractal example from the previous lecture. I pointed out that each recursive call is given the three points defining a particular triangle and computes the three midpoints of the triangle's sides so that it can specify three smaller triangles inside the original. It then makes three recursive calls using appropriate combinations of those points. Only in the base case does it actually draw a triangle. As a result, the number of triangles drawn for different levels has this pattern:

        level    triangles
        -------------------
          1          1
          2          3
          3          9
          4         27
          5         81
          6        243
          ...
          n       3^(n-1)
This is exponential growth: each additional level triples the number of triangles. The theoretical fractal has infinitely many triangles. I pointed out that the recursive definition is fairly simple. So it is clear that recursive definitions make it possible to easily define code that has exponential complexity.
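The pattern in the table can be checked with a small sketch. This is not the drawing code from lecture; the class and method names (SierpinskiCount, countTriangles) are made up for illustration, and the method only counts the triangles that the base case would draw.

```java
class SierpinskiCount {
    // Counts the triangles drawn at a given level, following the same shape
    // as the recursive drawing: the base case "draws" one triangle and every
    // other level makes three recursive calls.
    static long countTriangles(int level) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be at least 1");
        }
        if (level == 1) {
            return 1;  // base case: a single triangle is drawn
        }
        // three smaller triangles, each handled by its own recursive call
        return countTriangles(level - 1)
             + countTriangles(level - 1)
             + countTriangles(level - 1);
    }

    public static void main(String[] args) {
        for (int level = 1; level <= 6; level++) {
            System.out.println(level + "\t" + countTriangles(level));
        }
    }
}
```

Writing the three calls out separately (instead of multiplying by 3) is deliberate: it makes the number of calls grow the same way the number of triangles does.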

Then I spent some time discussing the concept of regular expressions. Regular expressions are used to describe text patterns. Very simple regular expressions define a pattern for just one specific string. More often we use regular expressions that describe a family of strings that all have the same pattern. For example, we might define a regular expression that represents "1 or more spaces." In effect, that regular expression describes infinitely many strings (a string of one space, a string of two spaces, a string of three spaces, and so on).

Learning how to write regular expressions takes time because there are many issues that come up. There are entire books written about how to write regular expressions. Amazon lists over 150 books on regular expressions like Jeffrey Friedl's 500+ page Mastering Regular Expressions. So this is a complex topic that is difficult to cover quickly. But it is also a useful topic to study because regular expressions are so powerful. I mentioned that I ask you to use regular expressions in the next programming assignment, so I wanted to spend some time looking at basic ideas. You won't be required to write your own regular expressions because the assignment writeup provides you with the expressions you'll want to use.

This topic is giving us a glimpse into a much bigger field known as parsing. We often find ourselves wanting to process an input file that has a certain structure. The first step is to tokenize the input. In other words, we have to break it up into individual tokens. We saw in the 142 class that a Scanner can be used to do this. Another approach is to call the split method for a string. The split method returns an array of strings.

I said that I had written a short program for exploring this. You enter a string to split and then enter various regular expressions. The program shows the resulting array of strings. For example, we began by splitting a string using a single space as the regular expression:

        string to split? four score and seven years ago
            regular expression to use (q to quit)?  
            expression = ' '
            result = ['four', 'score', 'and', 'seven', 'years', 'ago']
I pointed out that my program puts quotes around the strings inside this list to make it easier to see what is in each string. The split method does not include any quotes.

We're used to breaking a string into tokens using a space, but you can use any character you want. For example, here is what we got when we split the same string on the characters 's' and 'e':

            regular expression to use (q to quit)? s
            expression = 's'
            result = ['four ', 'core and ', 'even year', ' ago']
            regular expression to use (q to quit)? e
            expression = 'e'
            result = ['four scor', ' and s', 'v', 'n y', 'ars ago']
The regular expression is describing the pattern to use for what are known as delimiters. Delimiters separate tokens. When we tell Java to use the letter 's' as a delimiter, it breaks up the string every time it sees an 's'. Notice that the delimiter is not included in the resulting strings.
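A rough sketch of this kind of exploration in Java is shown here. It is not the program I ran in lecture; the class name SplitExplorer and the helper show are invented for illustration. It calls String.split and adds the quotes itself, since split does not include them.

```java
import java.util.StringJoiner;

class SplitExplorer {
    // Splits text on the given regular expression and formats the tokens
    // with single quotes around each one so empty or padded tokens are visible.
    static String show(String text, String regex) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (String token : text.split(regex)) {
            joiner.add("'" + token + "'");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String text = "four score and seven years ago";
        System.out.println(show(text, " "));  // split on a single space
        System.out.println(show(text, "s"));  // split on the letter s
        System.out.println(show(text, "e"));  // split on the letter e
    }
}
```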

Then we looked at what happens when there is more than one space between words:

        string to split? four     score    and       seven     years
            regular expression to use (q to quit)?  
            expression = ' '
            result = ['four', '', '', '', '', 'score', '', '', '', 'and', '', '', '', '', '', '', 'seven', '', '', '', '', 'years']
There are 5 spaces separating the word "four" from the word "score". By saying that our regular expression is a single space, we are telling Java that every one of those spaces is meaningful. As a result, it produces a lot of empty strings because many of those spaces have nothing between them. We got a little closer when we used two spaces as the regular expression:

            regular expression to use (q to quit)?   
            expression = '  '
            result = ['four', '', ' score', '', 'and', '', '', ' seven', '', ' years']
This produced fewer empty strings, but notice what happened to the 5 spaces between "four" and "score". The first pair of spaces indicated that "four" was a token. The next two spaces produced an empty string. And the fifth space is included with the word "score".

This isn't in general what we want. We'd rather say that spaces are important, but Java should treat a sequence of spaces as just one delimiter. We can do that by putting a plus sign after the space in the regular expression:

            regular expression to use (q to quit)?  +
            expression = ' +'
            result = ['four', 'score', 'and', 'seven', 'years']
The plus sign is a modifier that says that we want "1 or more" of whatever comes before it. So in this case, it says that the delimiter is "1 or more spaces."
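The difference between the three delimiters can be seen on just the first two words. The sketch below uses hypothetical names (SpaceRuns, tokens) and builds the five spaces with String.repeat, which assumes Java 11 or later.

```java
import java.util.Arrays;

class SpaceRuns {
    // Thin wrapper around String.split so the three cases read the same way.
    static String[] tokens(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        // "four" and "score" separated by exactly five spaces
        String text = "four" + " ".repeat(5) + "score";

        // one space: every space is its own delimiter, so empty strings appear
        System.out.println(Arrays.toString(tokens(text, " ")));
        // two spaces: two delimiters, and the fifth space sticks to "score"
        System.out.println(Arrays.toString(tokens(text, "  ")));
        // one or more spaces: the whole run is a single delimiter
        System.out.println(Arrays.toString(tokens(text, " +")));
    }
}
```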

I then gave an example involving tabs. It's important to realize that there is a special character that represents tab. It's not the same as a sequence of spaces. I typed in the text again with tabs between the words and we found that it didn't split when we used a space as the regular expression:

        string to split? four	score	and	seven	years	ago
            regular expression to use (q to quit)?  
            expression = ' '
            result = ['four	score	and	seven	years	ago']
Notice that we have a single string with tab characters in it as the result. I was able to split it properly by typing in the tab character itself as the regular expression to use:

            regular expression to use (q to quit)? 	
            expression = '	'
            result = ['four', 'score', 'and', 'seven', 'years', 'ago']
Look inside the quotes for the expression and it will look like I typed many characters, but in fact that is a single tab character. This is hard to read, so we normally use the escape sequence \t instead:

            regular expression to use (q to quit)? \t
            expression = '\t'
            result = ['four', 'score', 'and', 'seven', 'years', 'ago']
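In Java source code there are two ways to write this, and both should split on tabs. The sketch below (the names TabSplit and countTokens are made up) shows a Java string containing an actual tab character and a Java string containing the two-character regular expression escape \t.

```java
class TabSplit {
    // Returns how many tokens result from splitting text on regex.
    static int countTokens(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String text = "four\tscore\tand\tseven\tyears\tago";

        // A space never matches a tab, so the string is not split at all.
        System.out.println(countTokens(text, " "));

        // "\t" is a Java escape: the regular expression is one real tab character.
        System.out.println(countTokens(text, "\t"));

        // "\\t" passes backslash-t to the regular expression engine,
        // which also interprets it as a tab.
        System.out.println(countTokens(text, "\\t"));
    }
}
```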
Then we looked at an example where words were separated by combinations of spaces and dashes. Using what we've seen so far, we can split on one or more spaces (" +") and we can split on one or more dashes ("-+"), but neither of those was quite right:

        string to split? four--score    and --seven-----   years -ago
            regular expression to use (q to quit)?  +
            expression = ' +'
            result = ['four--score', 'and', '--seven-----', 'years', '-ago']
            regular expression to use (q to quit)? -+
            expression = '-+'
            result = ['four', 'score    and ', 'seven', '   years ', 'ago']
We want to split on 1 or more of either a space or a dash. You can indicate that in a regular expression by using square brackets:

            regular expression to use (q to quit)? [ -]+
            expression = '[ -]+'
            result = ['four', 'score', 'and', 'seven', 'years', 'ago']
By forming the regular expression "[ -]", we are saying, "either a space or a dash." By putting a plus after it we are saying, "1 or more of that."

It's more normal to use this kind of expression to split on different kinds of whitespace characters. For example, I entered a string that had multiple spaces and multiple tab characters separating words. In that case, the regular expressions we had been using for "1 or more spaces" and "1 or more tabs" didn't split the string properly:

        string to split? four  		  score    		 and
            regular expression to use (enter to quit)?  +
            expression = ' +'
            result = ['four', '		', 'score', '		', 'and']
            regular expression to use (enter to quit)? \t+
            expression = '\t+'
            result = ['four  ', '  score    ', ' and']
As with the space/dash example, here we want to use the square brackets to say that we want to split on 1 or more of either a space or a tab:

            regular expression to use (enter to quit)? [ \t]+
            expression = '[ \t]+'
            result = ['four', 'score', 'and']
This is an expression that I ask you to use in the programming assignment to separate tokens by whitespace.
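Here is a small sketch of the square bracket expressions applied with split. The class and method names (CharClassSplit, tokens) are invented for illustration. One caveat that I believe holds for Java's split: if the text begins with a delimiter, the result starts with an empty string, so a caller may want to trim the text first.

```java
import java.util.Arrays;

class CharClassSplit {
    static String[] tokens(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String dashes = "four--score    and --seven-----   years -ago";
        String mixed = "four  \t\t  score    \t\t and";

        // one or more of: space or dash
        System.out.println(Arrays.toString(tokens(dashes, "[ -]+")));

        // one or more of: space or tab
        System.out.println(Arrays.toString(tokens(mixed, "[ \t]+")));

        // text that starts with a delimiter: trim() avoids a leading empty token
        String padded = "   four score";
        System.out.println(tokens(padded, "[ \t]+").length);
        System.out.println(tokens(padded.trim(), "[ \t]+").length);
    }
}
```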

I gave a quick glimpse of some other regular expression constructs. For example, suppose that we wanted to split a string that has lots of punctuation characters. If we are trying to identify the words, we could list every possible letter of the alphabet as part of our regular expression:

        string to split (enter to quit)? This&&^^$$- isn't!!!,,,going;;;to  be<><>easy!
            expression = '[abcdefghijklmnopqrstuvwxyz]+'
            result = ['T', '&&^^$$- ', ''', '!!!,,,', ';;;', '  ', '<><>', '!']
But there is an easier way. Inside the brackets we can write "a-z" to indicate all 26 of those letters:

            regular expression to use (q to quit)? [a-z]+
            expression = '[a-z]+'
            result = ['T', '&&^^$$- ', ''', '!!!,,,', ';;;', '  ', '<><>', '!']
Someone asked why this was leaving behind the capital 'T'. That's because the uppercase letters are different than the lowercase letters. But there's no reason we can't include them as well:

            regular expression to use (q to quit)? [a-zA-Z]+
            expression = '[a-zA-Z]+'
            result = ['', '&&^^$$- ', ''', '!!!,,,', ';;;', '  ', '<><>', '!']
This regular expression is interesting but it is the exact opposite of what we want. Instead of removing the words, we want to remove the rest. There is an easy way to do that in regular expressions. Inside the brackets, we can include a caret character (^) first as a way to say "not any of these":

            regular expression to use (q to quit)? [^a-zA-Z]+
            expression = '[^a-zA-Z]+'
            result = ['This', 'isn', 't', 'going', 'to', 'be', 'easy']
So this regular expression would be read as, "1 or more of characters that are not a-z or A-Z." Even this isn't quite right because the word "isn't" was split. If we want to include the apostrophe character as potentially part of a word, we can do that:

            regular expression to use (q to quit)? [^a-zA-Z']+
            expression = '[^a-zA-Z']+'
            result = ['This', 'isn't', 'going', 'to', 'be', 'easy']
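The progression from ranges to a negated set can be sketched with split as well. The names WordSplit, TEXT, and tokens are hypothetical and the code is only an illustration of the expressions discussed here.

```java
import java.util.Arrays;

class WordSplit {
    static final String TEXT = "This&&^^$$- isn't!!!,,,going;;;to  be<><>easy!";

    static String[] tokens(String regex) {
        return TEXT.split(regex);
    }

    public static void main(String[] args) {
        // letters are the delimiters, so the punctuation is what survives
        System.out.println(Arrays.toString(tokens("[a-zA-Z]+")));

        // non-letters are the delimiters, so the words survive,
        // but the apostrophe splits "isn't" in two
        System.out.println(Arrays.toString(tokens("[^a-zA-Z]+")));

        // adding the apostrophe to the set keeps "isn't" together
        System.out.println(Arrays.toString(tokens("[^a-zA-Z']+")));
    }
}
```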
I also briefly mentioned that regular expressions can be used to configure a Scanner. The Scanner object has a method called useDelimiter that can be used to control how it tokenizes the input. There is an example in the chapter 10 case study where the following line of code uses the regular expression above to instruct the Scanner to ignore everything other than letters and the apostrophe:

        input.useDelimiter("[^a-zA-Z']+");
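As a sketch of how that line might be used (this is not the case study code; the class WordScanner and the method words are invented for illustration), here is a Scanner that reads from a String and collects the words it finds:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordScanner {
    // Returns the words in text, where a word is a run of letters and apostrophes.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            // everything that is not a letter or an apostrophe is a delimiter
            input.useDelimiter("[^a-zA-Z']+");
            while (input.hasNext()) {
                result.add(input.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("This&&^^$$- isn't!!!,,,going;;;to  be<><>easy!"));
    }
}
```

The same call works for a Scanner attached to a file or to System.in; a String is used here only to keep the example self-contained.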
Then I switched to talking about grammars. We are going to use an approach to describing grammars that is known as a "production system". Production systems are well known to those who study formal linguistics, and computer scientists know a lot about them because we design our own languages like Java. This particular style of production is known as BNF (short for Backus-Naur Form). Each production describes the rules for a particular nonterminal symbol. The nonterminal appears first, followed by the symbol "::=", which is usually read as "is composed of". On the right-hand side of the "::=" we have a series of rules separated by the vertical bar character, which we read as "or". The idea is that the nonterminal symbol can be replaced by any of the sequences of symbols appearing between vertical bar characters.

We can describe the basic structure of an English sentence as follows:

        <s>::= <np> <vp>
We would read this as, "A sentence (<s>) is composed of a noun phrase (<np>) followed by a verb phrase (<vp>)." The symbols <s>, <np> and <vp> are known as "nonterminals" in the grammar. That means that we don't expect them to appear in the actual sentences that we form from the grammar.

I pointed out that you can draw a diagram of how to derive a sentence from the BNF grammar. Wikipedia has an example of this under the entry for parse tree.

Then we "drilled down" a bit into what a noun phrase might look like. I suggested that the simplest form of noun phrase would be a proper noun, which I expressed this way:

        <np>::= <pn>
So then I asked people for examples of proper nouns and we ended up with this rule:

        <pn>::= Stuart | Timmy | T-Rex | Seattle | Ke$ha | Santorum | Batman
Notice that the vertical bar character is being used to separate different possibilities. In other words, we're saying that "a proper noun is either Stuart or Timmy or T-Rex or Seattle or ...". These values on the right-hand side are examples of "terminals". In other words, we expect these to be part of the actual sentences that are formed.

So at this point we had these three rules:

        <s>::= <np> <vp>
        <np>::= <pn>
        <pn>::= Stuart | Timmy | T-Rex | Seattle | Ke$ha | Santorum | Batman
I saved this file and ran the program. It read the file and began by saying:

        Available symbols to generate are:
        [<np>, <pn>, <s>]
        What do you want generated (return to quit)?
I pointed out that we are defining a nonterminal to be any symbol that appears to the left of "::=" in one of our productions. The input file has three productions and that is why the program is showing three nonterminals that can be generated by the grammar. I began by asking for it to generate 5 of the "<pn>" nonterminal symbol and got something like this:

        Timmy
        T-Rex
        T-Rex
        Stuart
        Santorum
In this case, it is simply choosing at random among the various choices for a proper noun. Then I asked it for five of the "<s>" nonterminal symbol and got something like this:

        Batman <vp>
        Stuart <vp>
        Timmy <vp>
        Stuart <vp>
        Timmy <vp>
In this case, it is generating 5 random sentences that involve choosing 5 random proper nouns. So far the program isn't doing anything very interesting, but it's good to understand the basics of how it works.

I also pointed out that these are not proper sentences because they contain the nonterminal symbol <vp>. That's because we never finished our grammar. We haven't yet defined what a verb phrase looks like. Notice that the program doesn't care about whether or not something is enclosed in the less-than and greater-than characters, as in "<vp>". That's a convention that is often followed in describing grammars, but that's not how our program is distinguishing between terminals and nonterminals. As mentioned earlier, anything that appears to the left of "::=" is considered a nonterminal and every other token is considered a terminal.

Then I said that there are other kinds of noun phrases than just proper nouns. We might use a word like "the" or "a" followed by a noun. I asked what those words are called and someone said they are determiners. So we added a new rule to the grammar:

        <det>::= a | the | some | that
Using this, we changed our rule for <np>:

        <np>::= <pn> | <det> <n>
Notice how the vertical bar character is used to indicate that a noun phrase is either a proper noun or it's a determiner followed by a noun. This required the addition of a new rule for nouns and I again asked for suggestions from the audience:

        <n>::= goat | car | white house | cat | husky | computer | banana
I pointed out that it is important to realize that the input is "tokenized" using white space. For example, the text "white house" is broken up into two separate tokens. So it's not a single terminal, it's actually two different terminals.

At this point the overall grammar looked like this:

        <s>::= <np> <vp>
        <np>::= <pn> | <det> <n>
        <pn>::= Stuart | Timmy | T-Rex | Seattle | Ke$ha | Santorum | Batman
        <det>::= a | the | some | that
        <n>::= goat | car | white house | cat | husky | computer | banana
We saved the file and ran the program again. Because there are five rules in the grammar, it offered five nonterminals to choose from:

        Available symbols to generate are:
        [<det>, <n>, <np>, <pn>, <s>]
        What do you want generated (return to quit)?
Notice that the nonterminals are in alphabetical order, not in the order in which they appear in the file. That's because they are stored as the keys of a SortedMap that keeps the keys in sorted order.
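One way to picture this is the sketch below. It is only an illustration and not the assignment solution: the class GrammarReader, the method parse, and the particular expressions used to split each line ("::=", "[|]", and "[ \t]+") are my assumptions here, and the assignment writeup remains the authority on which expressions to use. Each line is split into a nonterminal and its rules, each rule is split into tokens on whitespace, and the results are stored in a TreeMap, which is a SortedMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class GrammarReader {
    // Maps each nonterminal to its list of rules; each rule is an array of tokens.
    static SortedMap<String, List<String[]>> parse(List<String> lines) {
        SortedMap<String, List<String[]>> grammar = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;  // skip blank lines
            }
            String[] parts = line.split("::=");      // nonterminal, then its rules
            String nonterminal = parts[0].trim();
            List<String[]> rules = new ArrayList<>();
            for (String rule : parts[1].split("[|]")) {  // rules separated by |
                rules.add(rule.trim().split("[ \t]+"));  // tokens separated by whitespace
            }
            grammar.put(nonterminal, rules);
        }
        return grammar;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<s>::= <np> <vp>",
            "<np>::= <pn> | <det> <n>",
            "<pn>::= Stuart | Timmy | T-Rex | Seattle | Ke$ha | Santorum | Batman",
            "<det>::= a | the | some | that",
            "<n>::= goat | car | white house | cat | husky | computer | banana");
        SortedMap<String, List<String[]>> grammar = parse(lines);
        System.out.println(grammar.keySet());  // keys come back in sorted order
        // "white house" is one rule made of two separate tokens
        System.out.println(Arrays.toString(grammar.get("<n>").get(2)));
    }
}
```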

We asked the program to generate 5 <np> and we got something like this:

        that banana
        Stuart
        Timmy
        a cat
        the banana
In this case, it is randomly choosing between the "proper noun" rule and the other rule that involves a determiner and a noun. It is also then filling in the noun or proper noun to form a string of all terminal symbols. I also asked for five of the nonterminal symbol <s> and got something like this:

        that car <vp>
        Seattle <vp>
        a white house <vp>
        that banana <vp>
        that goat <vp>
This is getting better, but we obviously need to include something for verb phrases. We discussed the difference between transitive verbs that take an object (a noun phrase) and intransitive verbs that don't. This led us to add the following new rules:

        <vp>::= <tv> <np> | <iv>
        <iv>::= laughed | died | surged | exploded | coded | ran
        <tv>::= interrogated | desanguinated | defenestrated | washed | pushed | ambushed
We saved the file and ran the program again and each of these three showed up as choices to generate:

        Available symbols to generate are:
        [<det>, <iv>, <n>, <np>, <pn>, <s>, <tv>, <vp>]
        What do you want generated (return to quit)?
Now when we asked for 10 sentences (10 of the nonterminal <s>), we got more interesting results like these:

        Stuart coded
        the computer desanguinated some husky
        Santorum died
        Stuart ambushed a banana
        Batman ran
        a goat exploded
        Timmy ambushed Santorum
        T-Rex interrogated Timmy
        Ke$ha laughed
        Timmy interrogated that goat
Then we decided to spice up the grammar a bit by adding adjectives. We added a new rule for individual adjectives:

        <adj>::= green | smelly | feral | loving | fluffy | shiny | sparkly
Then we talked about how to modify our rule for noun phrases. We kept our old combination of a determiner and a noun, but added a new one for a determiner and a noun with an adjective in the middle:

        <np>::= <pn> | <det> <n> | <det> <adj> <n>
But you might want to have more than one adjective. So we introduced a new nonterminal for an adjective phrase:

        <np>::= <pn> | <det> <n> | <det> <adjp> <n>
Then we just had to write a production for <adjp>. We want to allow one adjective or two or three, so we could say:

        <adjp>::= <adj> | <adj> <adj> | <adj> <adj> <adj>
This is tedious and it doesn't allow four adjectives or five or six. This is a good place to use recursion:

        <adjp>::= <adj> | <adj> <adjp>
We are saying that in the simple case or base case, you have one adjective. Otherwise it is an adjective followed by an adjective phrase. This recursive definition is simple, but it allows you to include as many adjectives as you want.
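The recursion in the grammar lines up naturally with recursion in code. The sketch below is a minimal illustration, not the assignment solution: the class GrammarGenerator, its hard-coded two-rule grammar, and the method generate are all hypothetical. A symbol with no rules is treated as a terminal (the base case); otherwise one rule is chosen at random and each of its tokens is generated recursively.

```java
import java.util.Map;
import java.util.Random;
import java.util.StringJoiner;
import java.util.TreeMap;

class GrammarGenerator {
    private final Map<String, String[][]> rules = new TreeMap<>();
    private final Random random;

    GrammarGenerator(Random random) {
        this.random = random;
        // a tiny hard-coded grammar, just enough to show the recursion
        rules.put("<adj>", new String[][] {{"green"}, {"smelly"}, {"feral"}, {"fluffy"}});
        rules.put("<adjp>", new String[][] {{"<adj>"}, {"<adj>", "<adjp>"}});
    }

    // Returns a random expansion of the given symbol.
    String generate(String symbol) {
        String[][] choices = rules.get(symbol);
        if (choices == null) {
            return symbol;  // base case: a terminal stands for itself
        }
        // recursive case: pick one rule at random and expand each token in it
        String[] rule = choices[random.nextInt(choices.length)];
        StringJoiner result = new StringJoiner(" ");
        for (String token : rule) {
            result.add(generate(token));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        GrammarGenerator generator = new GrammarGenerator(new Random());
        for (int i = 0; i < 5; i++) {
            System.out.println(generator.generate("<adjp>"));
        }
    }
}
```

Because each expansion of <adjp> picks the one-adjective rule about half the time, long adjective phrases are possible but increasingly unlikely.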

When we ran the program again, we started by asking for 5 noun phrases (<np>) and got a result like this:

        that feral fluffy feral cat
        the white house
        some goat
        that shiny cat
        the smelly fluffy fluffy loving fluffy husky
Notice that sometimes we get just one adjective ("shiny") and sometimes we get several because it chooses randomly between the two different rules we introduced for adjective phrase.

This produced even more interesting sentences, as in the following 10:

        Ke$ha died
        a sparkly white house surged
        the goat desanguinated a shiny smelly cat
        that smelly shiny shiny white house ambushed Seattle
        a feral feral banana exploded
        Seattle laughed
        some banana interrogated the goat
        the goat surged
        a feral green white house ambushed that fluffy feral smelly shiny sparkly husky
        that banana surged
We made one last set of changes to the grammar to include adverbs and ended up with this final version of the grammar:

        <s>::= <np> <vp>
        <np>::= <pn> | <det> <n> | <det> <adjp> <n>
        <pn>::= Stuart | Timmy | T-Rex | Seattle | Ke$ha | Santorum | Batman
        <det>::= a | the | some | that
        <n>::= goat | car | white house | cat | husky | computer | banana
        <vp>::= <tv> <np> | <iv> | <adv> <vp>
        <iv>::= laughed | died | surged | exploded | coded | ran
        <tv>::= interrogated | desanguinated | defenestrated | washed | pushed | ambushed
        <adj>::= green | smelly | feral | loving | fluffy | shiny | sparkly
        <adjp>::= <adj> | <adj> <adjp>
        <adv>::= quickly | timidly | energeticaly | mercilessly | efficiently
Below are 25 sample sentences generated by the grammar:

        the husky mercilessly ambushed Stuart
        some car pushed T-Rex
        Ke$ha efficiently quickly energeticaly energeticaly exploded
        that smelly goat defenestrated the car
        some husky exploded
        T-Rex ran
        that fluffy shiny green goat desanguinated the cat
        that loving white house mercilessly defenestrated some banana
        some loving loving white house timidly died
        some computer died
        Santorum washed T-Rex
        T-Rex quickly desanguinated Timmy
        that banana desanguinated the loving computer
        Stuart mercilessly died
        T-Rex exploded
        Seattle coded
        Stuart surged
        Seattle laughed
        Seattle ran
        a white house energeticaly quickly quickly surged
        some goat interrogated Stuart
        a smelly goat desanguinated that shiny shiny husky
        Ke$ha energeticaly efficiently efficiently surged
        that white house desanguinated some fluffy shiny sparkly sparkly sparkly feral banana
        a sparkly banana energeticaly quickly quickly energeticaly mercilessly surged

Stuart Reges
Last modified: Fri Apr 27 18:22:52 PDT 2012