CSE341 Notes for Wednesday, 5/13/09

I mentioned that I will be reusing an assignment in which we'll explore how to write an interpreter for the BASIC programming language. It is split into two parts. The first part involves parsing certain expressions.

The technique we will be using is known as recursive descent parsing. You can think of it as a poor man's approach to compiling. If you enjoy what we do here, then you should consider taking the compilers course and learning about all of the techniques we have developed to do this in a much more efficient manner. But for our purposes, recursive descent parsing will work just fine.

The idea is to express a grammar for the language as a series of BNF productions and then to write a different recursive definition for each production. BNF grammars tend to be mutually recursive, so our Scheme functions will tend to be mutually recursive.

I spent a few minutes discussing some aspects of this task that may seem a bit odd. Consider, for example, this bit of a Java program:

        for (int i=2*3/4 + 2+7; i*x <= 3.7 * y; i = i*3+7)

Suppose you wanted to write a Java interpreter. You would have to somehow process that input. We begin by tokenizing it. So imagine turning this into a list of tokens in a Scheme list:

        (for ( int i = 2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 ) )

Because parentheses are so integral to Scheme syntax, I said that we'll be replacing them with tokens lparen and rparen. So the list above would really be turned into:

        (for lparen int i = 2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )

I mentioned that we will want to write various parsing functions that process different parts of this list of tokens. In our case, we'll assume that the tokens you want to pay attention to are at the front of the list. As tokens are processed, we remove them from the front of the list. We refer to this as consuming tokens. So imagine that we have consumed tokens up to the first numerical expression in the list above:

        (2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )

How would we go about processing a numerical expression? We see a 2 at the front of the list and that's an expression, so we might conclude that this expression evaluates to 2. Someone said that's not right because there are other tokens that are part of the expression. This is an important point to understand. We want to write parsing routines that are greedy in the sense that they will consume as much input as they can that appears at the front of the list. So we want to consume all of these tokens:

        (2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
         ~~~~~~~~~~~~~~~~~

For the programming assignment I am asking you to write functions that will consume a sequence like this and replace it with a single value: the result of evaluating the expression. In Java, this expression evaluates to 10, so we'd want our parsing function to replace these tokens with the value 10:

        (10 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )

Notice that it leaves the rest of the input in the list to be processed by other functions.
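
A small sketch of the convention this implies: a parsing function hands back one list whose first element is the computed value and whose remaining elements are the unconsumed tokens. The names result, value, and remaining are made up for this illustration, and the ; tokens are omitted because a semicolon starts a comment in Scheme source.

```scheme
;; Hypothetical result of a parsing function after it has consumed
;; the tokens 2 * 3 / 4 + 2 + 7 and replaced them with 10.
;; The ; tokens are left out: a semicolon begins a Scheme comment.
(define result '(10 i * x <= 3.7 * y rparen))

;; The value that replaced the consumed tokens.
(define value (car result))

;; The tokens still waiting to be handled by other parsing functions.
(define remaining (cdr result))
```

A caller would typically use value right away and pass remaining along to whichever parsing function handles the next part of the input.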

I went through an example that I said would serve as a medium hint for the programming assignment. I said it was a bit silly, but it would be a helpful way to understand the assignment. In our sample parsing task, we are going to replace various tokens with a single string value. The lowest level production rule in our grammar is for something called an <item>:

        <item> ::= <number> | <symbol>

In this case, we simply replace a number or symbol with its string equivalent. Our first attempt was this:

        (define (parse-item lst)
          (let ((first (car lst)))
            (cond ((number? first) (cons (number->string first) (cdr lst)))
                  ((symbol? first) (cons (symbol->string first) (cdr lst)))
                  (else (error "item error")))))

The grammar doesn't describe anything other than a number or symbol, but we added the final clause because we want to make sure that we produce an error message if we see something unexpected. In this case, if we were expecting an item and we see something else, we should complain.

We found that this worked fairly well in all but two cases:

        > (parse-item '(3.4 + 9.8))
        ("3.4" + 9.8)
        > (parse-item '(x y z))
        ("x" y z)
        > (parse-item '(x))
        ("x")
        > (parse-item '(#t 3 4))
        * item error
        > (parse-item '())
        * car: expects argument of type <pair>; given ()
        > (parse-item "hello")
        * car: expects argument of type <pair>; given "hello"

The problem in the last two cases is that the error wasn't generated by our code. Our code assumes that we are working with a list and that the list is not empty. In fact the error message gives us a great clue about how to fix this. The car function expects something of type "pair". So we can add an extra if that checks for this before we call car:

        (define (parse-item lst)
          (if (not (pair? lst))
              (error "item error")
              (let ((first (car lst)))
                (cond ((number? first) (cons (number->string first) (cdr lst)))
                      ((symbol? first) (cons (symbol->string first) (cdr lst)))
                      (else (error "item error"))))))

Then we looked at an expansion to the grammar that allows us to combine sequences of items separated by plus signs:

        <sequence> ::= <item> {"+" <item>}
        <item> ::= <number> | <symbol>

To make this work, we added a new function for parsing a sequence. We know that a sequence always begins with an item, so we begin by parsing an item:

        (define (parse-sequence lst)
          (let ((result (parse-item lst)))
            ...

Then what? The grammar says that this can be followed by zero or more occurrences of "+" <item>. If it's zero, then we're already done. If not, we have to keep parsing. How can we tell? We look to see if there is a plus. If so, we have to do some more parsing:

        (define (parse-sequence lst)
          (let ((result (parse-item lst)))
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                    ...

We included the test on length because we don't want to ask about the cadr unless we know that the list has a cadr. So if we see a plus, then what? We know that an <item> comes next, but if we call parse-item, then we need to deal with the possibility that there are even more occurrences of + <item> after that. A simpler solution is to call parse-sequence itself, so that it collapses all of these into a single string:

        (define (parse-sequence lst)
          (let ((result (parse-item lst)))
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                   (let ((result2 (parse-sequence (cddr result))))
                    ...

At this point, we have parsed the initial item, we have noticed a + after it, and we have parsed the sequence that comes after. So all we have to do is to put the two parts together as a single string. I said that for sequences, we want to collapse them by appending the strings with a comma in between:

        (define (parse-sequence lst)
          (let ((result (parse-item lst)))
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                   (let ((result2 (parse-sequence (cddr result))))
                     (cons (string-append (car result) "," (car result2))
                           (cdr result2))))
              ...

And we still need to think about the simple case where we didn't see a +:

        (define (parse-sequence lst)
          (let ((result (parse-item lst)))
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                   (let ((result2 (parse-sequence (cddr result))))
                     (cons (string-append (car result) "," (car result2))
                           (cdr result2))))
                  (else result))))
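
As an aside, the braces in the grammar say "zero or more", which can also be read as a loop rather than a recursive call. The following is only a sketch of that reading, not something we wrote in lecture; the name parse-sequence-loop is made up, and parse-item is repeated so the block stands on its own.

```scheme
;; parse-item as defined earlier in these notes.
(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

;; Hypothetical loop-based variant of parse-sequence.
;; result is always a list: the string built so far, then the
;; tokens that have not been consumed yet.
(define (parse-sequence-loop lst)
  (let loop ((result (parse-item lst)))
    (let ((remaining (cdr result)))
      (if (and (pair? remaining) (eq? '+ (car remaining)))
          ;; Consume the + and the item after it, then keep looping.
          (let ((next (parse-item (cdr remaining))))
            (loop (cons (string-append (car result) "," (car next))
                        (cdr next))))
          ;; No + at the front: the sequence is finished.
          result))))
```

Under these assumptions, (parse-sequence-loop '(x + y z)) should produce the same ("x,y" z) as the recursive version. The recursive parse-sequence above is the one used in the examples that follow.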

This worked fairly well:

        > (parse-sequence '(x y z))
        ("x" y z)
        > (parse-sequence '(x + y z))
        ("x,y" z)
        > (parse-sequence '(3.4 + x + 9.7 + 2.4 & 3.8 + 2.4))
        ("3.4,x,9.7,2.4" & 3.8 + 2.4)
        > (parse-sequence '(3.4 + 2.8 +))
        * item error
        > (parse-sequence '())
        * item error
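
The same shape carries over to the numeric goal described earlier, where a run of tokens is replaced by the value it evaluates to. The following is a minimal sketch under a made-up grammar, <sum> ::= <number> {"+" <number>}; the names parse-number and parse-sum are hypothetical and this is not the grammar for the assignment.

```scheme
;; Hypothetical grammar for this sketch only:
;;   <sum> ::= <number> {"+" <number>}

;; A number is already its own value, so the list comes back unchanged.
(define (parse-number lst)
  (if (and (pair? lst) (number? (car lst)))
      lst
      (error "number error")))

;; Same structure as parse-sequence, but the two parts are combined
;; with + instead of string-append.
(define (parse-sum lst)
  (let ((result (parse-number lst)))
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           (let ((result2 (parse-sum (cddr result))))
             (cons (+ (car result) (car result2))
                   (cdr result2))))
          (else result))))
```

With these definitions, (parse-sum '(2 + 2 + 7 rparen)) should consume the five leading tokens and return (11 rparen), leaving rparen for some other function to process.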

Stuart Reges

Last modified: Tue May 19 13:24:24 PDT 2009