CSE413 Notes for Wednesday, 2/21/24

I mentioned that I am planning an assignment where we will parse infix expressions like those you would find in procedural languages like Java and Python.

The technique we will be using is known as recursive descent parsing. You can think of it as a poor man's approach to compiling. The idea is to express a grammar for the language as a series of BNF productions and then to write a separate procedure for each production. BNF grammars tend to be mutually recursive, so our Scheme procedures will tend to be mutually recursive.

I spent a few minutes discussing some aspects of this task that may seem a bit odd. Consider, for example, this bit of a Java program:

        for (int i=2*3/4 + 2+7; i*x <= 3.7 * y; i = i*3+7)

Suppose you wanted to write a Java interpreter. You would have to somehow process that input. We begin by tokenizing it. So imagine turning this into a list of tokens in a Scheme list:

        (for ( int i = 2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 ) )

Because parentheses are so integral to Scheme syntax, I said that we'll be replacing them with tokens lparen and rparen. So the list above would really be turned into:

        (for lparen int i = 2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )

I mentioned that we will want to write various parsing procedures that process different parts of this list of tokens. In our case, we'll assume that the tokens you want to pay attention to are at the front of the list. As tokens are processed, we remove them from the front of the list. We refer to this as consuming tokens. So imagine that we have consumed tokens up to the first numerical expression in the list above:

        (2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
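
As a rough illustration of what "consuming" means, here is a small sketch that builds the token list as quoted data and drops the first five tokens. It assumes Racket, where a bare semicolon starts a comment, so the semicolon token is written as |;|. The names tokens and remaining are made up for this example.

```racket
; The full token list for the for-loop header. The ; token is written |;|
; because a bare semicolon would begin a comment.
(define tokens
  '(for lparen int i = 2 * 3 / 4 + 2 + 7 |;| i * x <= 3.7 * y |;| i = i * 3 + 7 rparen))

; Consuming the five tokens  for lparen int i =  leaves the numerical
; expression at the front of the list.
(define remaining (list-tail tokens 5))

(car remaining)    ; the next token to process: 2
(cadr remaining)   ; the token after it: *
```

The parsing procedures in these notes never look further ahead than the first couple of tokens, which is why keeping the unprocessed input at the front of the list is convenient.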

How would we go about processing a numerical expression? We see a 2 at the front of the list and that's an expression, so we might conclude that this expression evaluates to 2. Someone said that's not right because there are other tokens that are part of the expression. This is an important point to understand. We want to write parsing routines that are greedy in the sense that they will consume as much input as they can that appears at the front of the list. So we want to consume all of these tokens:

        (2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
         ~~~~~~~~~~~~~~~~~

For the programming assignment I am asking you to write procedures that will consume a sequence like this and replace it with a single value: the result of evaluating the expression. In Java, this expression evaluates to 10, so we'd want our parsing procedure to replace these tokens with the value 10:

        (10 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )

Notice that it leaves the rest of the input in the list to be processed by other procedures.
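
To see where the 10 comes from, the arithmetic can be checked by hand. This is only a sketch of the expected result, not a parser: it assumes Racket, uses quotient to stand in for Java's integer division, and writes the semicolon token as |;| because a bare semicolon starts a comment. The names value and after are made up for this example.

```racket
; In Java, 2*3/4 is (2*3)/4 with int operands, so 6/4 truncates to 1.
; Then 1 + 2 + 7 gives 10.
(define value (+ (quotient (* 2 3) 4) 2 7))

; The token list we want once the expression has been replaced by its value.
(define after
  (cons value '(|;| i * x <= 3.7 * y |;| i = i * 3 + 7 rparen)))

value         ; 10
(car after)   ; 10, followed by the untouched rest of the input
```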

I went through an example that I said would serve as a medium hint for the programming assignment. I said it was a bit silly, but it would be a helpful way to understand the assignment. In our sample parsing task, we are going to replace various tokens with a single string value. The lowest level production rule in our grammar is for something called an <item>:

        <item> ::= <number> | <symbol>

In this case, we simply replace a number or symbol with its string equivalent. Our first attempt was this:

        (define (parse-item lst)
          (let ((first (car lst)))
            (cond ((number? first) (cons (number->string first) (cdr lst)))
                  ((symbol? first) (cons (symbol->string first) (cdr lst)))
                  (else (error "item error")))))

The grammar doesn't describe anything other than a number or symbol, but we added the final clause because we want to make sure that we produce an error message if we see something unexpected. In this case, if we were expecting an item and we see something else, we should complain.

We found that this worked fairly well in most cases:

        > (parse-item '(3.4 + 9.8))
        '("3.4" + 9.8)
        > (parse-item '(x y z))
        '("x" y z)
        > (parse-item '(x))
        '("x")
        > (parse-item '(#t 3 4))
        * item error
        > (parse-item '())
        car: contract violation
          expected: pair?
          given: '()
        > (parse-item "hello")
        car: contract violation
          expected: pair?
          given: "hello"

The problem in the last two cases is that the error wasn't generated by our code. Our code assumes that we are working with a list and that the list is not empty. In fact the error message gives us a great clue about how to fix this. The car procedure expects something of type "pair". So we can add an extra if that checks for this before we call car:

        (define (parse-item lst)
          (if (not (pair? lst))
              (error "item error")
              (let ((first (car lst)))
                (cond ((number? first) (cons (number->string first) (cdr lst)))
                      ((symbol? first) (cons (symbol->string first) (cdr lst)))
                      (else (error "item error"))))))
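
One way to confirm that both problem cases now produce our own message is to catch the error and look at its text. The sketch below assumes Racket (it relies on with-handlers and exn-message), and error-message-of is a hypothetical helper that is not part of the lecture code.

```racket
(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

; Hypothetical helper: call thunk and, if it raises an error,
; return the error's message string instead of stopping.
(define (error-message-of thunk)
  (with-handlers ([exn:fail? exn-message])
    (thunk)))

(error-message-of (lambda () (parse-item '())))       ; "item error"
(error-message-of (lambda () (parse-item "hello")))   ; "item error"
(error-message-of (lambda () (parse-item '(#t 3 4)))) ; "item error"
```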

Then we looked at an expansion to the grammar that allows us to combine sequences of items separated by plus signs:

        <sequence> ::= <item> {"+" <item>}
        <item> ::= <number> | <symbol>

To make this work, we added a new procedure for parsing a sequence. We know that a sequence always begins with an item, so we begin by parsing an item:

        (define (parse-sequence lst)
          (let ([result (parse-item lst)])
           ...

Then what? The grammar says that this can be followed by zero or more occurrences of "+" <item>. If it's zero, then we're already done. If not, we have to keep parsing. How can we tell? We look to see if there is a plus. If so, we have to do some more parsing:

        (define (parse-sequence lst)
          (let ([result (parse-item lst)])
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                    ...

We included the test on length because we don't want to ask about the cadr unless we know that the list has a cadr. So if we see a plus, then what? We know that an <item> comes next, but if we call parse-item, then we need to deal with the possibility that there are even more occurrences of + <item> after that. A simpler solution is to call parse-sequence itself, so that it collapses all of these into a single string:

        (define (parse-sequence lst)
          (let ([result (parse-item lst)])
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                   (let ([result2 (parse-sequence (cddr result))])
                    ...

At this point, we have parsed the initial item, we have noticed a + after it, and we have parsed the sequence that comes after. So all we have to do is to put the two parts together as a single string. I said that for sequences, we want to collapse them by appending the strings with a comma in between and surrounding the result with parentheses:

        (define (parse-sequence lst)
          (let ([result (parse-item lst)])
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                   (let ([result2 (parse-sequence (cddr result))])
                     (cons (string-append "(" (car result) "," (car result2) ")")
                           (cdr result2))))
              ...

And we still need to think about the simple case where we didn't see a +:

        (define (parse-sequence lst)
          (let ([result (parse-item lst)])
            (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
                   (let ([result2 (parse-sequence (cddr result))])
                     (cons (string-append "(" (car result) "," (car result2) ")")
                           (cdr result2))))
                  (else result))))

This worked fairly well:

        > (parse-sequence '(x y z))
        '("x" y z)
        > (parse-sequence '(x + y z))
        '("(x,y)" z)
        > (parse-sequence '(3.4 + x + 9.7 + 2.4 & 3.8 + 2.4))
        '("(3.4,(x,(9.7,2.4)))" & 3.8 + 2.4)
        > (parse-sequence '(3.4 + 2.8 +))
        * item error
        > (parse-sequence '())
        * item error
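
A side note on the length test: length walks the entire token list each time it is called, even though we only need to know whether a second token exists. A possible alternative, sketched here under the assumption that parse-item always returns a non-empty list (it does in the version that checks pair?), is to test the cdr directly. The helper next-token-is? is a made-up name, not something from lecture.

```racket
(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

; Hypothetical helper: is there a second token, and is it sym?
(define (next-token-is? sym lst)
  (and (pair? lst) (pair? (cdr lst)) (eq? sym (cadr lst))))

; Same behavior as the parse-sequence above, without calling length.
(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (if (next-token-is? '+ result)
        (let ([result2 (parse-sequence (cddr result))])
          (cons (string-append "(" (car result) "," (car result2) ")")
                (cdr result2)))
        result)))

(parse-sequence '(x + y z))   ; '("(x,y)" z)
```

Either test works; the difference only matters for long token lists.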

We then extended the grammar to have a third kind of value that I called "options" that are separated by & characters:

        <options> ::= <sequence> {"&" <sequence>}
        <sequence> ::= <item> {"+" <item>}
        <item> ::= <number> | <symbol>

We used parse-sequence as a model to write a procedure parse-options. One of the nice things about recursive descent parsing is that the code comes directly from the grammar. We don't have to think about how to untangle the recursive relationships; we just mirror the grammar. Because "options" is defined in terms of "sequence", we have parse-options call parse-sequence. And because "sequence" is defined in terms of "item", we have parse-sequence call parse-item.

I said that for the options operator, we will replace it with -- so that we can distinguish it from the commas we used for sequences. This is the procedure we wrote rather quickly for options because the task is so similar to the sequence problem that we already solved:

        (define (parse-options lst)
          (let ([result (parse-sequence lst)])
            (cond ((and (> (length result) 1) (eq? '& (cadr result)))
                   (let ([result2 (parse-options (cddr result))])
                     (cons (string-append "(" (car result) "--" (car result2) ")")
                           (cdr result2))))
                  (else result))))

It worked well:

        > (parse-options '(3.4 + x + 9.7 + 2.4 & 3.8 + 2.4 & z & 4 + 5))
        '("((3.4,(x,(9.7,2.4)))--((3.8,2.4)--(z--(4,5))))")
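
As an aside, because parse-sequence and parse-options differ only in the operand parser, the operator token, and the separator string, the shared pattern could be factored into one higher-order procedure. This is a sketch of that idea, not what we wrote in lecture; make-infix-parser is a hypothetical name, and the code assumes that every operand parser returns a non-empty list.

```racket
(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

; Hypothetical helper: builds a parser that reads one operand and, if the
; token op follows, recursively parses the rest and joins the two strings
; with separator. Grouping is right to left, as in the lecture versions.
(define (make-infix-parser parse-operand op separator)
  (define (parse lst)
    (let ([result (parse-operand lst)])
      (if (and (pair? (cdr result)) (eq? op (cadr result)))
          (let ([result2 (parse (cddr result))])
            (cons (string-append "(" (car result) separator (car result2) ")")
                  (cdr result2)))
          result)))
  parse)

(define parse-sequence (make-infix-parser parse-item '+ ","))
(define parse-options  (make-infix-parser parse-sequence '& "--"))

(parse-options '(x + y & z))   ; '("((x,y)--z)")
```

Writing the procedures out separately, as we did, keeps the correspondence with the grammar more visible, which is the main point of the exercise.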

I mentioned several important things to notice about this output from parse-options. First, the sequences have a higher level of precedence than the options: we group together subexpressions like "(9.7,2.4)" before we group things with the -- operator. In other words, the structure of the grammar establishes higher and lower precedence, with the rules closer to <item> binding more tightly. Another thing to notice is that this code groups operators from right to left. So we group "9.7 + 2.4" first rather than grouping "3.4 + x" first. In the next lecture we will discuss how to have the operators group from left to right instead, which is the more common convention in languages like Java and Python.
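
Both observations are easier to see on small inputs. The block below repeats the three procedures from these notes, assuming Racket, and runs them on short token lists chosen for this illustration.

```racket
(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           (let ([result2 (parse-sequence (cddr result))])
             (cons (string-append "(" (car result) "," (car result2) ")")
                   (cdr result2))))
          (else result))))

(define (parse-options lst)
  (let ([result (parse-sequence lst)])
    (cond ((and (> (length result) 1) (eq? '& (cadr result)))
           (let ([result2 (parse-options (cddr result))])
             (cons (string-append "(" (car result) "--" (car result2) ")")
                   (cdr result2))))
          (else result))))

; Precedence: + binds more tightly than &, whichever side it appears on.
(parse-options '(a + b & c))    ; '("((a,b)--c)")
(parse-options '(a & b + c))    ; '("(a--(b,c))")

; Grouping: the rightmost + is grouped first.
(parse-sequence '(a + b + c))   ; '("(a,(b,c))")
```

A left-to-right grouping of the last example would instead produce "((a,b),c)".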

Stuart Reges

Last modified: Sat Feb 24 11:20:04 PST 2024