The technique we will be using is known as recursive descent parsing. You can think of it as a poor man's approach to compiling. The idea is to express a grammar for the language as a series of BNF productions and then to write a separate procedure for each production. BNF grammars tend to be mutually recursive, so our Scheme procedures will tend to be mutually recursive as well.
I spent a few minutes discussing some aspects of this task that may seem a bit odd. Consider, for example, this bit of a Java program:
for (int i=2*3/4 + 2+7; i*x <= 3.7 * y; i = i*3+7)
Suppose you wanted to write a Java interpreter. You would have to somehow
process that input. We begin by tokenizing it. So imagine turning this into a
list of tokens in a Scheme list:
(for ( int i = 2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 ) )
Because parentheses are so integral to Scheme syntax, I said that we'll be
replacing them with tokens lparen and rparen. So the list above would really
be turned into:
(for lparen int i = 2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
I mentioned that we will want to write various parsing procedures that process
different parts of this list of tokens. In our case, we'll assume that the
tokens you want to pay attention to are at the front of the list. As tokens
are processed, we remove them from the front of the list. We refer to this as
consuming tokens. So imagine that we have consumed tokens up to the
first numerical expression in the list above:
(2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
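One wrinkle if you try typing this token list at a Racket prompt: a bare ; starts a comment in Scheme, so it cannot appear as-is inside a quoted list. Here is a minimal sketch, assuming we write the semicolon token as the symbol |;| instead. The names tokens and remaining are just for illustration and are not part of the assignment.

```racket
#lang racket

;; The tokenized for-loop header. The Java semicolon is written |;| because
;; a bare ; would begin a Scheme comment.
(define tokens
  '(for lparen int i = 2 * 3 / 4 + 2 + 7 |;|
    i * x <= 3.7 * y |;|
    i = i * 3 + 7 rparen))

;; "Consuming" the first five tokens (for lparen int i =) simply means
;; continuing with the rest of the list.
(define remaining (list-tail tokens 5))

(car remaining)   ; the next token to look at: 2
```

Because consuming is nothing more than taking a tail of the list, each parsing procedure can hand the unprocessed remainder to whichever procedure runs next.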
How would we go about processing a numerical expression? We see a 2 at the
front of the list and that's an expression, so we might conclude that this
expression evaluates to 2. Someone said that's not right because there are
other tokens that are part of the expression. This is an important point to
understand. We want to write parsing routines that are greedy in the
sense that they will consume as much input as they can that appears at the
front of the list. So we want to consume all of these tokens:
(2 * 3 / 4 + 2 + 7 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
 ~~~~~~~~~~~~~~~~~
For the programming assignment I am asking you to write procedures that will
consume a sequence like this and replace it with a single value: the result of
evaluating the expression. In Java, this expression evaluates to 10, so we'd
want our parsing procedure to replace these tokens with the value 10:
(10 ; i * x <= 3.7 * y ; i = i * 3 + 7 rparen )
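The value 10 depends on Java's integer division: 2*3 is 6, and 6/4 truncates to 1, leaving 1 + 2 + 7. As a quick check of the arithmetic only (this is not a sketch of the parsing procedure), note that Scheme's / on exact integers produces a fraction, so quotient stands in here for Java's int division. The names java-style and scheme-style are made up for this illustration.

```racket
#lang racket

;; Java: 2*3/4 + 2+7 with int operands.
;; quotient truncates the way Java's int division does.
(define java-style (+ (quotient (* 2 3) 4) 2 7))   ; 1 + 2 + 7 = 10

;; Plain Scheme division keeps the exact fraction instead.
(define scheme-style (+ (/ (* 2 3) 4) 2 7))        ; 3/2 + 9 = 21/2

java-style
scheme-style
```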
Notice that it leaves the rest of the input in the list to be processed by
other procedures.
I went through an example that I said would serve as a medium hint for the
programming assignment. I said it was a bit silly, but it would be a helpful
way to understand the assignment. In our sample parsing task, we are going to
replace various tokens with a single string value. The lowest level production
rule in our grammar is for something called an <item>:
<item> ::= <number> | <symbol>
In this case, we simply replace a number or symbol with its string equivalent.
Our first attempt was this:
(define (parse-item lst)
  (let ((first (car lst)))
    (cond ((number? first) (cons (number->string first) (cdr lst)))
          ((symbol? first) (cons (symbol->string first) (cdr lst)))
          (else (error "item error")))))
The grammar doesn't describe anything other than a number or symbol, but we
added the final clause because we want to make sure that we produce an error
message if we see something unexpected. In this case, if we were expecting an
item and we see something else, we should complain.
We found that this worked fairly well in most cases:
> (parse-item '(3.4 + 9.8))
'("3.4" + 9.8)
> (parse-item '(x y z))
'("x" y z)
> (parse-item '(x))
'("x")
> (parse-item '(#t 3 4))
* item error
> (parse-item '())
car: contract violation
expected: pair?
given: '()
> (parse-item "hello")
car: contract violation
expected: pair?
given: "hello"
The problem in the last two cases is that the error wasn't generated by our
code. Our code assumes that we are working with a list and that the list is
not empty. In fact the error message gives us a great clue about how to fix
this. The car procedure expects something of type "pair". So we can add an
extra if that checks for this before we call car:
(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))
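To confirm that the two troublesome cases now go through our own error, one option is to catch the exception and look at its message. This is a small sketch, assuming Racket; try-parse-item is a hypothetical helper added only for this demonstration.

```racket
#lang racket

(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

;; Hypothetical helper: return the parse result, or the error message
;; as a string if parse-item signals an error.
(define (try-parse-item x)
  (with-handlers ([exn:fail? exn-message])
    (parse-item x)))

(try-parse-item '())         ; "item error"
(try-parse-item "hello")     ; "item error"
(try-parse-item '(#t 3 4))   ; "item error"
(try-parse-item '(x y z))    ; '("x" y z)
```

All three bad inputs now produce the same message from our code rather than a contract violation from car.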
Then we looked at an expansion to the grammar that allows us to combine
sequences of items separated by plus signs:
<sequence> ::= <item> {"+" <item>}
<item> ::= <number> | <symbol>
To make this work, we added a new procedure for parsing a sequence. We know
that a sequence always begins with an item, so we begin by parsing an item:
(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    ...
Then what? The grammar says that this can be followed by zero or more
occurrences of "+" <item>. If it's zero, then we're already done. If not, we
have to keep parsing. How can we tell? We look to see if there is a plus. If
so, we have to do some more parsing:
(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           ...
We included the test on length because we don't want to ask about the cadr
unless we know that the list has a cadr. So if we see a plus, then what? We
know that an <item> comes next, but if we call parse-item, then we need to deal
with the possibility that there are even more occurrences of + <item> after
that. A simpler solution is to call parse-sequence itself, so that it
collapses all of these into a single string:
(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           (let ([result2 (parse-sequence (cddr result))])
             ...
At this point, we have parsed the initial item, we have noticed a + after it,
and we have parsed the sequence that comes after. So all we have to do is to
put the two parts together as a single string. I said that for sequences, we
want to collapse them by appending the strings with a comma in between and
surround it with parentheses:
(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           (let ([result2 (parse-sequence (cddr result))])
             (cons (string-append "(" (car result) "," (car result2) ")")
                   (cdr result2))))
          ...
And we still need to think about the simple case where we didn't see a +:
(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           (let ([result2 (parse-sequence (cddr result))])
             (cons (string-append "(" (car result) "," (car result2) ")")
                   (cdr result2))))
          (else result))))
This worked fairly well:
> (parse-sequence '(x y z))
'("x" y z)
> (parse-sequence '(x + y z))
'("(x,y)" z)
> (parse-sequence '(3.4 + x + 9.7 + 2.4 & 3.8 + 2.4))
'("(3.4,(x,(9.7,2.4)))" & 3.8 + 2.4)
> (parse-sequence '(3.4 + 2.8 +))
* item error
> (parse-sequence '())
* item error
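To see where the nesting in these results comes from, it can help to replay by hand what parse-sequence does on a short input. The sketch below repeats the two definitions and then performs each step separately; the names step1, step2, and step3 are only for illustration.

```racket
#lang racket

(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

(define (parse-sequence lst)
  (let ([result (parse-item lst)])
    (cond ((and (> (length result) 1) (eq? '+ (cadr result)))
           (let ([result2 (parse-sequence (cddr result))])
             (cons (string-append "(" (car result) "," (car result2) ")")
                   (cdr result2))))
          (else result))))

;; Replaying (parse-sequence '(x + y + z w)) one step at a time:

;; 1. Parse the leading item.
(define step1 (parse-item '(x + y + z w)))      ; '("x" + y + z w)

;; 2. A + follows, so skip it and parse the rest as a sequence.
(define step2 (parse-sequence (cddr step1)))    ; '("(y,z)" w)

;; 3. Combine the two strings and keep the unconsumed tokens.
(define step3
  (cons (string-append "(" (car step1) "," (car step2) ")")
        (cdr step2)))                           ; '("(x,(y,z))" w)

step3
```

The recursive call in step 2 finishes collapsing everything to the right before step 3 attaches the first item, which is why the parentheses nest toward the right.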
We then extended the grammar to have a third kind of value that I called
"options" that are separated by & characters:
<options> ::= <sequence> {"&" <sequence>}
<sequence> ::= <item> {"+" <item>}
<item> ::= <number> | <symbol>
We used parse-sequence as a model to write a procedure parse-options. One of
the nice things about recursive descent parsing is that the code comes
directly from the grammar. We don't have to think about how to solve
the recursive relations; we just mirror the grammar. Because
"options" is defined in terms of "sequence", we have parse-options
call parse-sequence. And because "sequence" is defined in terms of
"item", we have parse-sequence call parse-item.
I said that for the options operator, we will replace it with -- so that we
can distinguish it from the commas we used for sequences. This is the
procedure we wrote rather quickly for options because the task is so similar
to the sequence problem that we already solved:
(define (parse-options lst)
  (let ([result (parse-sequence lst)])
    (cond ((and (> (length result) 1) (eq? '& (cadr result)))
           (let ([result2 (parse-options (cddr result))])
             (cons (string-append "(" (car result) "--" (car result2) ")")
                   (cdr result2))))
          (else result))))
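Since parse-options and parse-sequence differ only in the sub-parser they call, the operator they look for, and the separator string they insert, the shared shape could be factored out. The following is one possible sketch, not what we wrote in lecture; make-infix-parser is a hypothetical helper, and its parameter names are assumptions made for this illustration.

```racket
#lang racket

(define (parse-item lst)
  (if (not (pair? lst))
      (error "item error")
      (let ((first (car lst)))
        (cond ((number? first) (cons (number->string first) (cdr lst)))
              ((symbol? first) (cons (symbol->string first) (cdr lst)))
              (else (error "item error"))))))

;; Hypothetical helper: build a parser for a production of the form
;;   <sub> {op <sub>}
;; that joins the collapsed pieces with the string sep.
(define (make-infix-parser parse-sub op sep)
  (define (parse lst)
    (let ([result (parse-sub lst)])
      (cond ((and (> (length result) 1) (eq? op (cadr result)))
             (let ([result2 (parse (cddr result))])
               (cons (string-append "(" (car result) sep (car result2) ")")
                     (cdr result2))))
            (else result))))
  parse)

;; Each level of the grammar is built from the level below it.
(define parse-sequence (make-infix-parser parse-item '+ ","))
(define parse-options (make-infix-parser parse-sequence '& "--"))

(parse-options '(a + b & c))   ; '("((a,b)--c)")
```

Writing the two procedures out separately, as we did, keeps the correspondence with the grammar easier to see; the helper only pays off once there are many levels of operators.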
It worked well:
> (parse-options '(3.4 + x + 9.7 + 2.4 & 3.8 + 2.4 & z & 4 + 5))
'("((3.4,(x,(9.7,2.4)))--((3.8,2.4)--(z--(4,5))))")
I mentioned several important things to notice about this. First, the
sequences have a higher level of precedence than the options. In
other words, we group together subexpressions like "(9.7,2.4)" before
we group things with the -- operator. The grammar itself is what
establishes higher and lower precedence: the lower-level production
(<sequence>) binds more tightly than the one defined in terms of it
(<options>). Another thing to notice is that this code groups
operators from right to left. So we group "9.7 + 2.4" first rather
than grouping "3.4 + x" first. In the next lecture we will discuss how
to have the operators group from left to right instead, which is the
more common convention in languages like Java and Python.