CSE341 Notes for Wednesday, 5/8/24

I mentioned that I am planning an assignment where we will parse infix expressions like those you would find in procedural languages like Java and Python.

The technique we will be using is known as recursive descent parsing. You can think of it as a poor man's approach to compiling. The idea is to express a grammar for the language as a series of BNF productions and then to write a separate recursive procedure for each production. BNF grammars tend to be mutually recursive, so our Scheme procedures will tend to be mutually recursive.

The assignment is going to involve evaluating expressions the way that IDLE, the interactive shell that ships with Python, processes them. For example, you might ask IDLE what it makes of:

        >>> 2 + 3 - 4 * 8 ** 7 - 5---2
        -8388610
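
To see where that value comes from, it can help to write out the grouping that Python's precedence and associativity rules imply. The following is only a sketch; the names implicit and explicit are made up for illustration:

```python
# Python's rules: ** binds tightest, then unary minus, then *,
# then binary + and - (which group left to right).
implicit = 2 + 3 - 4 * 8 ** 7 - 5---2

# The same expression with every grouping spelled out.
# Note that "5---2" is read as 5 - (-(-2)).
explicit = (((2 + 3) - (4 * (8 ** 7))) - 5) - (-(-2))

print(implicit, explicit)
```

Both expressions evaluate to the same number, which is the value shown in the session above.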

The idea will be to process a list of tokens and collapse them into one value, as the interpreter does. For the expression above, we would want to have a Scheme list with each token:

        '(2 + 3 - 4 * 8 ** 7 - 5 - - - 2)

I mentioned that we will want to write various parsing procedures that process different parts of this list of tokens. In our case, we'll assume that the tokens you want to pay attention to are at the front of the list. As tokens are processed, we remove them from the front of the list. We refer to this as consuming tokens.

Each parsing function that we write will be greedy in the sense that it will try to consume as much input as it can from the list and it will return a list with the tokens it consumed replaced by the value it has computed. For example, if you had a parsing function that would process addition, you can imagine that when passed the list above it would replace the tokens (2 + 3) with the value 5:

        '(5 - 4 * 8 ** 7 - 5 - - - 2)
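
The idea of consuming tokens from the front of a list can be sketched in Python as well. This is a toy illustration, not the assignment's solution: consume_addition is a hypothetical helper that handles exactly one leading addition of two integers and ignores the precedence of whatever follows.

```python
def consume_addition(tokens):
    """Collapse a leading `number + number` into its sum (hypothetical helper)."""
    if (len(tokens) >= 3
            and isinstance(tokens[0], int)
            and tokens[1] == "+"
            and isinstance(tokens[2], int)):
        # Replace the three consumed tokens with the computed value.
        return [tokens[0] + tokens[2]] + tokens[3:]
    # Nothing to collapse: hand the list back unchanged.
    return tokens

tokens = [2, "+", 3, "-", 4, "*", 8, "**", 7, "-", 5, "-", "-", "-", 2]
print(consume_addition(tokens))
```

The returned list starts with 5 and keeps every unconsumed token in its original order, which mirrors the Scheme list shown above.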

I asked people what kind of issues come up when you think about processing tokens like this. Someone mentioned that precedence matters (which operators are evaluated first). Another issue has to do with whether operators evaluate left-to-right or right-to-left. The most common convention is to evaluate left-to-right, as with this example from Python for evaluating the subtraction operator:

        >>> 7 - 4 - 2
        1

But exponentiation evaluates right-to-left in Python:

        >>> 2 ** 3 ** 4
        2417851639229258349412352
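
A quick way to confirm which grouping Python uses is to compare each expression with its explicitly parenthesized forms. The variable names below are illustrative only:

```python
# Subtraction groups left to right; exponentiation groups right to left.
left_assoc = (7 - 4) - 2          # how Python reads 7 - 4 - 2
not_used_sub = 7 - (4 - 2)        # the grouping Python does NOT use

right_assoc = 2 ** (3 ** 4)       # how Python reads 2 ** 3 ** 4
not_used_pow = (2 ** 3) ** 4      # the grouping Python does NOT use

print(7 - 4 - 2 == left_assoc)    # True
print(2 ** 3 ** 4 == right_assoc) # True
print(not_used_sub, not_used_pow)
```

The two rejected groupings give 5 and 4096, so the choice of direction changes the answer in both cases.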

We are going to work with a small grammar and write parsing functions for it. This will serve as a medium hint for the next homework. I said that I am imagining that we have two binary string operators that turn two strings into one in the following manner:

        s1 + s2 -> "(s1, s2)"
        s1 * s2 -> "[s1-s2]"

We wrote functions that would produce the appropriate strings given two strings:

        (define (plus s1 s2)
          (string-append "(" s1 ", " s2 ")"))
        
        (define (times s1 s2)
          (string-append "[" s1 "-" s2 "]"))

We have two operators and we will have a different grammar rule for each:

        <term> ::= <factor> | <factor> * <term>
        <factor> ::= <string> | <string> + <factor>

The first rule says that a term is either a factor or it is a factor followed by * followed by a term. The second rule says that a factor is either a string or it is a string followed by + followed by a factor. I mentioned that with the grammar written this way, we end up getting right-to-left evaluation, as you can see in this parse tree that shows how you would determine that "a + b + c" is a legal factor:

                    <factor>
                  /    |     \
               /       |        \
            /          |           \
        <string>       |         <factor>
            |          |        /    |   \
            |          |      /      |    \
            |          |  <string>   |   <factor>
            |          |      |      |      |
            |          |      |      |      |
            |          |      |      |   <string>
            |          |      |      |      |
            a          +      b      +      c

As you can see in the tree, (b + c) is grouped together before we incorporate a. Another thing to notice about our grammar is that + has a higher precedence than * because it is lower in the grammar (closer to an atomic value). We group strings together using + to form factors and then we group factors together using * to form terms.
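
The groupings implied by the grammar can be written out by hand as nested calls. The sketch below ports the plus and times helpers to Python purely for illustration (the class code is in Scheme) and applies them in the order the grammar dictates:

```python
def plus(s1, s2):
    # Python port of the Scheme plus helper: s1 + s2 -> "(s1, s2)"
    return "(" + s1 + ", " + s2 + ")"

def times(s1, s2):
    # Python port of the Scheme times helper: s1 * s2 -> "[s1-s2]"
    return "[" + s1 + "-" + s2 + "]"

# a + b + c: right-to-left, so (b + c) is formed first.
print(plus("a", plus("b", "c")))    # (a, (b, c))

# a + b * c: + binds tighter than *, so (a + b) is formed first.
print(times(plus("a", "b"), "c"))   # [(a, b)-c]

# a * b + c: again + binds tighter, so (b + c) is formed first.
print(times("a", plus("b", "c")))   # [a-(b, c)]
```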

So our first goal is to write a function called parse-factor that will implement this grammar rule:

        <factor> ::= <string> | <string> + <factor>

We want it to consume as many tokens as possible, evaluating the + operator to collapse the list towards becoming a single string. Someone mentioned that we might be passed an empty list. We could use null? to test this, but I decided instead to call pair? which is the usual way to test for a nonempty list. It is also clear that the list should begin with a string, so we can test to make sure that is true as well. If either of those things is not true, we can generate an error message:

        (define (parse-factor lst)
          (if (not (and (pair? lst) (string? (car lst))))
              (error "invalid syntax")

Now what? We have to think about the two forms a factor can take. It might be just a string or it might be a string followed by + followed by another factor. We introduced a test for this second case and introduced some named variables for the first element of the original list and the rest of that list:

              (let ([first (car lst)]
                    [rest (cdr lst)])
                (if (and (pair? rest) (eq? (car rest) '+))

What do we do in this case? We know that we have to process a plus operator. We have the first string to use. The grammar rule says that in the second form, the string is followed by + and then by a factor. In recursive-descent parsing, whenever you come across a rule defined in terms of another rule, you call that other function. In this case it would be parse-factor calling parse-factor:

        (let ([result (parse-factor (cdr rest))]
        ...

We call parse-factor on the cdr of rest because we want to skip the +. Then what? If our parse-factor function works, then it should consume any tokens that involve strings and plus at the front of the list and replace them with a single string. Now we have our two strings and we can use them to define a variable for the text we want to include at the front of the list:

        (let* ([result (parse-factor (cdr rest))]
               [text (plus first (car result))])

We had to change the let to a let* because we are using result to compute text. What do we do with it? The variable text stores our collapsed string. All we have to do is put this back at the front of the list with anything that might have come after it:

                      
        (cons text (cdr result)))

We had one other case to consider. We formed this intricate if for the case where we have a plus operator to process. But remember that the grammar says that it could be just a simple string. In that case, we can just return the original list because we don't need to collapse anything. Putting this all together we ended up with:

        (define (parse-factor lst)
          (if (not (and (pair? lst) (string? (car lst))))
              (error "invalid syntax")
              (let ([first (car lst)]
                    [rest (cdr lst)])
                (if (and (pair? rest) (eq? (car rest) '+))
                    (let* ([result (parse-factor (cdr rest))]
                           [text (plus first (car result))])
                      (cons text (cdr result)))
                    lst))))

I included three lists to use for testing:

        (define test1 '("a" + "b" + "c" + "d"))
        (define test2 '("a" * "b" * "c" * "d"))
        (define test3 '("a" + "b" * "c" * "d" + "e" * "f" * "g" + "h"))

We saw the expected behavior for each:

        > (parse-factor test1)
        '("(a, (b, (c, d)))")
        > (parse-factor test2)
        '("a" * "b" * "c" * "d")
        > (parse-factor test3)
        '("(a, b)" * "c" * "d" + "e" * "f" * "g" + "h")

With test1 we have nothing but plus operators, so the entire list is collapsed to one string. For test2, it has no plus operator, so there is nothing for parse-factor to do. For test3 we collapse the plus operator at the front of the list, but we don't collapse the later ones. At least not yet. Remember that you are always working with tokens at the front of the list. Other parsing functions process other parts of the list and might make multiple calls on parse-factor to have it deal with plus operators that appear in different parts of the list.
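
As a cross-check on the logic, here is a rough Python port of parse-factor. It is a sketch under stated assumptions, not part of the class code: operators are represented as the plain strings "+" and "*" (the Scheme version uses symbols), any other string counts as an operand, and the names OPERATORS and is_operand are invented for this example.

```python
def plus(s1, s2):
    # Python port of the Scheme plus helper.
    return "(" + s1 + ", " + s2 + ")"

# Assumption: operators are the strings "+" and "*"; every other string is an operand.
OPERATORS = {"+", "*"}

def is_operand(token):
    return isinstance(token, str) and token not in OPERATORS

def parse_factor(tokens):
    # <factor> ::= <string> | <string> + <factor>
    if not tokens or not is_operand(tokens[0]):
        raise SyntaxError("invalid syntax")
    first, rest = tokens[0], tokens[1:]
    if rest and rest[0] == "+":
        # Skip the + and let the recursive call collapse the factor that follows.
        result = parse_factor(rest[1:])
        return [plus(first, result[0])] + result[1:]
    # A lone string: nothing to collapse.
    return tokens

test1 = ["a", "+", "b", "+", "c", "+", "d"]
test3 = ["a", "+", "b", "*", "c", "*", "d", "+", "e", "*", "f", "*", "g", "+", "h"]
print(parse_factor(test1))
print(parse_factor(test3))
```

On these inputs the port collapses the same tokens as the Scheme version: all of test1, and only the leading "a" + "b" of test3. An empty list, a list that starts with an operator, or a list that ends with a dangling + raises SyntaxError, matching the cases where the Scheme version signals "invalid syntax".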

I said that in the next lecture we would write a function parse-term and we would explore the issue of right-to-left evaluation versus left-to-right evaluation of operators.

Stuart Reges

Last modified: Wed May 8 14:25:31 PDT 2024