CSE 413 Autumn 2007

Notes on Context Free Grammars and Parsing

Overview

The syntax of most programming languages can be described by a context-free grammar. The task of parsing is, given a grammar G and a sentence w generated by that grammar, traverse the derivation (parse) tree for w in some standard order and do something useful at each node. (Note: the tree may not be explicitly present, but the control flow of a parser will correspond to a tree traversal.)

Standard order

For practical reasons, we want the parser to be deterministic (no backtracking) and we want to examine the source program left-to-right. The two common orderings are

Top-down: Start at goal symbol of the grammar; typically do a depth-first left-to-right traversal (leftmost derivation). [LL(k)]
Bottom-up: Start at leaves and traverse a rightmost derivation in reverse. [LR(k) and subsets like LALR(k)]

For our project, we'll use top-down, recursive descent parsing. Most compiler generator tools, particularly classic ones like YACC, Bison, and CUP, use bottom-up LALR(k).

Something useful

At each point (node) in the traversal, perform some semantic action.

Construct nodes of a full parse tree (rare)
Construct nodes of an abstract syntax tree (typical in production compilers)
Generate target code on the fly (1-pass compiler, not common in production compilers, but it's the method of choice for our project)
Write a linear (flattened) tree which will be used as input to next part of the compiler (mostly historical, needed to compile non-trivial programs on machines limited amounts of storage).

Context-Free Grammars

The fundamental building block is a production or rule:

nonTerminal ::= sequence of terminal and non-terminal symbols

More formally, a grammar G is a tuple <N, Σ, P, S> where

N is a finite set of nonterminal symbols
Σ is a finite set of terminal symbols (the alphabet)
P is a finite set of productions (a subset of N x (N ∪ Σ)* ), i.e., the rules of the grammar.
S is a distinguished element of N known as the start symbol. Convention: if the start symbol is not explicitly specified, the nonterminal on the left of the first rule in the grammar is taken to be the start symbol.

Standard notations

a, b, c elements of Σ
w, x, y, z elements of Σ*
A, B, C elements of N
X, Y, Z elements of N ∪ Σ
α, β, γ elements of (N ∪ Σ)*
A -> α or A ::= α if <A, α> is a member of P (productions)

Definitions (derivation relations)

α A γ => α β γ iff A ::= β in P (derives)
A =>* w if there is a chain of productions starting with A that generates w (transitive closure)
w A γ =>_(lm) w β γ iff A ::= β in P (derives leftmost)
α A w =>_(rm) α β w iff A ::= β in P (derives rightmost)

Definitions (languages)

For A in N, L(A) = { w | A =>* w}
If S is the start symbol of grammar G, we define L(G) = L(S)

Def. Grammar G is reduced iff for every production A ::= α in G, there is a derivation S =>* xAz =>* xαz =>* xyz . (i.e., no production is useless)

Convention: We will only used reduced grammars.

Parsing

Finding a derivation of a sentence w is called parsing w.

Top-down LL(k) parsing

General situation: we have completed part of the derivation

S =>* w A α =>* wxy

We are currently at a node corresponding to A.

Basic step. Pick some production

A ::= β₁ β₂ ... β_n

that will properly expand A so that the β's match xy either directly or recursively as the β's are expanded.

Trick: Pick correct production deterministically. How? Need to pick from all possible productions that expand A by looking only at the very next symbol(s), i.e., the beginning of xy.

Ex. Suppose we are at the the non-terminal stmt and we have the following grammar for statements:

stmt ::= return exp; | while (exp) stmt | if (exp) stmt

If the remainder of the input, xy, is the sequence of tokens WHILE, LPAREN, ID(x), LESS, ID(y), RPAREN, ..., we should expand stmt using the second alternative.

Def. A LL(k) parser finds a Leftmost derivation of a string w by scanning w from Left to right, looking ahead at most k symbols. Typically, 1-symbol lookahead is sufficient for most practical programming language grammars.

Ambiguity

Def. Grammar G is unambiguous iff every w in L(G) has a unique leftmost derivation (equivalently, has a unique rightmost derivation).

Ex. This grammar is ambiguous

S ::= x
S ::= ifthen S else S
S ::= ifthen S

Proof by example: show two distinct parse trees for "ifthen ifthen x else x".

Fix: Alter grammar so that S appearing before else can only derive complete ifthen-else pairs (e.g., the grammar for full Java). (A recursive-descent compiler, like the one in our project, can work around this ambiguity in other ways. More below.)

Ex. Ambiguous grammar for expressions

E ::= E + E | E * E | a | (E)

Exercise: give an example that shows this is ambiguous

Fix: Add productions to incorporate precedence and associatively into the grammar.

E ::= E + T | T
T ::= T * F | F
F ::= a | (E)

Recursive-Descent Parsers

A recursive-descent parser can be constructed for many grammars by writing a recursive procedure to recognize each non-terminal. Each of these parser procedures will have separate clauses for each of the rules in the grammar for the associated non-terminal. Example. Suppose we have the following grammar for statements

stmt ::= whilestmt | ifstmt | returnstmt
whilestmt ::= while (exp) stmt
ifstmt ::= if (exp) stmt
returnstmt ::= return exp;

A recursive-descent parser for this grammar might look something like this.

  // parse stmt ::=  whilestmt | ifstmt | returnstmt
  stmt( ) {
     switch (nextToken.kind) {
        case TOK_WHILE:  whileStmt( );
                         break;
        case TOK_IF:     ifStmt( );
                         break;
        case TOK_RETURN: returnStmt( );
                         break;
     }
  }
  
  // parse while ( exp ) stmt
  whileStmt( ) {
     skip past "while";
     check for "(" and skip past it;
     exp( );
     check for ")" and skip past it;
     stmt( );
  }

Functions for other kinds of statements would be similar. The key here is that we can decide what production to use by looking at the very next symbol (a keyword in this case).

The grammar normally has to be unambiguous to construct a recursive-descent parser, but sometimes ambiguities can be handled by ad hoc measures. The standard example is the conditional statement.

ifStmt ::= if (exp) stmt
ifStmt ::= if (exp) stmt else stmt

This is ambiguous, but we can come up with a deterministic parser by always matching the longer form (with the else) if we have a choice. This gives the usual rule that an else pairs up with the closest unmatched if.

  // parse if ( exp ) stmt [else stmt]
  ifStmt( ) {
      skip past "if"
      check for "(" and skip past it;
      exp( );
      check for ")" and skip past it;
      stmt( );
      if (next symbol is "else") {
         skip past "else";
         stmt( );
      }
   }

Left recursion

Another thing we need to watch for in a recursive-descent parser is left-recursion in grammar rules. The typical grammar for arithmetic expressions has this problem.

expr ::= expr + term | term
term ::= term * factor | factor
factor ::= a | (expr)

The left recursion is used in the grammar to express that these operators are left-associative, but if we code up a recognizer by just transcribing the grammar rules, we get an infinite recursion.

   expr( ) {
      expr( );
      if (next symbol is "+") {
         skip past "+";
         term( );
      }
   }

The trick here is to observe that the rule for expr generates the sequence

term + term + ... + term

so we can replace the left recursion with a loop that parses that sequence:

   expr( ) {
      term( );
      while (next symbol is "+") {
         skip past "+";
         term( );
      }
   }

Formalities

These notes treat grammar transformations very casually. There are precise ways of describing whether a predictive parser can be created for a grammar, and rules for transforming grammars to make them suitable. The dragon book (Aho, Sethi & Ullman) has an extended treatment, as do the other standard compiler texts.

The key to doing this precisely is to define a set FIRST(α) for the right-hand side of each production A ::= α. The idea behind this set is that it contains all of the terminal symbols that could appear at the beginning of the remaining input if we expand A into α and complete the rest of the derivation. If the right hand sides of all of the possible productions for A have distinct FIRST() sets, then we can look at the next symbol in the input and pick the production that contains that symbol in its FIRST() set.