The syntax of most programming languages can be described by a context-free grammar. The task of parsing is, given a grammar G and a sentence w generated by that grammar, traverse the derivation (parse) tree for w in some standard order and do something useful at each node. (Note: the tree may not be explicitly present, but the control flow of a parser will correspond to a tree traversal.)
For practical reasons, we want the parser to be deterministic (no backtracking) and we want to examine the source program left-to-right. The two common orderings are
For our project, we'll use top-down, recursive descent parsing. Most compiler generator tools, particularly classic ones like YACC, Bison, and CUP, use bottom-up LALR(k).
At each point (node) in the traversal, perform some semantic action.
The fundamental building block is a production or rule:
nonTerminal ::= sequence of terminal and non-terminal symbols
More formally, a grammar G is a tuple <N, Σ, P, S> where
Standard notations
Definitions (derivation relations)
Definitions (languages)
Def. Grammar G is reduced iff for every production A ::= α in G, there is a derivation S =>* xAz =>* xαz =>* xyz . (i.e., no production is useless)
Convention: We will only used reduced grammars.
Finding a derivation of a sentence w is called parsing w.
General situation: we have completed part of the derivation
S =>* w A α =>* wxy
We are currently at a node corresponding to A.
Basic step. Pick some production
A ::= β1 β2 ... βn
that will properly expand A so that the β's match xy either directly or recursively as the β's are expanded.
Trick: Pick correct production deterministically. How? Need to pick from all possible productions that expand A by looking only at the very next symbol(s), i.e., the beginning of xy.
Ex. Suppose we are at the the non-terminal stmt and we have the following grammar for statements:
stmt ::= return
exp;
| while
(
exp)
stmt | if
(
exp)
stmt
If the remainder of the input, xy, is the sequence of tokens WHILE
,
LPAREN
, ID(x)
, LESS
, ID(y)
, RPAREN
,
..., we should expand stmt using the second alternative.
Def. A LL(k) parser finds a Leftmost derivation of a string w by scanning w from Left to right, looking ahead at most k symbols. Typically, 1-symbol lookahead is sufficient for most practical programming language grammars.
Def. Grammar G is unambiguous iff every w in L(G) has a unique leftmost derivation (equivalently, has a unique rightmost derivation).
Ex. This grammar is ambiguous
x
ifthen
S else
S ifthen
SProof by example: show two distinct parse trees for "ifthen ifthen x
else x
".
Fix: Alter grammar so that S appearing before else can only
derive
complete ifthen-else
pairs (e.g., the grammar for full Java).
(A recursive-descent compiler, like the one in our project, can work around this
ambiguity in other ways. More below.)
Ex. Ambiguous grammar for expressions
+
E | E *
E |
a
| (E)Exercise: give an example that shows this is ambiguous
Fix: Add productions to incorporate precedence and associatively into the grammar.
+
T | T *
F | F a
| (
E)
A recursive-descent parser can be constructed for many grammars by writing a recursive procedure to recognize each non-terminal. Each of these parser procedures will have separate clauses for each of the rules in the grammar for the associated non-terminal. Example. Suppose we have the following grammar for statements
while
(
exp)
stmt if
(
exp)
stmt return
exp;
A recursive-descent parser for this grammar might look something like this.
// parse stmt ::= whilestmt | ifstmt | returnstmt stmt( ) { switch (nextToken.kind) { case TOK_WHILE: whileStmt( ); break; case TOK_IF: ifStmt( ); break; case TOK_RETURN: returnStmt( ); break; } } // parse while ( exp ) stmt whileStmt( ) { skip past "while"; check for "(" and skip past it; exp( ); check for ")" and skip past it; stmt( ); }Functions for other kinds of statements would be similar. The key here is that we can decide what production to use by looking at the very next symbol (a keyword in this case).
The grammar normally has to be unambiguous to construct a recursive-descent parser, but sometimes ambiguities can be handled by ad hoc measures. The standard example is the conditional statement.
if
(
exp)
stmt if
(
exp)
stmt
else
stmtThis is ambiguous, but we can come up with a deterministic parser by always matching the longer form (with the else) if we have a choice. This gives the usual rule that an else pairs up with the closest unmatched if.
// parse if ( exp ) stmt [else stmt] ifStmt( ) { skip past "if" check for "(" and skip past it; exp( ); check for ")" and skip past it; stmt( ); if (next symbol is "else") { skip past "else"; stmt( ); } }
Another thing we need to watch for in a recursive-descent parser is left-recursion in grammar rules. The typical grammar for arithmetic expressions has this problem.
+
term | term *
factor | factor a
| (
expr)
The left recursion is used in the grammar to express that these operators are left-associative, but if we code up a recognizer by just transcribing the grammar rules, we get an infinite recursion.
expr( ) { expr( ); if (next symbol is "+") { skip past "+"; term( ); } }The trick here is to observe that the rule for expr generates the sequence
term +
term
+
... +
term
so we can replace the left recursion with a loop that parses that sequence:
expr( ) { term( ); while (next symbol is "+") { skip past "+"; term( ); } }
These notes treat grammar transformations very casually. There are precise ways of describing whether a predictive parser can be created for a grammar, and rules for transforming grammars to make them suitable. The dragon book (Aho, Sethi & Ullman) has an extended treatment, as do the other standard compiler texts.
The key to doing this precisely is to define a set FIRST(α) for the right-hand side of each production A ::= α. The idea behind this set is that it contains all of the terminal symbols that could appear at the beginning of the remaining input if we expand A into α and complete the rest of the derivation. If the right hand sides of all of the possible productions for A have distinct FIRST() sets, then we can look at the next symbol in the input and pick the production that contains that symbol in its FIRST() set.