A compiler front-end can be constructed systematically using the syntax of the language. In principle, we could give a single context-free grammar defining the language down to the character level. For example, a simple language for arithmetic expressions could be defined as follows.
expr ::= term | expr + term
term ::= factor | term * factor
factor ::= ( expr ) | id | int
id ::= ...
int ::= ...
We would need to add more rules to describe that an id, for example, consists of a letter and underscore followed by 0 or more letters, digits, and underscores. We'd need to do the same for int, and we'd also have to include rules defining comments and whitespace. Once that's done we could construct a complete parser for this concrete grammar.
But the result would be complex and slow. Better: separate the character handling details from the task of parsing the input tokens and determining the grammatical structure. Advantages:
The scanner is also usually responsible for reporting lexical errors in the input (for example a=!b; in a Java/C/C++ program, where =! is not a legal token).
Idea: We want a distinct token kind (lexical class) to represent each distinct terminal symbol that appears in the programming language's grammar. Tokens include:
if
, else
, while
, for
,
return
, void
, int
, ...). Each of these should also be a
distinct lexical class.Tokens are not character strings. They should be implemented as a data structure containing a small integer constant (or enumeration) to identify the lexical class, plus, in the case of numbers and identifiers, the actual number or identifier. (More details below.) Also, whitespace and comments are not included as tokens - the parser should never see these.
Normally, when there is a choice the scanner picks the longest possible string to match a token. For example, given return distance != total; in a program, the scanner should normally analyze this as 5 tokens (RETURN, ID(distance), NOTEQUAL, ID(total), SCOLON), rather than returning individual letters, parts of identifiers, or the ! and = as separate tokens.
The lexical grammar of most programming languages can be described by regular expressions, and tokens can be recognized by an appropriate finite automaton. The lexical grammar can either be used to guide construction of a hand-crafted scanner, or can be used as input to a scanner generator program like Lex or JLex (which we will not use in CSE 413).
Definitions: Regular expressions are defined over some alphabet Σ. If re is a regular expression, then L(re) is the language (set of strings) generated by that regular expression. The fundamental definitions are
re | L(re) | notes |
a | { a } | for each character a contained in Σ |
ε | { ε } | the null string |
Ø | { } | empty language |
If r and s are regular expressions, then the following operations can be applied to them
re | L(re) | notes |
rs | L(r)L(s) | concatenation |
r | s | L(r) ∪ L(s) | combination (union) |
r* | L(r)* | 0 or more occurrences (Kleene closure) |
These basic operations generate all regular expressions. There are, however, some convenient abbreviations that are commonly used.
abbreviation | meaning | notes |
r+ | (rr*) | 1 or more occurrences |
r? | (r | ε ) | 0 or 1 occurrence |
[a-z] | (a | b | ... | z) | any character in given range |
[abc] | (a | b | c) | any of the given characters |
Examples:
re | meaning |
+ | plus symbol |
!= | ! followed by = |
return | keyword return (6 adjacent characters) |
[a-zA-Z][a-zA-Z0-9_]* | typical programming language identifier |
Finally, some systems allow abbreviations of the form name ::= re to make it easier to write definitions.
Example: Regular expression description of numeric constants in a typical programming language (including integer and floating point).
digit ::= [0-9]
digits ::= digit+
number ::= digits ( . digits )? ( [eE] (+ | -)? digits )?
A finite automata can be constructed directly from this definition to recognize numeric constants.
What follows is a sketch of how a hand-written scanner might be implemented. A key decision is the representation of the lexical tokens. In Java, we can use something like the following class.
public class Token { // representation of a lexical token public int kind; // lexical class of this Token public int val; // if kind is INT, val = integer value public String id; // if kind is ID, id = the identifier // lexical classes public static final int EOF = 0; // end-of-file public static final int ID = 1; // identifier public static final int INT = 2; // integer constant public static final int LPAREN = 3; // ( public static final int RPAREN = 4; // ) public static final int SCOLON = 5; // ; public static final int RETURN = 6; // return keyword public static final int WHILE = 7; // while keyword ... // etc. }
The token class may well contain some other details, but it is
basically a simple data structure. The main difference when this is implemented
in C is that we would specify Token
as a struct and the individual
lexical classes as symbolic constants using #define
, and we would
need to allocate space for the identifier string, either statically in the token
(good enough for our project) or dynamically.
The key method needed in a scanner is one that yields the next token from
the input when it is called by the parser. Here is a sketch of how that could
look in Java, glossing over
all of the details of file handling, character input, etc., and taking liberties
with
syntax. A C version is quite similar, except that the result token would be a
local variable (struct
) instead of a newly allocated object on the heap.
public class Scanner { // lexical scanner object private char nextch; // = next unprocessed character from source program ... // Yield next token from input program public Token nextToken( ) { Token result = new Token( ); // result Token (local variable in C) advance nextch past any whitespace and comments; switch(nextch) { case '(': result.kind = Token.LPAREN; advance nextch; break; case ')': result.kind = Token.RPAREN; advance nextch; break; case '!': // ! or != advance nextch; if (nextch == '=') { result.kind = Token.NOTEQUAL; advance nextch; } else result.kind = Token.NOT; break; ... case '0': case '1': ... case '9': // integer constant read remaining digits into a String s; result.kind = Token.INT; result.val = Integer(s).intValue(); // use atoi in C for conversion break; case 'a': case 'b': ... case 'z': case 'A': case 'B': ... case 'Z': // identifier or keyword read rest of identifier into a String s; if (s is a keyword) result.kind = Token.keywordkind; else { result.kind = Token.ID; result.id = s; } break; } // end switch return result; } // end nextToken }