CSE 413 Autumn 2006

Lexical Analysis

Overview

A compiler front-end can be constructed systematically using the syntax of the language. In principle, we could give a single context-free grammar defining the language down to the character level. For example, a simple language for arithmetic expressions could be defined as follows.

expr ::= term | expr + term
term ::= factor | term * factor
factor ::= ( expr ) | id | int
id ::= ...
int ::= ...

We would need to add more rules to describe that an id, for example, consists of a letter and underscore followed by 0 or more letters, digits, and underscores. We'd need to do the same for int, and we'd also have to include rules defining comments and whitespace. Once that's done we could construct a complete parser for this concrete grammar.

But the result would be complex and slow. Better: separate the character handling details from the task of parsing the input tokens and determining the grammatical structure. Advantages:

Simplicity. The parser sees a much simpler input language consisting of lexical tokens without extraneous details like comments and whitespace. The output from the scanner can include additional tokens that do not appear in the source language to help the parser. Example: an EOF token to indicate the end of input.
Speed. The lexical grammar of most programming languages can be described by regular expressions, and a scanner can use simple, fast finite automata techniques to recognize the tokens (with some ad hoc fooling around for languages with unusual lexical structure). Using a full context-free parser to read the characters is much slower.

Scanner

Input: Character stream representation of the program
Output: Token stream (often delivered on demand when parser calls "nextToken" routine in the scanner). Whitespace and comments are skipped and do not appear in the token stream.

The scanner is also usually responsible for reporting lexical errors in the input (for example a=!b; in a Java/C/C++ program, where =! is not a legal token).

Tokens

Idea: We want a distinct token kind (lexical class) to represent each distinct terminal symbol that appears in the programming language's grammar. Tokens include:

Operators and other punctuation (+ - * ( ) { } ; ...). Each of these is distinct lexical class
Keywords (if, else, while, for, return, void, int, ...). Each of these should also be a distinct lexical class.
Numeric constants. All integers are instances of a single lexical class. The token also needs to include the actual integer value so it can be processed later.
Identifiers. As with integers, there will be a single lexical class id for all non-keyword identifiers, but the token also needs to include the identifier as a character string for later processing.

Tokens are not character strings. They should be implemented as a data structure containing a small integer constant (or enumeration) to identify the lexical class, plus, in the case of numbers and identifiers, the actual number or identifier. (More details below.) Also, whitespace and comments are not included as tokens - the parser should never see these.

Principle of Longest Match

Normally, when there is a choice the scanner picks the longest possible string to match a token. For example, given return distance != total; in a program, the scanner should normally analyze this as 5 tokens (RETURN, ID(distance), NOTEQUAL, ID(total), SCOLON), rather than returning individual letters, parts of identifiers, or the ! and = as separate tokens.

Regular Expressions

The lexical grammar of most programming languages can be described by regular expressions, and tokens can be recognized by an appropriate finite automaton. The lexical grammar can either be used to guide construction of a hand-crafted scanner, or can be used as input to a scanner generator program like Lex or JLex (which we will not use in CSE 413).

Definitions: Regular expressions are defined over some alphabet Σ. If re is a regular expression, then L(re) is the language (set of strings) generated by that regular expression. The fundamental definitions are

re	L(re)	notes
a	{ a }	for each character a contained in Σ
ε	{ ε }	the null string
Ų	{ }	empty language

If r and s are regular expressions, then the following operations can be applied to them

re	L(re)	notes
rs	L(r)L(s)	concatenation
r \| s	L(r) ∪ L(s)	combination (union)
r*	L(r)*	0 or more occurrences (Kleene closure)

These basic operations generate all regular expressions. There are, however, some convenient abbreviations that are commonly used.

abbreviation	meaning	notes
r+	(rr*)	1 or more occurrences
r?	(r \| ε )	0 or 1 occurrence
[a-z]	(a \| b \| ... \| z)	any character in given range
[abc]	(a \| b \| c)	any of the given characters

Examples:

re	meaning
+	plus symbol
!=	! followed by =
return	keyword return (6 adjacent characters)
[a-zA-Z][a-zA-Z0-9_]*	typical programming language identifier

Finally, some systems allow abbreviations of the form name ::= re to make it easier to write definitions.

Example: Regular expression description of numeric constants in a typical programming language (including integer and floating point).

digit ::= [0-9]
digits ::= digit+
number ::= digits ( . digits )? ( [eE] (+ | -)? digits )?

A finite automata can be constructed directly from this definition to recognize numeric constants.

Token Representation

What follows is a sketch of how a hand-written scanner might be implemented. A key decision is the representation of the lexical tokens. In Java, we can use something like the following class.

   public class Token {   // representation of a lexical token
      public int kind;    // lexical class of this Token
      public int val;     // if kind is INT, val = integer value
      public String id;   // if kind is ID, id = the identifier

      // lexical classes
      public static final int EOF = 0;    // end-of-file
      public static final int ID  = 1;    // identifier
      public static final int INT = 2;    // integer constant
      public static final int LPAREN = 3; // (
      public static final int RPAREN = 4; // )
      public static final int SCOLON = 5; // ;
      public static final int RETURN = 6; // return keyword
      public static final int WHILE = 7;  // while keyword
      ...                                 // etc.
   }

The token class may well contain some other details, but it is basically a simple data structure. The main difference when this is implemented in C is that we would specify Token as a struct and the individual lexical classes as symbolic constants using #define, and we would need to allocate space for the identifier string, either statically in the token (good enough for our project) or dynamically.

Manual Scanner Implementation

The key method needed in a scanner is one that yields the next token from the input when it is called by the parser. Here is a sketch of how that could look in Java, glossing over all of the details of file handling, character input, etc., and taking liberties with syntax. A C version is quite similar, except that the result token would be a local variable (struct) instead of a newly allocated object on the heap.

   public class Scanner {    // lexical scanner object
      private char nextch;   // = next unprocessed character from source program
      ...
      // Yield next token from input program
      public Token nextToken( ) {
         Token result = new Token( );   // result Token (local variable in C)

         advance nextch past any whitespace and comments;
         switch(nextch) {
            case '(': result.kind = Token.LPAREN; advance nextch; break;
            case ')': result.kind = Token.RPAREN; advance nextch; break;
            case '!': // ! or !=
                      advance nextch;
                      if (nextch == '=') {
                        result.kind = Token.NOTEQUAL; advance nextch;
                      } else result.kind = Token.NOT;
                      break;
                      ...
            case '0': case '1': ... case '9':
                      // integer constant
                      read remaining digits into a String s;
                      result.kind = Token.INT;
                      result.val = Integer(s).intValue();  // use atoi in C for conversion
                      break;
            case 'a': case 'b': ... case 'z':
            case 'A': case 'B': ... case 'Z':
                      // identifier or keyword
                      read rest of identifier into a String s;
                      if (s is a keyword)
                         result.kind = Token.keywordkind;
                      else {
                         result.kind = Token.ID;
                         result.id = s;
                      }
                      break;
         }  // end switch
         
         return result;
     } // end nextToken
  }