CSE 413 Autumn 2006

Lexical Analysis


A compiler front-end can be constructed systematically using the syntax of the language. In principle, we could give a single context-free grammar defining the language down to the character level. For example, a simple language for arithmetic expressions could be defined as follows.

expr ::= term | expr + term
term ::= factor | term * factor
factor ::= ( expr ) | id | int
id ::= ...
int ::= ...

We would need to add more rules to describe that an id, for example, consists of a letter and underscore followed by 0 or more letters, digits, and underscores. We'd need to do the same for int, and we'd also have to include rules defining comments and whitespace. Once that's done we could construct a complete parser for this concrete grammar.

But the result would be complex and slow. Better: separate the character handling details from the task of parsing the input tokens and determining the grammatical structure. Advantages:


The scanner is also usually responsible for reporting lexical errors in the input (for example a=!b; in a Java/C/C++ program, where =! is not a legal token).


Idea: We want a distinct token kind (lexical class) to represent each distinct terminal symbol that appears in the programming language's grammar. Tokens include:

Tokens are not character strings.  They should be implemented as a data structure containing a small integer constant (or enumeration) to identify the lexical class, plus, in the case of numbers and identifiers, the actual number or identifier.  (More details below.)  Also, whitespace and comments are not included as tokens - the parser should never see these.

Principle of Longest Match

Normally, when there is a choice the scanner picks the longest possible string to match a token. For example, given return distance != total; in a program, the scanner should normally analyze this as 5 tokens (RETURN, ID(distance), NOTEQUAL, ID(total), SCOLON), rather than returning individual letters, parts of identifiers, or the ! and = as separate tokens.

Regular Expressions

The lexical grammar of most programming languages can be described by regular expressions, and tokens can be recognized by an appropriate finite automaton. The lexical grammar can either be used to guide construction of a hand-crafted scanner, or can be used as input to a scanner generator program like Lex or JLex (which we will not use in CSE 413).

Definitions: Regular expressions are defined over some alphabet Σ. If re is a regular expression, then L(re) is the language (set of strings) generated by that regular expression. The fundamental definitions are

re L(re) notes
a { a } for each character a contained in Σ
ε { ε } the null string
{  } empty language

If r and s are regular expressions, then the following operations can be applied to them

re L(re) notes
rs L(r)L(s) concatenation
r | s L(r) ∪ L(s) combination (union)
r* L(r)* 0 or more occurrences (Kleene closure)

These basic operations generate all regular expressions. There are, however, some convenient abbreviations that are commonly used.

abbreviation meaning notes
r+ (rr*) 1 or more occurrences
r? (r | ε ) 0 or 1 occurrence
[a-z] (a | b | ... | z) any character in given range
[abc] (a | b | c) any of the given characters


re meaning
+ plus symbol
!= ! followed by =
return keyword return (6 adjacent characters)
[a-zA-Z][a-zA-Z0-9_]* typical programming language identifier

Finally, some systems allow abbreviations of the form name ::= re to make it easier to write definitions.

Example: Regular expression description of numeric constants in a typical programming language (including integer and floating point).

digit ::= [0-9]
digits ::= digit+
number ::= digits ( . digits )?  (  [eE] (+ | -)? digits )?

A finite automata can be constructed directly from this definition to recognize numeric constants.

Token Representation

What follows is a sketch of how a hand-written scanner might be implemented. A key decision is the representation of the lexical tokens. In Java, we can use something like the following class.

   public class Token {   // representation of a lexical token
      public int kind;    // lexical class of this Token
      public int val;     // if kind is INT, val = integer value
      public String id;   // if kind is ID, id = the identifier

      // lexical classes
      public static final int EOF = 0;    // end-of-file
      public static final int ID  = 1;    // identifier
      public static final int INT = 2;    // integer constant
      public static final int LPAREN = 3; // (
      public static final int RPAREN = 4; // )
      public static final int SCOLON = 5; // ;
      public static final int RETURN = 6; // return keyword
      public static final int WHILE = 7;  // while keyword
      ...                                 // etc.

The token class may well contain some other details, but it is basically a simple data structure.  The main difference when this is implemented in C is that we would specify Token as a struct and the individual lexical classes as symbolic constants using #define, and we would need to allocate space for the identifier string, either statically in the token (good enough for our project) or dynamically.

Manual Scanner Implementation

The key method needed in a scanner is one that yields the next token from the input when it is called by the parser. Here is a sketch of how that could look in Java, glossing over all of the details of file handling, character input, etc., and taking liberties with syntax. A C version is quite similar, except that the result token would be a local variable (struct) instead of a newly allocated object on the heap.

   public class Scanner {    // lexical scanner object
      private char nextch;   // = next unprocessed character from source program
      // Yield next token from input program
      public Token nextToken( ) {
         Token result = new Token( );   // result Token (local variable in C)

         advance nextch past any whitespace and comments;
         switch(nextch) {
            case '(': result.kind = Token.LPAREN; advance nextch; break;
            case ')': result.kind = Token.RPAREN; advance nextch; break;
            case '!': // ! or !=
                      advance nextch;
                      if (nextch == '=') {
                        result.kind = Token.NOTEQUAL; advance nextch;
                      } else result.kind = Token.NOT;
            case '0': case '1': ... case '9':
                      // integer constant
                      read remaining digits into a String s;
                      result.kind = Token.INT;
                      result.val = Integer(s).intValue();  // use atoi in C for conversion
            case 'a': case 'b': ... case 'z':
            case 'A': case 'B': ... case 'Z':
                      // identifier or keyword
                      read rest of identifier into a String s;
                      if (s is a keyword)
                         result.kind = Token.keywordkind;
                      else {
                         result.kind = Token.ID;
                         result.id = s;
         }  // end switch
         return result;
     } // end nextToken