A compiler front-end can be constructed systematically using the syntax of the language. In principle, we could give a single context-free grammar defining the language down to the character level. For example, a simple language for arithmetic expressions could be defined as follows.
expr ::= term | expr + term
term ::= factor | term * factor
factor ::= ( expr ) | id | int
id ::= ...
int ::= ...
We would need to add more rules to describe that an id, for example, consists of a letter or underscore followed by 0 or more letters, digits, and underscores. We'd need to do the same for int, and we'd also have to include rules defining comments and whitespace. Once that's done, we could construct a complete parser for this concrete grammar.
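For instance, one plausible rendering of those extra rules, written in the same style as the grammar above (the names letter and digit are illustrative, not taken from any particular language definition):

letter ::= a | b | ... | z | A | B | ... | Z | _
digit ::= 0 | 1 | ... | 9
id ::= letter | id letter | id digit
int ::= digit | int digit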
But the result would be complex and slow. Better: separate the character-handling details from the task of reading the stream of input tokens and determining the grammatical structure. The character-level work goes in a scanner (lexical analyzer), which delivers a stream of tokens; the parser then works only with those tokens. This keeps both parts simpler and faster, and lets each use techniques suited to its job.
The scanner is also usually responsible for reporting lexical errors in the input (for example a:=b; in a Java/C/C++ program, where := is not a legal token).
Idea: We want a distinct token kind (lexical class) to represent each distinct terminal symbol that appears in the programming language's grammar. Possibilities include a class for each operator and punctuation symbol, classes like ID and INT for identifiers and integer constants, and the language's keywords (if, else, while, for, return, void, int, ...). Each keyword should also be a distinct lexical class.

In most languages, when there is a choice, the scanner picks the longest possible string to match a token (the "maximal munch" rule). For example, given return distance != total; in a program, the scanner should normally analyze this as 5 tokens (RETURN, ID(distance), NOTEQUAL, ID(total), SCOLON), rather than returning individual letters, parts of identifiers, or the ! and = as separate tokens.
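As a small illustration of the longest-match rule, here is one way the identifier case might be handled in Java. This is a hypothetical helper (the names MaximalMunch and scanIdentifierEnd are invented for this example), scanning from a String starting at position i rather than from a file:

public class MaximalMunch {
    // Maximal munch for identifiers: consume the longest possible run of
    // letters, digits, and underscores starting at position i.
    static int scanIdentifierEnd(String input, int i) {
        int j = i;
        while (j < input.length()
                && (Character.isLetterOrDigit(input.charAt(j)) || input.charAt(j) == '_')) {
            j++;
        }
        return j;    // the identifier is input.substring(i, j)
    }

    public static void main(String[] args) {
        String src = "distance != total";
        System.out.println(src.substring(0, scanIdentifierEnd(src, 0)));  // prints distance
    }
}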
The lexical grammar of most programming languages can be described by regular expressions, and tokens can be recognized by an appropriate finite automaton. The lexical grammar can either be used to guide construction of a hand-crafted scanner, or can be used as input to a scanner generator program like Lex or JLex (which we will not use in CSE 413).
Definitions: Regular expressions are defined over some alphabet Σ. If re is a regular expression, then L(re) is the language (set of strings) generated by that regular expression. The fundamental definitions are:

re | L(re) | notes
---|-------|------
a | { a } | for each a ∈ Σ
ε | { ε } | ε is the null (empty) string
Ø | { } | the empty language
If r and s are regular expressions, then the following operations can be applied to them:

re | L(re) | notes
---|-------|------
rs | L(r)L(s) | concatenation
r \| s | L(r) ∪ L(s) | combination (union)
r* | L(r)* | 0 or more occurrences (Kleene closure)
These basic operations generate all regular expressions. There are, however, some convenient abbreviations that are commonly used.
abbreviation | meaning | notes
-------------|---------|------
r+ | (rr*) | 1 or more occurrences
r? | (r \| ε) | 0 or 1 occurrence
[a-z] | (a \| b \| ... \| z) | any character in the given range
[abc] | (a \| b \| c) | any one of the given characters
Examples:

re | meaning
---|--------
+ | the plus symbol
!= | ! followed by =
return | the keyword return (6 adjacent characters)
[a-zA-Z][a-zA-Z0-9_]* | a typical programming language identifier
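These patterns can be tried out directly with Java's java.util.regex package (useful only for experimenting with the expressions; a real scanner would not normally be built this way). The class name IdentifierDemo is invented for this example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierDemo {
    public static void main(String[] args) {
        // the identifier pattern from the table above
        Pattern id = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");
        Matcher m = id.matcher("return distance != total;");
        while (m.find()) {
            System.out.println(m.group());   // prints return, distance, total
        }
    }
}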
Finally, some systems allow abbreviations of the form name ::= re to make it easier to write definitions.
Example: Regular expression description of numeric constants in a typical programming language (including integer and floating point).
digit ::= [0-9]
digits ::= digit+
number ::= digits ( . digits )? ( [eE] (+ | -)? digits )?
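Translated into java.util.regex syntax, the same definition can be written out and checked (the backslashes escape the dot, which would otherwise match any character; the class name NumberDemo is invented for this example):

import java.util.regex.Pattern;

public class NumberDemo {
    public static void main(String[] args) {
        // digits ( . digits )? ( [eE] (+ | -)? digits )?
        Pattern number = Pattern.compile("[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?");
        System.out.println(number.matcher("42").matches());       // true
        System.out.println(number.matcher("3.14e-2").matches());  // true
        System.out.println(number.matcher(".5").matches());       // false: digits required before the point
    }
}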
A finite automaton can be constructed directly from this definition to recognize numeric constants.
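Here is one way such an automaton might be hand-coded in Java. This is a hypothetical sketch, not code from these notes: the state numbers are arbitrary, and states 1 (integer digits), 4 (fraction digits), and 6 (exponent digits) are the accepting states.

public class NumberDFA {
    // Simulate the DFA for: digits ( . digits )? ( [eE] (+ | -)? digits )?
    // State -1 is a dead (rejecting) state.
    static boolean isNumber(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (state) {
                case 0:                                  // start: need a digit
                    state = Character.isDigit(c) ? 1 : -1; break;
                case 1:                                  // integer digits (accepting)
                    if (Character.isDigit(c)) state = 1;
                    else if (c == '.') state = 2;
                    else if (c == 'e' || c == 'E') state = 3;
                    else state = -1;
                    break;
                case 2:                                  // just saw '.': need a digit
                    state = Character.isDigit(c) ? 4 : -1; break;
                case 4:                                  // fraction digits (accepting)
                    if (Character.isDigit(c)) state = 4;
                    else if (c == 'e' || c == 'E') state = 3;
                    else state = -1;
                    break;
                case 3:                                  // just saw e/E: sign or digit
                    if (c == '+' || c == '-') state = 5;
                    else if (Character.isDigit(c)) state = 6;
                    else state = -1;
                    break;
                case 5:                                  // just saw sign: need a digit
                    state = Character.isDigit(c) ? 6 : -1; break;
                case 6:                                  // exponent digits (accepting)
                    state = Character.isDigit(c) ? 6 : -1; break;
            }
            if (state == -1) return false;               // dead state: reject
        }
        return state == 1 || state == 4 || state == 6;   // accept only in accepting states
    }

    public static void main(String[] args) {
        System.out.println(isNumber("3.14e-2"));  // true
        System.out.println(isNumber("3."));       // false: fraction digits required
    }
}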
What follows is a sketch of how a hand-written scanner might be implemented. A key decision is the representation of the lexical tokens. In Java, we can use something like the following class.
public class Token {                    // representation of a lexical token
    public int kind;                    // lexical class of this Token
    public int val;                     // if kind is INT, val = integer value
    public String id;                   // if kind is ID, id = the identifier

    // lexical classes
    public static final int EOF = 0;    // end-of-file
    public static final int ID = 1;     // identifier
    public static final int INT = 2;    // integer constant
    public static final int LPAREN = 3; // (
    public static final int RPAREN = 4; // )
    public static final int SCOLON = 5; // ;
    public static final int RETURN = 6; // return keyword
    public static final int WHILE = 7;  // while keyword
    ...                                 // etc.
}
The token class may well contain some constructors and other details, but it is basically a simple data structure.
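For example, constructors along these lines might be added inside the Token class above (a hypothetical elaboration; the no-argument constructor is kept so that code like new Token() below still works):

public Token() { }                                                      // kind filled in later
public Token(int kind) { this.kind = kind; }                            // e.g., new Token(Token.LPAREN)
public Token(int kind, int val) { this.kind = kind; this.val = val; }   // for INT tokens
public Token(int kind, String id) { this.kind = kind; this.id = id; }   // for ID tokens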
The key method needed in a scanner is one that yields the next token from the input when it is called by the parser. Here is a sketch of how that would look, glossing over all of the details of file handling, character input, etc., and taking liberties with syntax.
public class Scanner {          // lexical scanner object
    private char nextch;        // next unprocessed character from source program
    ...

    // Yield next token from input program
    public Token nextToken() {
        Token result = new Token();     // result Token
        advance nextch past any whitespace and comments;
        switch (nextch) {
            case '(':
                result.kind = Token.LPAREN; advance nextch; break;
            case ')':
                result.kind = Token.RPAREN; advance nextch; break;
            case '!':                   // ! or !=
                advance nextch;
                if (nextch == '=') {
                    result.kind = Token.NOTEQUAL; advance nextch;
                } else
                    result.kind = Token.NOT;
                break;
            ...
            case '0': case '1': ... case '9':           // integer constant
                read remaining digits into a String s;
                result.kind = Token.INT;
                result.val = Integer.parseInt(s);
                break;
            case 'a': case 'b': ... case 'z':
            case 'A': case 'B': ... case 'Z':           // identifier or keyword
                read rest of identifier into a String s;
                if (s is a keyword)
                    result.kind = Token.keywordkind;
                else {
                    result.kind = Token.ID;
                    result.id = s;
                }
                break;
        } // end switch
        return result;
    } // end nextToken
}
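To make the sketch concrete, here is one possible runnable version for a small subset of the token classes. This is a hypothetical elaboration, not the official code: the name SimpleScanner is invented, it scans from a String using '\0' as an end-of-input sentinel, and it assumes NOT and NOTEQUAL are among the "etc." lexical classes in the Token class above.

public class SimpleScanner {
    private final String src;       // entire source program
    private int pos = 0;            // index of the next unprocessed character
    private char nextch;            // next unprocessed character

    public SimpleScanner(String src) {
        this.src = src;
        advance();
    }

    // advance nextch to the next character, '\0' when input is exhausted
    private void advance() {
        nextch = (pos < src.length()) ? src.charAt(pos++) : '\0';
    }

    public Token nextToken() {
        Token result = new Token();
        while (Character.isWhitespace(nextch)) advance();   // skip whitespace
        switch (nextch) {
            case '\0': result.kind = Token.EOF; break;
            case '(': result.kind = Token.LPAREN; advance(); break;
            case ')': result.kind = Token.RPAREN; advance(); break;
            case ';': result.kind = Token.SCOLON; advance(); break;
            case '!':                                       // ! or !=
                advance();
                if (nextch == '=') { result.kind = Token.NOTEQUAL; advance(); }
                else result.kind = Token.NOT;
                break;
            default:
                if (Character.isDigit(nextch)) {            // integer constant
                    StringBuilder sb = new StringBuilder();
                    while (Character.isDigit(nextch)) { sb.append(nextch); advance(); }
                    result.kind = Token.INT;
                    result.val = Integer.parseInt(sb.toString());
                } else if (Character.isLetter(nextch)) {    // identifier or keyword
                    StringBuilder sb = new StringBuilder();
                    while (Character.isLetterOrDigit(nextch) || nextch == '_') {
                        sb.append(nextch); advance();
                    }
                    String s = sb.toString();
                    if (s.equals("return")) result.kind = Token.RETURN;
                    else if (s.equals("while")) result.kind = Token.WHILE;
                    else { result.kind = Token.ID; result.id = s; }
                } else {
                    // report a lexical error, skip the offending character, retry
                    System.err.println("lexical error: unexpected '" + nextch + "'");
                    advance();
                    return nextToken();
                }
        }
        return result;
    }
}

A parser would drive this by calling nextToken() repeatedly until it receives a token whose kind is Token.EOF.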