CSE 341: Grammars, Language Specification, and Interpreters

This is not a course in language implementation, but language specification and interpreter implementation are actually quite closely related topics. In this lecture, we will briefly explore this relationship.

You should come out of this lecture with a rough idea of how to pursue a language design from beginning to end --- from syntax, through semantics, through a prototype implementation in an interpreter.

We have generally used an informal English description of syntax and semantics to specify the meaning of ML language constructs. However, language designers have more formal tools for doing this as well:

Grammars: Specifying syntax

There is an entire branch of mathematics and computer science that formalizes the properties of "languages" over strings and their syntax. You studied, or will study, the properties of formal languages (along with the grammars that generate them and the automata, or abstract machines, that recognize them) in your theory courses. Here we only mention a few aspects of formal language theory directly relevant to language design. Some terminology:

Backus-Naur Form (context-free languages)

Backus-Naur Form, also called BNF, is the usual primary formalism for specifying language syntax. Named after John Backus and Peter Naur, BNF describes a class of languages called context-free languages. A context-free grammar consists of four elements:

  1. A set of terminals, which are individual tokens in the language. Tokens comprise the alphabet of a BNF grammar. Typical examples of terminals include keywords (e.g., if), operators (e.g., +), identifiers, and literals.
  2. A set of nonterminals, which are higher-level constructs formed by combining sequences of terminals or nonterminals. For example, the if/then/else construct in ML is a nonterminal --- it is built up out of many tokens.
  3. A set of productions, which are rules for building up nonterminals from terminals and nonterminals. In BNF, productions are written using the ::= operator, with a nonterminal on the left and one or more terminals or nonterminals on the right:
    ifExpr ::= if expr then expr else expr
    Note our typographical conventions for terminals (e.g., if) and nonterminals (e.g., expr). You will often see slightly different conventions --- e.g., writing terminals in "quotes", or nonterminals in <angle brackets>. Productions may have more than one case; cases are separated by vertical bars:
    expr ::= binOpExpr | ifExpr | ...
  4. A starting nonterminal, which is the "topmost" nonterminal and represents a complete program. In ML, this might be topLevels:
    topLevels ::= topLevel | topLevel topLevels
    topLevel ::= expr | decl
    Note that we use a recursive definition so that there can be more than one toplevel in a program.

In practice, rather than declaring the terminals and nonterminals separately and explicitly, human users of BNF usually simply list the BNF productions, using typographical conventions or quotation marks to indicate which names are terminals and nonterminals. Here is a simple BNF grammar for the subset of ML whose evaluator you implemented in Homework 3:

constLiteral ::= BOOL_LITERAL | INT_LITERAL

expr ::= constLiteral
       | IDENTIFIER
       | expr + expr
       | not expr
       | expr = expr
       | let decl in expr end
       | if expr then expr else expr
       | ( expr )

decl ::= val IDENTIFIER = expr

Regular expressions (regular languages)

BNF is sufficiently expressive to encode the complete syntax of programming languages, but by itself it is rather verbose for specifying the allowed forms of individual tokens. This is why the BNF grammar itself will typically treat an identifier as an atomic symbol, rather than going all the way down to the character level.

The process of taking raw strings and turning them into streams of tokens is called lexical analysis or "lexing"; sometimes tokens are called lexemes.

To define the format of lexemes, language specifications typically use regular expressions (or regexps). A regular expression is a string which can be interpreted as a specification of a language.

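A regular expression is defined recursively as follows:

  1. A single character c is a regular expression; it matches the one-character string "c".
  2. If r and s are regular expressions, then the concatenation rs is a regular expression; it matches any string matched by r followed by any string matched by s.
  3. If r and s are regular expressions, then the alternation r|s is a regular expression; it matches any string matched by r or by s.
  4. If r is a regular expression, then the repetition r* is a regular expression; it matches zero or more consecutive strings, each matched by r.
  5. If r is a regular expression, then so is (r); parentheses only group.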

The regular expression denoting a keyword is simple: if matches "if", else matches "else", etc.

In many languages, an integer literal is a sequence of one or more digits. Here is a regular expression defining integer literals:

(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

In other words, an integer literal is any digit, followed by zero or more occurrences of any digit.
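
To make the "digit, then zero or more digits" reading concrete, here is a minimal hand-written recognizer in ML. The function name isIntLiteral is hypothetical, and a real lexer would normally be generated rather than written this way; this is only a sketch of what the regular expression accepts.

```sml
(* Recognizes the language of (0|...|9)(0|...|9)*:
   one digit, followed by zero or more digits. *)
fun isIntLiteral (s : string) : bool =
    case String.explode s of
        nil => false                  (* the empty string has no leading digit *)
      | c :: cs => Char.isDigit c andalso List.all Char.isDigit cs

val a = isIntLiteral "341"   (* true *)
val b = isIntLiteral ""      (* false *)
val c = isIntLiteral "34a"   (* false *)
```

The first pattern-match arm corresponds to the mandatory leading digit; List.all over the rest of the characters plays the role of the star.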

Syntactic sugars

As the integer literal example shows, raw regular expressions are cumbersome. In practice, people add several syntactic sugars over BNF and regular expressions, in order to make them more convenient to use.

Syntactic sugars for regexps

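Common syntactic sugars for regular expressions include:

  1. Character classes: [abc] matches any one of the listed characters, and a range such as [0-9] abbreviates (0|1|2|3|4|5|6|7|8|9).
  2. One or more: r+ abbreviates rr*.
  3. Optional: r? matches either r or the empty string.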

These turn out to be sufficiently expressive to encode most things programming language designers are interested in. For example:

IDENTIFIER = [A-Za-z][A-Za-z0-9]*
INTEGER    = [0-9]+
REAL       = [0-9]+\.[0-9]+
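
As a rough illustration of what these three definitions accept, the following ML sketch classifies a single lexeme by hand. The names token, IdentTok, IntTok, RealTok, and classify are hypothetical, and the dot in REAL is read as a literal decimal point. A generated lexer would also split the input into lexemes and check keywords first; this sketch does neither.

```sml
datatype token = IdentTok of string | IntTok of string | RealTok of string

(* [0-9]+ : a non-empty string of digits *)
fun allDigits s = s <> "" andalso List.all Char.isDigit (String.explode s)

(* [A-Za-z][A-Za-z0-9]* : a letter, then letters or digits *)
fun isIdentifier s =
    case String.explode s of
        nil => false
      | c :: cs => Char.isAlpha c andalso List.all Char.isAlphaNum cs

(* [0-9]+ "." [0-9]+ : exactly one dot, with digits on both sides *)
fun isReal s =
    case String.fields (fn c => c = #".") s of
        [whole, frac] => allDigits whole andalso allDigits frac
      | _ => false

(* Returns NONE when the lexeme matches none of the three forms. *)
fun classify (s : string) : token option =
    if isIdentifier s then SOME (IdentTok s)
    else if allDigits s then SOME (IntTok s)
    else if isReal s then SOME (RealTok s)
    else NONE
```

For instance, classify "x1" yields SOME (IdentTok "x1"), classify "3.14" yields SOME (RealTok "3.14"), and classify "3." yields NONE because REAL requires digits after the dot.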

Extended BNF

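Extended BNF (EBNF) means any variation of BNF with more features. Here are some BNF extensions I like to use, both of which appear in the MicroC grammar later in these notes:

  1. x* means zero or more occurrences of x.
  2. [x] means that x is optional.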

The file in your interpreter project named ast.sml has a sample EBNF grammar for the subset of ML that you will be implementing in your project.

Aside: parser and lexer generators

Humans are not the only consumers of BNF and regular expressions. Programs called parser generators and lexer generators take variants of BNF and regular expressions, respectively, as input and generate code that parses (or lexes) the specified language.

For most sensible languages, human beings no longer need to write parsers by hand. Parser generators available today include yacc, bison, javacc, sablecc, and mlyacc (the last of these is used in your ML project, although you do not have to deal with it in order to complete the project).

We won't be covering parser generators further in this class. If you're curious, I highly recommend a course in language implementation (compilers).

Detour: Operational semantics revisited

Earlier in the quarter, we discussed language constructs using the "Language Construct X in a Nutshell" format. The style of semantic specification we used is actually an informal (plain English) variant of operational semantics.

There also exist more formal mathematical notations, which have the virtue of compactness and clarity. Fig. 1 shows evaluation rules in "big-step operational semantics" for if expressions in ML:

[if inference rule]
Fig. 1: Formal notation for operational semantics of if expressions in ML.
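
The figure itself is not reproduced in this copy of the notes. As a rough sketch of what big-step rules for if typically look like, here is one possible rendering in LaTeX. The judgment form "Env ⊢ e ⇓ v" (read: in environment Env, expression e evaluates to value v) is an assumed notation and may differ from the original figure.

```latex
\[
\frac{\mathit{Env} \vdash e_1 \Downarrow \mathtt{true}
      \qquad
      \mathit{Env} \vdash e_2 \Downarrow v}
     {\mathit{Env} \vdash \mathtt{if}\ e_1\ \mathtt{then}\ e_2\ \mathtt{else}\ e_3 \Downarrow v}
\qquad
\frac{\mathit{Env} \vdash e_1 \Downarrow \mathtt{false}
      \qquad
      \mathit{Env} \vdash e_3 \Downarrow v}
     {\mathit{Env} \vdash \mathtt{if}\ e_1\ \mathtt{then}\ e_2\ \mathtt{else}\ e_3 \Downarrow v}
\]
```

Each rule is read from top to bottom: if the judgments above the line hold, then the judgment below the line holds.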

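This looks foreign, but it's really just our familiar semantics from "Language Construct X in a Nutshell" written using a more compact notation.

In English, the above rules translate to:

  1. If e1 evaluates to true in an environment Env, and e2 evaluates to a value v in Env, then if e1 then e2 else e3 evaluates to v in Env.
  2. If e1 evaluates to false in an environment Env, and e3 evaluates to a value v in Env, then if e1 then e2 else e3 evaluates to v in Env.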

This should sound rather familiar to you, from previous lectures and from the evaluator you implemented in Homework 3. Note that keeping the environment when evaluating subexpressions is critical --- otherwise, you'll have no way to evaluate variable references in that subexpression.

In practice, only academic language designers use the mathematical style of language specification; but (in the humble opinion of your instructor) this is unfortunate, because the mathematical style is usually clearer and less open to ambiguous interpretation.

Case study: MicroC, and an interpreter thereof

We have seen that a language design requires three components:

  1. A syntax
  2. A dynamic semantics
  3. A static semantics (if it is a statically typed language)

In this section, we will outline the syntax and dynamic semantics of MicroC, a tiny C-like language (this subset is mostly shared with Java, so we could also call it MicroJava). Then we will demonstrate the skeleton of an interpreter for it.

Language design

Syntax

program ::= decl* stmt*

decl ::= typeName IDENTIFIER [= expr] ;

stmt ::= expr ;
       | if ( expr ) stmt else stmt
       | while ( expr ) stmt
       | { stmt*\; }

expr ::= INT_LITERAL
       | FLOAT_LITERAL
       | IDENTIFIER
       | IDENTIFIER = expr
       | expr + expr
       | expr == expr

typeName ::= int | float

Notice that we have only int and float types. We will use the C convention: all non-zero values are implicitly interpreted as true, and zero is implicitly interpreted as false.

Operational semantics

We don't have time to do a full operational semantics of MicroC, but let's do the most "interesting" expression. Consider assignment:

IDENTIFIER = expr

Assignment gives the variable with the given name a new value. This is interesting because, unlike any of the expressions we've seen in our ML evaluator, it has a side effect: it updates the variable's current value. To implement this semantics, each expression must return not only a value, but also a new environment containing the updated value.

Therefore, all evaluation rules in our (dynamic) operational semantics will return two values: the return value of the expression, and the updated environment.

The only other thing worth noting is that in C and Java, an assignment expression returns the value of the left-hand side after the assignment has been performed.

Our dynamic semantics of assignment is therefore as follows:

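If an expression e evaluates in an environment Env to a value v and an updated environment Env1, then the expression varName = e evaluates to v and produces the updated environment Env', where Env' is Env1 with varName = v substituted for the old binding of varName. (Env1 differs from Env only when e itself contains an assignment, as in x = (y = 3).)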
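
The same rule can be written as a single inference rule. The notation below is an assumption made for illustration: angle brackets pair an environment with an expression on the left of the arrow and a value with an environment on the right, and Env1[x ↦ v] means Env1 with the binding of x replaced by v. Threading Env1 through reflects the requirement that every evaluation rule also returns the updated environment.

```latex
\[
\frac{\langle \mathit{Env},\ e \rangle \Downarrow \langle v,\ \mathit{Env}_1 \rangle}
     {\langle \mathit{Env},\ x = e \rangle \Downarrow
      \langle v,\ \mathit{Env}_1[x \mapsto v] \rangle}
\]
```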

Interpreter skeleton

Writing an interpreter for a language generally involves the following steps:

  1. Define a data type to represent each syntactic form in your language.
  2. Write a parser that transforms strings into the data type you defined in the step above. (Usually, you will use a parser generator.)
  3. Define a data type to represent values and the heap (or environment). Usually, this is a map from names to values (an environment) or from addresses to values (a heap).
  4. Implement the dynamic semantics of each syntactic form (what it does at runtime), based on its operational semantics.
  5. For statically typed languages, implement the static semantics of each syntactic form (how it is typechecked at "compile" time), based on its typing rules.

(The above presumes that you have some mechanism for implementing the runtime services of the language. For example, if your source language is garbage collected, then either your implementation language must be garbage collected (like ML) or you must link a garbage collection library to your implementation.)

In the following, we'll sketch steps 1, 3, and 4 of the above for MicroC.

ML data type for MicroC syntactic forms

datatype typeName = TInt | TFloat

datatype expr = IntExpr of int
              | FloatExpr of real
              | AssignExpr of string * expr
              | VarExpr of string
              | AddExpr of expr * expr
              | EqExpr of expr * expr

datatype stmt = EvalStmt of expr
              | IfStmt of expr * stmt * stmt
              | WhileStmt of expr * stmt
              | BlockStmt of stmt list

datatype decl = Decl of typeName * string * expr

datatype program = Program of decl list * stmt list

Notice that we list the datatypes in the reverse of the grammar's order, because each nonterminal's datatype uses "smaller" nonterminals in its definition, and ML requires a datatype to be defined before it is used. We could have gotten around this by making the datatypes mutually recursive, but it's not necessary.
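
To see how source text maps onto these datatypes, here is a small hand-built syntax tree. The datatypes are repeated so the snippet runs on its own; the sample program and the name example are made up for illustration, and in a real implementation the parser would construct this value.

```sml
datatype typeName = TInt | TFloat

datatype expr = IntExpr of int
              | FloatExpr of real
              | AssignExpr of string * expr
              | VarExpr of string
              | AddExpr of expr * expr
              | EqExpr of expr * expr

datatype stmt = EvalStmt of expr
              | IfStmt of expr * stmt * stmt
              | WhileStmt of expr * stmt
              | BlockStmt of stmt list

datatype decl = Decl of typeName * string * expr

datatype program = Program of decl list * stmt list

(* Syntax tree for the MicroC source text:
       int x = 1;
       while (x == 1) { x = x + 2; }           *)
val example =
    Program ([Decl (TInt, "x", IntExpr 1)],
             [WhileStmt (EqExpr (VarExpr "x", IntExpr 1),
                         BlockStmt [EvalStmt (AssignExpr ("x",
                                      AddExpr (VarExpr "x", IntExpr 2)))])])
```

Note that punctuation such as parentheses, braces, and semicolons does not appear in the tree; the nesting of constructors carries that structure instead.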

Implementing MicroC values and environments

datatype value = IntVal of int | FloatVal of real

type environment = (string * value) list

(* returns the value bound to name in e *)
fun lookupEnv(e:environment, name:string) = ...

(* returns the environment e with name's binding updated to newValue *)
fun updateEnv(e:environment, name:string, newValue:value) = ...
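
The bodies of these two functions are left open above. One possible way to fill them in, as a minimal sketch: the exception name UnboundVariable is hypothetical, and having updateEnv add a binding when the name is not yet present is an assumption, made because declarations are evaluated with updateEnv starting from an empty environment.

```sml
datatype value = IntVal of int | FloatVal of real

type environment = (string * value) list

(* Hypothetical exception for a name with no binding. *)
exception UnboundVariable of string

(* Returns the value bound to name in e. *)
fun lookupEnv (nil : environment, name : string) : value =
      raise UnboundVariable name
  | lookupEnv ((n, v) :: rest, name) =
      if n = name then v else lookupEnv (rest, name)

(* Returns e with name's binding replaced by newValue;
   if name is not bound in e, a new binding is appended. *)
fun updateEnv (nil : environment, name : string, newValue : value) : environment =
      [(name, newValue)]
  | updateEnv ((n, v) :: rest, name, newValue) =
      if n = name then (name, newValue) :: rest
      else (n, v) :: updateEnv (rest, name, newValue)

val env1 = updateEnv (nil, "x", IntVal 1)
val env2 = updateEnv (env1, "x", IntVal 2)   (* still one binding for x *)
```

Because both functions return new lists rather than mutating anything, the old environment env1 remains valid after env2 is built, which is what lets the evaluator pass environments around as ordinary values.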

Implementing an evaluator

We will require a function to evaluate each syntactic form. It's usually easiest to work your way up from expressions. Recall that expressions are evaluated in an environment, and return both their value and the new environment that results from evaluating any assignments that may be in the expression. Hence, our expression evaluation function will have type

environment * expr -> value * environment

Here are a few interesting cases:

fun evalExpr(env, IntExpr i) = (IntVal i, env)
  | evalExpr(env, VarExpr s) = (lookupEnv(env, s), env)
  | evalExpr(env, AssignExpr(id, exp)) =
    let
        val (v, newEnv) = evalExpr(env, exp)
    in
        (v, updateEnv(newEnv, id, v))
    end
  | ...
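
The elided cases follow the same pattern of threading the environment from one subexpression to the next. Below is one way the complete function might look, as a self-contained sketch. Several choices here are assumptions rather than part of the MicroC definition above: operands are evaluated left to right, mixed int/float operands are promoted to float (the helper toReal is hypothetical), == yields IntVal 1 or IntVal 0, and lookupEnv and updateEnv are simple stand-ins.

```sml
datatype expr = IntExpr of int | FloatExpr of real
              | AssignExpr of string * expr | VarExpr of string
              | AddExpr of expr * expr | EqExpr of expr * expr
datatype value = IntVal of int | FloatVal of real
type environment = (string * value) list

(* Stand-ins for the environment helpers. *)
fun lookupEnv (nil, name) = raise Fail ("unbound variable " ^ name)
  | lookupEnv ((n, v) :: rest, name : string) =
      if n = name then v else lookupEnv (rest, name)
fun updateEnv (env, name : string, v : value) =
    (name, v) :: List.filter (fn (n, _) => n <> name) env

(* Assumed promotion rule: ints become floats in mixed arithmetic. *)
fun toReal (IntVal i) = Real.fromInt i
  | toReal (FloatVal f) = f

fun evalExpr (env, IntExpr i) = (IntVal i, env)
  | evalExpr (env, FloatExpr f) = (FloatVal f, env)
  | evalExpr (env, VarExpr s) = (lookupEnv (env, s), env)
  | evalExpr (env, AssignExpr (id, exp)) =
    let val (v, newEnv) = evalExpr (env, exp)
    in (v, updateEnv (newEnv, id, v)) end
  | evalExpr (env, AddExpr (e1, e2)) =
    let
        val (v1, env1) = evalExpr (env, e1)    (* left operand first *)
        val (v2, env2) = evalExpr (env1, e2)   (* right operand sees env1 *)
        val sum = case (v1, v2) of
                      (IntVal a, IntVal b) => IntVal (a + b)
                    | _ => FloatVal (toReal v1 + toReal v2)
    in (sum, env2) end
  | evalExpr (env, EqExpr (e1, e2)) =
    let
        val (v1, env1) = evalExpr (env, e1)
        val (v2, env2) = evalExpr (env1, e2)
        val same = case (v1, v2) of
                       (IntVal a, IntVal b) => a = b
                     | _ => Real.== (toReal v1, toReal v2)
    in (IntVal (if same then 1 else 0), env2) end

(* x = (y = 3) + 4 : the inner assignment's effect must survive. *)
val (result, finalEnv) =
    evalExpr (nil, AssignExpr ("x", AddExpr (AssignExpr ("y", IntExpr 3), IntExpr 4)))
```

After the last binding, result is IntVal 7 and finalEnv binds both x (to 7) and y (to 3). Passing env rather than env1 to the second operand would silently drop the binding of y.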

Next, we must implement an evaluation function for statements. Like expressions, statements may have side effects; unlike expressions, statements do not have a value. Hence, our statement evaluation function will have type

environment * stmt -> environment

Here are a couple of cases:

fun evalStmt(env, EvalStmt e) =
    let val (_, newEnv) = evalExpr(env, e) in newEnv end
  | evalStmt(env, IfStmt(e, s1, s2)) =
    let
        val (v, newEnv) = evalExpr(env, e)
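        (* C convention: zero is false, any non-zero value is true.
           real is not an equality type in ML, hence the two comparisons. *)
        val vIsZero =
            case v of IntVal i => i = 0
                    | FloatVal f => f <= 0.0 andalso f >= 0.0
    in
        if vIsZero then
            evalStmt(newEnv, s2)
        else
            evalStmt(newEnv, s1)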
    end
  | ...
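
The remaining statement forms, while and blocks, can be handled in the same style. The sketch below is self-contained, so it uses trimmed, int-only copies of the datatypes (no float values) and stand-in environment helpers; these simplifications are assumptions made to keep the example short. It follows the C convention stated earlier: a non-zero condition is true.

```sml
(* Trimmed, int-only copies of the definitions, so the sketch runs on its own. *)
datatype expr = IntExpr of int | VarExpr of string
              | AssignExpr of string * expr
              | AddExpr of expr * expr | EqExpr of expr * expr
datatype stmt = EvalStmt of expr
              | IfStmt of expr * stmt * stmt
              | WhileStmt of expr * stmt
              | BlockStmt of stmt list
type environment = (string * int) list

fun lookupEnv (nil, name) = raise Fail ("unbound variable " ^ name)
  | lookupEnv ((n, v) :: rest, name : string) =
      if n = name then v else lookupEnv (rest, name)
fun updateEnv (env, name : string, v : int) =
    (name, v) :: List.filter (fn (n, _) => n <> name) env

fun evalExpr (env, IntExpr i) = (i, env)
  | evalExpr (env, VarExpr s) = (lookupEnv (env, s), env)
  | evalExpr (env, AssignExpr (id, e)) =
    let val (v, env1) = evalExpr (env, e) in (v, updateEnv (env1, id, v)) end
  | evalExpr (env, AddExpr (e1, e2)) =
    let val (v1, env1) = evalExpr (env, e1)
        val (v2, env2) = evalExpr (env1, e2)
    in (v1 + v2, env2) end
  | evalExpr (env, EqExpr (e1, e2)) =
    let val (v1, env1) = evalExpr (env, e1)
        val (v2, env2) = evalExpr (env1, e2)
    in (if v1 = v2 then 1 else 0, env2) end

fun evalStmt (env, EvalStmt e) = #2 (evalExpr (env, e))
  | evalStmt (env, IfStmt (e, s1, s2)) =
    let val (v, env1) = evalExpr (env, e)
    in if v <> 0 then evalStmt (env1, s1) else evalStmt (env1, s2) end
  | evalStmt (env, WhileStmt (e, body)) =
    let val (v, env1) = evalExpr (env, e)
    in
        if v <> 0
        then evalStmt (evalStmt (env1, body), WhileStmt (e, body))  (* run body, then re-test *)
        else env1                                                   (* condition false: done *)
    end
  | evalStmt (env, BlockStmt stmts) =
    (* run the statements in order, threading the environment through *)
    List.foldl (fn (s, envSoFar) => evalStmt (envSoFar, s)) env stmts

(* while ((i == 3) == 0) { i = i + 1; }   starting from i = 0 *)
val loop = WhileStmt (EqExpr (EqExpr (VarExpr "i", IntExpr 3), IntExpr 0),
                      BlockStmt [EvalStmt (AssignExpr ("i", AddExpr (VarExpr "i", IntExpr 1)))])
val finalEnv = evalStmt ([("i", 0)], loop)
```

The loop condition is written as (i == 3) == 0 only because MicroC has no inequality operator; the loop exits once i reaches 3. Note that the while case uses env1, the environment produced by evaluating the condition, even when the loop exits, since the condition itself may contain an assignment.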

Next up are declarations; a declaration takes an old environment and produces a new one, and so its evaluation function has type:

environment * decl -> environment

Declarations so far have only one case (notice that we do not implement type checking in the evaluation function):

fun evalDecl(env, Decl(t, name, exp)) =
    let
        val (v, newEnv) = evalExpr(env, exp)
    in
        updateEnv(newEnv, name, v)
    end

Finally, we can evaluate programs:

fun eval(Program(decls, stmts)) =
    let
        fun evalDeclList(env, nil) = env
          | evalDeclList(env, d::ds) = evalDeclList(evalDecl(env, d), ds)

        fun evalStmtList(env, nil) = env
          | evalStmtList(env, s::ss) = evalStmtList(evalStmt(env, s), ss)

        val declsEnv = evalDeclList(nil, decls)
    in
        evalStmtList(declsEnv, stmts)
    end