Standard ML is a type-safe programming language that embodies many innovative ideas in programming language design. It is a statically-typed language, with a user-extensible type system. It supports polymorphic type inference, which all but eliminates the burden of specifying types of variables and greatly facilitates code re-use. It provides efficient automatic storage management for data structures and functions. It encourages functional (effect-free) programming where appropriate, but allows imperative (effect-ful) programming where necessary (e.g., for handling I/O or implementing mutable data structures). It facilitates programming with recursive data structures (such as trees and lists) by encouraging the definition of functions by pattern matching. It features an extensible exception mechanism for handling error conditions and effecting non-local transfers of control. It provides a richly expressive and flexible module system for structuring large programs, including mechanisms for enforcing abstraction, imposing hierarchical structure, and building generic modules. It is portable across platforms and implementations because it has a precise definition given by a formal operational semantics that defines both the static and dynamic semantics of the language. It provides a portable standard basis library that defines a rich collection of commonly-used types and routines.
These features are supported by all implementations of Standard ML, but many go beyond the standard to provide experimental language features, more extensive libraries, and handy program development tools. Details can be found with the documentation for your compiler, but here's a brief overview of what you might expect. Most implementations provide an interactive system supporting on-line entry and execution of ML programs and providing access to tools for compiling, linking, and analyzing the behavior of programs. A few compilers are "batch only", relying on the ambient operating system to manage the construction of large programs from compiled parts. Nearly every compiler is capable of generating native machine code, even in the interactive system, but some optionally generate byte codes for a portable abstract machine. Most implementations support separate compilation and incremental recompilation based on automatically-generated or manually-constructed component dependency specifications. Some implementations provide interactive tools for tracing and stepping programs; many provide tools for time and space profiling. Most implementations supplement the standard basis library with a rich collection of handy components such as dictionaries, hash tables, or interfaces to the ambient operating system. Some implementations support experimental language extensions, notably mechanisms for concurrent programming (using message-passing or locking), richer forms of modularity constructs, and support for "lazy" data structures.
To develop a feel for the language and how it is used, let us consider a small, but non-trivial, program to implement a regular expression package for checking whether a given string matches a given regular expression. We'll structure the implementation into two modules, an implementation of regular expressions themselves and an implementation of a matching algorithm for them. The structure of the system is neatly expressed using signatures that describe the components of these two modules.
signature REGEXP = sig
datatype regexp =
Zero | One | Char of char |
Plus of regexp * regexp | Times of regexp * regexp |
Star of regexp
exception SyntaxError of string
val parse : string -> regexp
val format : regexp -> string
end
signature MATCHER = sig
structure RegExp : REGEXP
val match : RegExp.regexp -> string -> bool
end
The signature REGEXP
describes a module that implements regular
expressions. It consists of a description of the abstract syntax of regular
expressions, together with operations for parsing and unparsing (formatting) them.
The definition of the abstract syntax takes the form of a datatype declaration
that is reminiscent of a context-free grammar, but which abstracts from matters of lexical
presentation (such as precedences of operators, parenthesization, conventions for naming
the operators, etc.) The abstract syntax consists of 6 clauses,
corresponding to the regular expressions 0, 1, a,
r1 + r2, r1 r2, and r*.
The functions parse
and format
specify the parser and unparser
for regular expressions. The parser takes a string as argument and yields a regular
expression; if the string is ill-formed, the parser raises the exception SyntaxError with
an associated string describing the source of the error. The unparser takes a
regular expression and yields a string that parses to that regular expression. In
general there are many strings that parse to the same regular expressions; the unparser
generally tries to choose one that is easiest to read.
The signature MATCHER
describes a module that implements a matching
algorithm for regular expressions. The matcher is a function match
that
takes a regular expression and yields a function that takes a string and determines
whether or not that string matches that regular expression. Obviously the matcher is
dependent on the implementation of regular expressions. This is expressed by a structure
specification that specifies a hierarchical dependence of an implementation of a
matcher on an implementation of regular expressions --- any implementation of the MATCHER
signature must include an implementation of regular expressions as a constituent module.
This ensures that the matcher is self-contained, and does not rely on implicit
conventions for determining which implementation of regular expressions it
employs.
Now let's look at the high-level structure of an implementation of a regular expression matcher. It consists of two major components: an implementation of regular expressions, and an implementation of the matcher. Implementations of signatures are called structures in ML; the implementation of the regular expression matcher consists of two structures. Since the implementation of the matcher depends on an implementation of regular expressions, but is independent of any particular implementation of regular expressions, we use a parameterized module, or functor, to implement it. Here's the high-level structure we're considering:
structure RegExp :> REGEXP = ...
functor Matcher (structure RegExp : REGEXP) :> MATCHER = ...
structure Matcher :> MATCHER = Matcher (structure RegExp = RegExp)
The structure identifier RegExp
is bound to an implementation of the REGEXP
signature. Conformance with the signature is ensured by the ascription of
the signature REGEXP
to the binding of RegExp
using the
":>" notation. Not only does this check that the implementation (elided
here) conforms with the requirements of the signature REGEXP
, but it also
ensures that subsequent code cannot rely on any properties of the implementation other
than those explicitly specified in the signature. This helps to ensure that modules
are kept separate, facilitating subsequent changes to the code.
The functor identifier Matcher
is bound to a structure that takes an
implementation of REGEXP
as parameter. We may think of Matcher
as a kind of function mapping structures to structures. The result signature of the
functor specifies that the implementation must conform to the requirements of the
signature MATCHER
, and ensures that the only visible properties of any
instance of this functor (obtained by applying it to an implementation of REGEXP
)
are precisely what is specified in that signature. A specific matcher is provided by
applying the functor Matcher
to the stucture RegExp
to obtain an
implementation of MATCHER
.
Once the system is built, we may use it by referring to its components using paths,
or long identifiers. The function Matcher.match
has type Matcher.RegExp.regexp
-> string -> bool
, reflecting the fact that it takes a regular expression as
implemented within the package itself and yields a matching function on strings.
We may build a regular expression by applying the parser, Matcher.RegExp.parse
,
to a string representing a regular expression, then passing this to Matcher.match
.
Here's an example:
val regexp = Matcher.RegExp.parse "((a + %).(b + %))*"
val matches = Matcher.match regexp
matches "aabba" (* matches successfully *)
matches "abac" (* fails to match *)
We use the convention that "@
" stands for the empty regular
expression and "%
" stands for the regular expression accepting only
the null string. Concatentation is indicated by a ".
",
alternation by "+
", and iteration by "*
".
The use of long identifiers can get tedious at times. There are two typical methods for alleviating the burden. One is to introduce a synonym for a long package name. Here's an example:
structure M = Matcher
structure R = M.RegExp
val regexp = R.parse "((a + %).(b + %))*"
val matches = M.match regexp
matches "aabba"
matches "abac"
Another is to "open" the structure, incorporing its bindings into the current environment:
open Matcher Matcher.RegExp
val regexp = parse "((a + %).(b + %))*"
val matches = match regexp
matches "aabba"
matches "abac"
It is advisable to be sparing in the use of open
because it is often hard
to anticipate exactly which bindings are incorporated into the environment by its use.
Now let's look at the internals of these structures. Here's an overview of the implementation of regular expressions:
structure RegExp :> REGEXP = struct
datatype regexp =
Zero | One | Char of char |
Plus of regexp * regexp | Times of regexp * regexp |
Star of regexp
... implementation of the tokenizer ...
fun tokenize s = tokenize_exp (String.explode s)
... implementation of the parser components ...
fun parse s =
let
val (r, s') = parse_exp (tokenize (String.explode s))
in
case s'
of nil => r
| _ => raise SyntaxError "Unexpected input.\n"
end
handle LexicalError => raise SyntaxError "Illegal input.\n"
... implementation of the formatter ...
fun format r =
String.implode (format_exp r)
end
The implementation is bracketed by the keywords struct
and end
.
The type regexp
is implemented precisely as specified by a datatype
declaration. The parser works by "exploding" the string into a list of
characters (making it easier to process them character-by-character), transforming the
character list into a list of "tokens" (abstract symbols representing lexical
atoms), and finally parsing the resulting list of tokens. If there is remaining
input after the parse, or if the tokenizer encountered an illegal token, an appropriate
syntax error is signalled. The formatter works by calling an associated function
that yields a list of characters, then "imploding" that list into a string.
It is interesting to consider in more detail the structure of the parser since it exemplifies well the use of pattern matching to define functions. Let's start with the tokenizer, which we present here in toto:
datatype token =
AtSign | Percent | Literal of char | PlusSign | TimesSign |
Asterisk | LParen | RParen
exception LexicalError
fun tokenize nil = nil
| tokenize (#"+" :: cs) = (PlusSign :: tokenize cs)
| tokenize (#"." :: cs) = (TimesSign :: tokenize cs)
| tokenize (#"*" :: cs) = (Asterisk :: tokenize cs)
| tokenize (#"(" :: cs) = (LParen :: tokenize cs)
| tokenize (#")" :: cs) = (RParen :: tokenize cs)
| tokenize (#"@" :: cs) = (AtSign :: tokenize cs)
| tokenize (#"%" :: cs) = (Percent :: tokenize cs)
| tokenize (#"\\" :: c :: cs) = Literal c :: tokenize cs
| tokenize (#"\\" :: nil) = raise LexicalError
| tokenize (#" " :: cs) = tokenize cs
| tokenize (c :: cs) = Literal c :: tokenize cs
We use a datatype declaration to introduce the type of tokens corresponding to the
symbols of the input language. The function tokenize
has type char
list -> token list
; it transforms a list of characters into a list of tokens.
It is defined by a series of clauses that dispatch on the first character of the
list of characters given as input, yielding a list of tokens. The correspondence
between characters and tokens is relatively straightforward, the only non-trivial case
being to admit the use of a backslash to "quote" a reserved symbol as a
character of input. (More sophisticated languages have more sophisticated token
structures; for example, words (consecutive sequences of letters) are often regarded as a
single token of input.) Notice that it is quite natural to "look ahead" in
the input stream in the case of the backslash character, using a pattern that dispatches
on the first two characters (if there are such) of the input, and proceeding accordingly.
(It is a lexical error to have a backslash at the end of the input.)
Now here's an overview of the parser. It is a simple recursive-descent parser
implementing the standard precedence conventions for regular expressions (iteration binds
most tightly, then concatentation, then alternation). The parser is defined by four
mutually-recursive functions, parse_exp
, parse_term
, parse_factor
,
and parse_atom
. These implement a recursive descent parser that
dispatches on the head of the token list to determine how to proceed. The code is
essentially a direct transcription of the obvious LL(1) grammar for regular expressions
capturing the binding conventions described earlier.
fun parse_exp ts =
let
val (r, ts') = parse_term ts
in
case ts'
of (PlusSign :: ts'') =>
let
val (r', ts''') = parse_exp ts''
in
(Plus (r, r'), ts''')
end
| _ => (r, ts')
end
and parse_term ts = ... (elided) ...
and parse_factor ts =
let
val (r, ts') = parse_atom ts
in
case ts'
of (Asterisk :: ts'') => (Star r, ts'')
| _ => (r, ts')
end
and parse_atom nil = raise SyntaxError ("Atom expected\n")
| parse_atom (AtSign :: ts) = (Zero, ts)
| parse_atom (Percent :: ts) = (One, ts)
| parse_atom ((Literal c) :: ts) = (Char c, ts)
| parse_atom (LParen :: ts) =
let
val (r, ts') = parse_exp ts
in
case ts'
of (RParen :: ts'') => (r, ts'')
| _ => raise SyntaxError ("Right-parenthesis expected\n")
end
Once again it is quite simple to implement "lookahead" using patterns that inspect the token list for specified tokens. This parser makes no attempt to recover from syntax errors, but one could imagine doing so, using standard techniques.
This completes the implementation of regular expressions. Now for the matcher. The main idea is to implement the matcher by a recursive analysis of the given regular expression. The main difficulty is to account for concatenation --- to match a string against the regular expression r1 r2 we must match some initial segment against r1, then match the corresponding final segment against r2. This suggests that we generalize the matcher to one that checks whether some initial segment of a string matches a given regular expression, then passes the remaining final segment to a continuation, a function that determines what to do after the initial segment has been successfully matched. This facilitates implementation of concatentation, but how do we ensure that at the outermost call the entire string has been matched? We achieve this by using an initial continuation that checks whether the final segment is empty. Here's the code, written as a functor parametric in the regular expression structure:
functor Matcher (structure RegExp : REGEXP) :> MATCHER = struct
structure RegExp = RegExp
open RegExp
fun match_is Zero _ k = false
| match_is One cs k = k cs
| match_is (Char c) (d::cs) k = if c=d then k cs else false
| match_is (Times (r1, r2)) cs k =
match_is r1 cs (fn cs' => match_is r2 cs' k)
| match_is (Plus (r1, r2)) cs k =
match_is r1 cs k orelse match_is r2 cs k
| match_is (Star r) cs k =
k cs orelse match_is r cs (fn cs' => match_is (Star r) cs' k)
fun match r s =
match_is r (String.explode s) (fn nil => true | false)
end
Note that we must incorporate the parameter structure into the result structure, in
accordance with the requirements of the signature. The function match
explodes the string into a list of characters (to facilitiate sequential processing of the
input), then calls match_is
with an initial continuation that ensures that
the remaining input is empty to determine the result. The type of match_is
is
RegExp.regexp -> char list -> (char list -> bool) -> bool
.
That is, match_is
takes in succession a regular expression, a list of
characters, and a continuation of type char list -> bool
; it yields as
result a value of type bool
. This is a fairly complicated type, but
notice that nowhere did we have to write this type in the code! The type inference
mechanism of ML took care of determining what that type must be based on an analysis of
the code itself.
Since match_is
takes a function as argument, it is said to be a higher-order
function. The simplicity of the matcher is due in large measure to the ease
with which we can manipulate functions in ML. Notice that we create a new, unnamed
functions, to pass as a continuation in the case of concatenation --- it is the function
that matches the second part of the regular expression to the characters remaining after
matching an initial segment against the first part. We use a similar technique to
implement matching against an iterated regular expression --- we attempt to match the null
string, but if this fails, we match against the regular expression being iterated followed
by the iteration once again. This neatly captures the "zero or more times"
interpretation of iteration of a regular expression.
(Important aside: the code given above contains a subtle error. Can you find it? If not, see the chapter on proof-directed debugging for further discussion!)
This completes our brief overview of Standard ML. The remainder of these notes are structured into three parts. The first part is a detailed introduction to the core language, the language in which we write programs in ML. The second part is concerned with the module language, the means by which we structure large programs in ML. The third is about programming techniques, ideas for building reliable and robust programs. I hope you enjoy it!