ML is clean and powerful, and has many traits that language designers consider hallmarks of a good high-level language:
ML is the "exemplary" statically typed, strict functional programming language.
Fig. 1 gives an abbreviated family DAG of the ML family, and a few related languages. Dotted lines indicate significant omitted nodes. The rounded box indicates those variants of the ML family that most people would call ML. A brief history of ML:
Miranda and Haskell are statically typed lazy (as opposed to strict) functional languages, with many similarities to ML (including ML-like polymorphic type systems).
EML and Cyclone are research languages devised by people who are currently at UW, or have been in the past. (We will discuss these languages towards the end of the quarter.) They are marked as descending from the entire ML family because the distinctions between, e.g., SML and O'Caml are not important w.r.t. the way ML influenced these languages.
The general ideas of ML have been highly influential in the research community; if we enumerated all the ML dialects or relatives that researchers have devised over the years, the dotted line to the lower right of the figure would probably have hundreds of descendants.
For this class, we'll use the SML/NJ implementation of ML97. Like most ML implementations, SML/NJ provides a read-eval-print loop ("repl"), so named because the interpreter repeatedly reads an expression or declaration, evaluates it, and prints the result.
The primary advantage of programming in a repl is immediate feedback. The read-eval-print cycle is much faster than the edit-compile-run cycle in a typical compiled programming environment. You can quickly and easily experiment with different snippets of code. If a function doesn't work, you can try out a different version in a second or two, and re-run your program. This makes interactive repls ideal for "exploratory" programming.
(Often in the course of my teaching, a student has presented me with a code snippet and asked: "What happens if I write X? Or XY? Or, how about XYZ?" Of course, the best way to find out is simply to write X, Y, and Z, and then run the various combinations. But in a compiled environment, you have to create a new file, and repeatedly compile each different version of the program. In a repl, it's easy to quickly experiment interactively with all these variations.)
Of course, typing long chunks of code repeatedly can be tedious, so the repl allows you to load source from a file with the use function, which takes a string filename and loads the contents of the named file as though it were typed into the interpreter directly. (You can also use Unix pipes, or programming environments like Emacs sml-mode, to send code into the interpreter.)
All programming languages allow users to manipulate data, and all useful languages provide two kinds of data:
We'll start with atomic data. Here's the result of entering some expressions that evaluate to atomic data into the SML/NJ read-eval-print loop:
    $ sml
    Standard ML of New Jersey, Version 110.0.7 ...
    - 3;
    val it = 3 : int
    - 3.0;
    val it = 3.0 : real
    - #"3";
    val it = #"3" : char
    - "3";
    val it = "3" : string
    - true;
    val it = true : bool
The dash is the SML/NJ prompt indicating that it's waiting for you to type in an expression or a declaration. When you type in an expression followed by a semicolon, SML/NJ parses the expression, then evaluates it to a value. Then it prints that value, along with its type. The above values are of type int, real, char, string, and bool respectively.
There are many operators defined over atomic types, including most of the ones you'd expect. See Ullman sections 2.1-2.2 and ch. 9.1 for information about these. Minor surprises:
* ~ is the arithmetic negation operator, and it is distinct from the subtraction operator.
* Boolean "and" and "or" are written andalso and orelse respectively.

Technical note: Values are expressions that are "done evaluating". Therefore, 3 is a value, whereas 3 + 4 is not a value, because this expression can evaluate one more step, to 7.
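As a small sketch of the operators mentioned above (the names negFive, diff, and so on are made up for illustration, and the int/real lines are an extra detail not covered in the text above), the following can be pasted into the repl:

```sml
(* ~ negates a number; - subtracts one number from another. *)
val negFive = ~5
val diff = 3 - 5                  (* evaluates to ~2 *)

(* andalso / orelse are the boolean "and" / "or". *)
val both = true andalso false     (* false *)
val either = true orelse false    (* true *)

(* int and real are separate types: div works on ints, / on reals,
   and the built-in function real converts an int to a real. *)
val quotient = 7 div 2            (* 3 *)
val half = real 7 / 2.0           (* 3.5 *)
```

Note that writing -5 instead of ~5 is rejected by SML, because - is only the binary subtraction operator.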
Technical note 2: ML also has an odd atomic type called unit. unit has only one value, which is written () (empty parens):

    - ();
    val it = () : unit

unit plays a role similar to (but not identical to) that of void in other languages --- for example, functions that don't have a meaningful return value will have return type unit. The difference is that () is a real value --- one that can be bound to names, passed to functions, etc., just like any other value. We'll discuss the relative merits of unit vs. void more when we discuss functions.
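To make the "real value" point concrete, here is a minimal sketch (the names u, result, and same are illustrative only). It relies on the standard print function, which has type string -> unit:

```sml
(* () can be bound to a name like any other value. *)
val u = ()

(* print returns unit, so its result can be bound too. *)
val result = print "hello\n"

(* unit has exactly one value, so these two are equal. *)
val same = (u = result)           (* true *)
```

In a language with void, there would be nothing to bind in the second declaration.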
val bindings

But what is this val it = 3 business? In order to explain this, we must first examine bindings, which resemble what other languages call "variables". Bindings are declarations; the val declaration binds a value to a name. A bound name can then be used later to refer to the value that was bound to it:
    - val x = 3;
    val x = 3 : int
    - x;
    val it = 3 : int
    - x + 3;
    val it = 6 : int

Aha, now we can guess what it is...

    - it;
    val it = 6 : int

When you do not bind an expression to a name at the top-level interpreter prompt, it gets bound to the name it by default. This is not a feature of ML per se; it's just a helpful feature of the SML/NJ repl. If you want to prevent this, you can bind the value to the wildcard, _ (single underscore):

    - val _ = 4;
    - it;
    val it = 6 : int
Notice that the interpreter does not print the val it = ... after the wildcard binding, and that it is unchanged afterwards. The wildcard _ is not a variable name; it's a placeholder that means, "evaluate this as if you were binding it to a name, but instead throw it away". We'll revisit wildcards in more depth when we discuss pattern matching.
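A short sketch of the wildcard in a source file rather than at the prompt (answer and unchanged are made-up names): the right-hand side is still evaluated, so any side effect still happens, but no name is bound.

```sml
val answer = 42

(* Evaluated, then thrown away: no new name is introduced. *)
val _ = answer + 1

(* A common use: run an expression only for its side effect. *)
val _ = print "this line is still printed\n"

(* Earlier bindings are untouched. *)
val unchanged = answer            (* 42 *)
```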
Name bindings resemble variable declarations in a language like C or Java, with several important differences:
* Names are bound by reference; that is, the declaration val x = 3; refers to a value 3 in the heap. Another way of saying this is that all values are (conceptually) always accessed by a pointer.
* Bindings are immutable. (They resemble final variables in Java.)

But wait --- the last bullet may appear to be a lie, because look:
    - val y = 5;
    val y = 5 : int
    - val y = 6;
    val y = 6 : int
What's going on? Is the y binding getting modified? Well, actually, no --- the second declaration is shadowing the earlier declaration.
Bindings in ML live in environments, and the "top-level" environment can be visualized conceptually as an ever-growing stack of bindings. Fig. 2 shows a diagram of the top-level environment resulting from the interactive ML session so far. There are several interesting things to note about this picture.
First, the older y and it bindings are shadowed by later bindings: names in a given scope always refer to the most recent binding with a matching name; this binding hides any earlier bindings with the same name.
This may seem like it doesn't matter, but only because we've so far only been dealing with the top-level environment. The top-level environment corresponds, roughly, to the "global" scope in C-like languages. Bindings at top-level are available anywhere that they are not shadowed by some other binding. We'll discuss other environments shortly.
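One way to observe that shadowing creates a new binding instead of modifying the old one is the sketch below. It uses a function declaration (fun), which has not been introduced yet, and the names currentY, seenByFunction, and seenAtTopLevel are hypothetical:

```sml
val y = 5

(* This function refers to the binding of y that is visible here. *)
fun currentY () = y

(* A new binding that shadows the old one; the old one still exists. *)
val y = 6

val seenByFunction = currentY ()  (* 5: the function still sees the old binding *)
val seenAtTopLevel = y            (* 6: later code sees the new binding *)
```

If the second declaration had modified the original binding, both results would be 6.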
Second, x and the shadowed it share a pointer to the same 3 value. When a binding is assigned a value, conceptually the pointer to that value is copied to the new binding. All values in ML are implicitly by-reference.
Third, this picture only shows the logical picture of data in memory. The implementation may optimize how it represents values in various ways, provided the behavior is indistinguishable from the behavior in this picture. For example, it can discard unused or shadowed bindings, if it can prove that those bindings can never be accessed again. It may also have a special, more efficient representation for pointers-to-integers --- such as the integers themselves. (It is a useful thought exercise to consider why this representation optimization is safe. Remember that most ML values, including integers, are immutable.)
OK, making a new val doesn't modify bindings; what about assignment? Suppose a Java programmer forgets for a moment that this is ML, and tries to assign a different value to y using =:

    - y = 10;
    val it = false : bool
    - y;
    val it = 6 : int

What's going on? Well, for one thing, = does not mean assignment in ML. Actually, you cannot perform assignment on ML bindings at all --- as previously noted, they are immutable. The expression y = 10 is a comparison, which evaluates to the boolean value false. This is why SML/NJ prints val it = false, and why y is unchanged.
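The same point as a small self-contained sketch (isTen, notTen, and stillSix are illustrative names): = and its negation <> are comparison operators that produce a bool, and neither changes the binding.

```sml
val y = 6

val isTen = (y = 10)              (* false: a comparison, not an assignment *)
val notTen = (y <> 10)            (* true: <> is "not equal" *)
val stillSix = y                  (* 6: y was never modified *)
```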
Computation in ML, as in all functional languages, proceeds primarily by evaluating expressions. Assignment and other "side effects" of evaluation play a much smaller role in functional languages than in imperative languages. Code without side effects is said to be purely functional, or simply pure.
Most of the code we write in this class will be pure. One of the important lessons of functional programming is that side effects are rarely necessary. In fact, some languages, such as Haskell, are completely pure (side-effect free). Functional programming advocates claim that code that extensively employs side effects tends to be confusing and harder to reason about (both automatically and manually) than pure code. When you see a function call f(x), and you know that f is a pure function, then you don't have to worry about "hidden" consequences --- the only thing the call does is produce its return value. If f has side effects, then you must remember what those side effects are, and in what order they happen relative to other side effects, etc.

You can apply this lesson even in non-functional languages: for example, in Java, make as many fields and variables final as you can.
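To illustrate the difference between a pure and an impure function, here is a sketch with two hypothetical functions, double and noisyDouble. Function declarations are covered later; the only point here is that the two calls return the same value but only one of them does anything else.

```sml
(* Pure: the only thing a call does is produce its return value. *)
fun double x = 2 * x

(* Impure: each call also prints a line, which is a side effect. *)
fun noisyDouble x = (print "doubling\n"; 2 * x)

val a = double 21                 (* 42, and nothing else happens *)
val b = noisyDouble 21            (* 42, but "doubling" is printed first *)
```

A call to double can be replaced by its result without changing what the program does; a call to noisyDouble cannot, because the printed output would disappear.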
If you're used to languages like Java, ML's val declarations should look slightly odd to you. In Java, you might write:

    int a = 5;
    float b = 5.0f;
    char c = '5';
    String d = "5";

Notice that the syntax of declarations requires that the programmer always explicitly specify the type. ML's syntax doesn't require this, because ML has a type inference system. Generally, ML will determine the types of names and values based on how you use them. You only need to declare the types of names explicitly in certain cases when the type inference algorithm doesn't have enough information to do it automatically. To write down a value's type explicitly is to ascribe the type to the value; in ML, the syntax for ascription is expr:type or name:type, e.g.:

    - 5:int;
    val it = 5 : int
    - val x:int = 5;
    val x = 5 : int
    - val x = 5:int;
    val x = 5 : int
Notice that you may ascribe the type after either the name or the initializing expression. Actually, type ascriptions can syntactically appear after (nearly) any value or declared name. ML's type inference algorithm "propagates" the ascribed type to other positions in the code that must have the same type.
For simple values like the ones we've seen so far, ascription is never necessary, but we will eventually see examples where types must be explicitly ascribed (i.e., written down).
(Side note: In some cases, ML programmers ascribe types even where it's not necessary --- either for documentation, or to give a value a "more specific" type than the inference algorithm will infer by itself.)
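As a preview, here is one sketch of an ascription that changes the inferred type. It uses function declarations, which come later, and the names squareInt, squareReal, and noInts are invented for this example. In SML the overloaded operator * defaults to int when nothing else constrains it, so an ascription is one way to get the real version:

```sml
(* No ascription: the overloaded * defaults to int, giving int -> int. *)
fun squareInt x = x * x

(* Ascribing the argument makes the function real -> real. *)
fun squareReal (x : real) = x * x

val nine = squareInt 3            (* 9 *)
val nineReal = squareReal 3.0     (* 9.0 *)

(* Ascription can also give a value a more specific type than the
   inferred one: without it, this would have type 'a list. *)
val noInts : int list = []
```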
What if the programmer ascribes an incorrect type?
    - val z:char = 5;
    stdIn:1.1-40.4 Error: pattern and expression in val dec don't agree [literal]
      pattern:    char
      expression: int
      in declaration:
        z : char = 5
Short answer: if the ascriptions cause the inference algorithm to assign an invalid type to an expression, then a type error results. We'll discuss this in more detail when we cover type inference and polymorphism.
ML has several families of built-in data types; these include:
You should be familiar with these fundamental types from Java, but in ML all these built-in types are immutable. If you want to "alter" one of these compound values, you must create a new value that copies all the components except the field or position you want to change; that field/position should contain the updated value.
ML has special syntactic support for constructing and manipulating its built-in types. This is one of the reasons ML code is much more compact than C or Java code. Each family of built-in types has a constructor syntax that constructs a value of appropriate type from that family. (In ML, a constructor for a type t is a function that takes zero or more arguments and constructs a fresh value of t.)
Records resemble structs in C, or method-less objects in Java; they are constructed by writing a list of one or more field assignments name = value in between two curly braces {}. Here are some examples:

    - val foo = {x = 3};
    val foo = {x=3} : {x:int}
    - val bar = {x = 3, y = true};
    val bar = {x=3,y=true} : {x:int, y:bool}
    - val baz = {x = "hi", y = foo};
    val baz = {x="hi",y={x=3}} : {x:string, y:{x:int}}
    - val boo = {foo = #"h", bar = "i", baz = 123.0};
    val boo = {bar="i",baz=123.0,foo=#"h"} : {bar:string, baz:real, foo:char}

The type of a record (e.g., {x:int}) is written as a comma-separated list of one or more field declarations name:type in between curly braces. In general, the syntax of types in ML closely mirrors the syntax for constructing values of those types.
Record types are equivalent if they have exactly the same field names and types. A record of one type cannot be assigned to a record of a different type:
    - val aPoint:{x:int, y:int} = {x = 1.0, y = 2.2};
    stdIn:1.1-50.20 Error: pattern and expression in val dec don't agree [tycon mismatch]
      pattern:    {x:int, y:int}
      expression: {x:real, y:real}
      in declaration:
        aPoint : {x:int, y:int} = {x=1.0,y=2.2}
    - val simpleRecord:{x:int} = {x = 1, y = 2};
    stdIn:55.1-55.42 Error: pattern and expression in val dec don't agree [tycon mismatch]
      pattern:    {x:int}
      expression: {x:int, y:int}
      in declaration:
        simpleRecord : {x:int} = {x=1,y=2}
Notice that, unlike objects in a language like Java, a record value cannot be "implicitly promoted" to a record with fewer fields. In other words, ML does not have subtype polymorphism.
Fields of a record value are accessed using the special function #fieldName applied to recordValue:

    - val r = {x=1, y=2};
    val r = {x=1,y=2} : {x:int, y:int}
    - #x(r);
    val it = 1 : int
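Since records are immutable, "changing" a field means building a new record that copies the other fields. A minimal sketch, using the made-up names p and moved:

```sml
val p = {x = 1, y = 2}

(* Build a new record: x gets the updated value, y is copied from p. *)
val moved = {x = 10, y = #y p}

val oldX = #x p                   (* 1: the original record is untouched *)
val newX = #x moved               (* 10 *)
val sameY = (#y p = #y moved)     (* true *)
```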
Side note: What happens if you put zero fields in a record?
    - {};
    val it = () : unit

Oops. That doesn't look like a record type --- that's unit. In my opinion, this is a bug in ML. However, see below on the empty tuple.
Tuples work a lot like records, except that the fields have an explicit order; and instead of using field names, you use positions to access the members.
Tuples are constructed simply by enclosing a comma-separated list of two or more values in round parentheses ():

    - (1, 2);
    val it = (1,2) : int * int
    - ("foo", 25, #"b", false);
    val it = ("foo",25,#"b",false) : string * int * char * bool

As you can see, tuple types are written as a *-separated sequence of types: type1 * type2 * ... * typeN.

The Kth element of an N-tuple can be accessed by the special accessor function #K, as follows:

    - val x = (54, "hello");
    val x = (54,"hello") : int * string
    - val firstX = #1(x);
    val firstX = 54 : int
    - val secondX = #2(x);
    val secondX = "hello" : string
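The accessors can be combined with the tuple constructor to build new tuples from old ones. The sketch below (pair, swapped, and sameValue are illustrative names) also shows an extra detail that goes beyond the text above: in Standard ML a tuple is shorthand for a record whose field names are 1, 2, and so on, which is why the #K accessors look like record field accessors.

```sml
val pair = (54, "hello")

(* Build a new tuple with the components in the opposite order. *)
val swapped = (#2 pair, #1 pair)  (* ("hello", 54) : string * int *)

(* A record with numeric field names is the same value as the tuple. *)
val sameValue = ({1 = 54, 2 = "hello"} = pair)   (* true *)
```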
Side note: What happens if you put one element in parens? Zero?
    - (1);
    val it = 1 : int
    - ();
    val it = () : unit

In my opinion, unlike the empty record case, these make sense. As in other languages, parentheses group terms that should be evaluated before other terms. Rather than constructing a 1-tuple, which is useless, (expr) evaluates expr before any surrounding expressions and returns it. Also, viewing unit as a "zero-tuple" makes more sense to me than viewing empty records as unit, though I can't justify this opinion with anything other than my arbitrary taste.
Linked lists are the bread and butter of functional programming. (Perhaps recursive, higher-order functions are the knife and fingers.) ML lists are homogeneous; that is, all elements must have the same type. The type of a list of elements of type t is written "t list", e.g. int list or string list. For any type t, a t list has two constructors:

* nil, the empty list (also written [])
* :: (pronounced "cons", terminology borrowed from Lisp), which is an infix operator that constructs a single list cell from its left and right arguments. The left argument must be of some type t, and the right argument must be of some type t list.

Intuitively, this should be familiar; in a Java-like language, a node in a singly linked list whose elements have type T would usually be defined as follows:

    class TListNode {
        T value;
        TListNode next;
    }
Lists may also be constructed from a comma-separated list of values inside square brackets []. This is syntactic sugar for a sequence of conses; and, in fact, when you type a list of conses at the repl, SML/NJ will answer using this sugared syntax.

    - val x = 1::nil;
    val x = [1] : int list
    - val y = 1::2::3::nil;
    val y = [1,2,3] : int list
    - val z = 4::x;
    val z = [4,1] : int list
A picture of the data structures in memory that result from the above three declarations is shown in Fig. 3.
Note the following:
* Every list ends in nil.
* The declaration of z is well-typed because the 4 is an int, and x is an int list.
* z uses the list value bound to x directly as its "tail". This is safe because lists are immutable.

The first element of a list can be obtained using the function hd ("head"), and the rest of a list can be obtained using tl ("tail"). Note that, in functional programming terminology, the tail is the entire rest of the list after the head, not the last element (think tadpoles, not dogs). Calling hd or tl on an empty list results in a runtime error (exception).

    - hd([1,2,3]);
    val it = 1 : int
    - hd(tl([1,2,3]));
    val it = 2 : int
    - hd(tl(1::nil));
    uncaught exception Empty
      raised at: boot/list.sml:36.38-36.43
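One way to avoid the Empty exception is to test for the empty list first with the standard function null. The sketch below assumes a hypothetical helper secondOr (not part of any library) that returns a default value when the list has fewer than two elements; it uses a function declaration and a conditional expression, both covered later.

```sml
val xs = [1, 2, 3]

val first = hd xs                 (* 1 *)
val rest = tl xs                  (* [2,3] *)

(* Consing the head back onto the tail rebuilds an equal list. *)
val roundTrip = (first :: rest = xs)   (* true *)

(* Hypothetical helper: the second element, or default if there is none. *)
fun secondOr (default, l) =
  if null l orelse null (tl l) then default else hd (tl l)

val second = secondOr (0, xs)     (* 2 *)
val fallback = secondOr (0, [1])  (* 0, with no exception raised *)
```

Because orelse only evaluates its right-hand side when the left-hand side is false, tl l is never applied to an empty list here.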
Q: What is the type of a bare nil?

    - nil;
    val it = [] : 'a list

What is this 'a business? In ML, a type whose name begins with a single quote character is a type variable, which means, roughly, "any type can be substituted here". Types with type variables are called polymorphic types. nil is actually a polymorphic value, i.e. it has polymorphic type; this must be so, because lists of all types share nil as the terminating value.
The polymorphism in ML's type system is actually one of its best features. We will describe this in more detail as the quarter goes on; for now, we'll work mostly with lists with some concrete element type.
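A quick sketch of what "any type can be substituted here" means in practice (ints, strings, and pairs are made-up names): the same nil terminates lists of three different element types.

```sml
(* In each declaration, 'a is instantiated to a different type. *)
val ints = 1 :: nil               (* int list *)
val strings = "a" :: nil          (* string list *)
val pairs = (1, true) :: nil      (* (int * bool) list *)
```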
As depicted in the figures in the previous section, all ML values are accessed by reference, a.k.a. by pointer. When a value is bound to a name or stored in another data structure, the pointer to that value is copied to the appropriate location, not the value itself.
Uniformly accessing variables by reference greatly simplifies program understanding. In languages where values can be "inline" rather than by-reference, there are complex and confusing rules for how and when values are implicitly copied, and what happens when these implicit copies occur.
(If you're familiar with C++, consider the uses of copy constructors, or what happens when you copy a value of type T to a stack-allocated value belonging to one of T's superclasses.)
All ML's data values are first-class citizens, meaning that all values have "equal rights": they can all be passed to functions, returned from functions, bound to names, stored as elements of other values, etc.
One consequence is that in ML, as in most reasonable languages, compound types can be nested arbitrarily. You can have lists of tuples, tuples of lists, or records of lists of tuples of records of tuples, etc., because a compound type can be used anywhere an atomic type can be used. This is an example of ML's high degree of orthogonality:
    - val a = [{x=1,y=2},{x=3,y=4}];
    val a = [{x=1,y=2},{x=3,y=4}] : {x:int, y:int} list
    - val b = ("hello", [#"w", #"o", #"r", #"l", #"d"], #"!");
    val b = ("hello",[#"w",#"o",#"r",#"l",#"d"],#"!") : string * char list * char
    - val c = {name=("Keunwoo", "Lee"),
    = classes=["341","590dg","590l"],
    = age=26};
    val c = {age=26,classes=["341","590dg","590l"],name=("Keunwoo","Lee")}
      : {age:int, classes:string list, name:string * string}
Exercise: try writing code in Java, or your favorite other programming language, that constructs objects that are roughly equivalent to the above three values. How many lines does it take?
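Nested values are taken apart by composing the same accessors used for the individual types. The sketch below repeats the declarations of a and c from above so that it stands alone; firstX, lastName, and secondClass are illustrative names.

```sml
val a = [{x=1,y=2},{x=3,y=4}]
val c = {name=("Keunwoo", "Lee"),
         classes=["341","590dg","590l"],
         age=26}

(* List accessor, then record accessor. *)
val firstX = #x (hd a)                    (* 1 *)

(* Record accessor, then tuple accessor. *)
val lastName = #2 (#name c)               (* "Lee" *)

(* Record accessor, then list accessors. *)
val secondClass = hd (tl (#classes c))    (* "590dg" *)
```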
In the above, we alluded to the fact that the top-level environment was not the only environment. Let expressions are one way to introduce local environments, which produce names that are visible only in a local scope.
Let expressions have the form let decls in expr end, where decls is a semicolon-separated sequence of declarations and expr is some expression that may optionally use the names bound in decls. Names bound in a let-expression are only visible to later bindings in the same let-expression, and inside the body expression. Outside the scope of the let-expression, the bindings are no longer visible. For example:

    - let val x = 5 in x + x end;
    val it = 10 : int
    - let
    =   val localA = "hello";
    =   val localB = "+++++++";
    =   val localB = ", ";
    =   val localC = localB ^ "world"
    = in
    =   localA ^ localC (* XXX *)
    = end;
    val it = "hello, world" : string
    - localA;
    stdIn:88.1-88.9 Error: unbound variable or constructor: localA
    - let
    =   val earlierBinding = laterBinding + 1;
    =   val laterBinding = 5
    = in
    =   earlierBinding + laterBinding
    = end;
    stdIn:120.24-120.36 Error: unbound variable or constructor: laterBinding
Order of bindings matters:
These are really the same rules that apply in the top-level environment. All environments in ML work the same way. This is an example of ML's high degree of regularity: there are no special rules for top-level versus local environments.
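A small sketch of how a local environment interacts with the top-level one (result and outer are made-up names): a let-bound x shadows the top-level x only inside the let expression.

```sml
val x = 1

val result =
  let
    val x = 2                     (* shadows the top-level x, locally *)
    val y = x + 1                 (* sees the local x, so y = 3 *)
  in
    x + y                         (* 2 + 3 *)
  end

val outer = x                     (* 1: the top-level x was never changed *)
```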
A diagram of the local environment at the point marked XXX is given in Fig. 4.
All expressions are first-class, and let expressions are expressions. Therefore, let expressions can be nested, and more generally may appear anywhere other expressions may appear:
    - val longLetExpr =
    =   let
    =     val aString = let val x = "hi, "; val y = "there" in x ^ y end;
    =     val anInt = 17
    =   in
    =     (anInt, let val period = "." in aString ^ period end)
    =   end;
    val longLetExpr = (17,"hi, there.") : int * string