CSE341 Lecture Notes 2: Introduction to ML

Why ML?

ML is clean and powerful, and has many traits that language designers consider hallmarks of a good high-level language:

ML is the "exemplary" statically typed, strict functional programming language.

History of ML

Fig. 1 gives an abbreviated family DAG of the ML family, and a few related languages. Dotted lines indicate significant omitted nodes. The rounded box indicates those variants of the ML family that most people would call ML. A brief history of ML:

Miranda and Haskell are statically typed lazy (as opposed to strict) functional languages, with many similarities to ML (including ML-like polymorphic type systems).

EML and Cyclone are research languages devised by people who are currently at UW, or have been in the past. (We will discuss these languages towards the end of the quarter.) They are marked as descending from the entire ML family because the distinctions between, e.g., SML and O'Caml are not important w.r.t. the way ML influenced these languages.

The general ideas of ML have been highly influential in the research community; if we enumerated all the ML dialects or relatives that researchers have devised over the years, the dotted line to the lower right of the figure would probably have hundreds of descendants.

The interactive ML interpreter

For this class, we'll use the SML/NJ implementation of ML97. Like most ML implementations, SML/NJ provides a read-eval-print loop ("repl"), so named because the interpreter repeatedly reads an expression or declaration, evaluates it, and prints the result.

The primary advantage of programming in a repl is immediate feedback. The read-eval-print cycle is much faster than the edit-compile-run cycle in a typical compiled programming environment. You can quickly and easily experiment with different snippets of code. If a function doesn't work, you can try out a different version in a second or two, and re-run your program. This makes interactive repls ideal for "exploratory" programming.

(Often in the course of my teaching, a student has presented me with a code snippet and asked: "What happens if I write X? Or XY? Or, how about XYZ?" Of course, the best way to find out is simply to write X, Y, and Z, and then run the various combinations. But in a compiled environment, you have to create a new file, and repeatedly compile each different version of the program. In a repl, it's easy to quickly experiment interactively with all these variations.)

Of course, typing long chunks of code repeatedly can be tedious, so the repl allows you to load source from a file with the use function, which takes a string filename and loads the contents of the named file as though it were typed into the interpreter directly. (You can also use Unix pipes, or programming environments like Emacs sml-mode, to send code into the interpreter.)

Expressions, values, and bindings

All programming languages allow users to manipulate data, and all useful languages provide two kinds of data: atomic data, which has no smaller parts, and compound data, which is built out of other data.

We'll start with atomic data. Here's the result of entering some expressions that evaluate to atomic data into the SML/NJ read-eval-print loop:

The dash is the SML/NJ prompt indicating that it's waiting for you to type in an expression or a declaration. When you type in an expression followed by a semicolon, SML/NJ parses the expression, then evaluates it to a value. Then it prints that value, along with its type. The above values are of type bool, int, real, char, and string respectively.
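
As an illustration, entering one expression of each atomic type looks roughly like the following. The particular literals are chosen for this sketch and are not taken from the original session; the comment after each line gives the type that SML/NJ reports.

```sml
(* One expression of each atomic type; each result is bound to `it` in turn. *)
true;        (* bool *)
42;          (* int *)
3.14;        (* real *)
#"a";        (* char *)
"hello";     (* string *)
```

These lines can be typed at the prompt one at a time, or saved in a file and loaded with use.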

There are many operators defined over atomic types, including most of the ones you'd expect. See Ullman sections 2.1-2.2 and ch. 9.1 for information about these. Minor surprises:
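
A few operator behaviors that tend to surprise programmers coming from C or Java are sketched below. The selection is illustrative rather than exhaustive, and the binding names are hypothetical.

```sml
val negative = ~5;                    (* unary negation is written ~, not - *)
val quotient = 7 div 2;               (* integer division uses div: 3 *)
val remainder = 7 mod 2;              (* 1 *)
val realQuotient = 7.0 / 2.0;         (* / is defined only on reals: 3.5 *)
val converted = real 7 / 2.0;         (* no implicit int-to-real conversion; use real *)
val greeting = "hello" ^ ", world";   (* string concatenation is ^ *)
```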

Technical note: Values are expressions that are "done evaluating". Therefore, 3 is a value, whereas 3 + 4 is not a value, because this expression can evaluate one more step, to 7.

Technical note 2: ML also has an odd atomic type called unit. unit has only one value, which is written () (empty parens):

unit plays a role similar to (but not identical to) that of void in other languages --- for example, functions that don't have a meaningful return value will have return type unit. The difference is that () is a real value --- one that can be bound to names, passed to functions, etc., just like any other value. We'll discuss the relative merits of unit vs. void more when we discuss functions.
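
A minimal sketch of unit being treated as an ordinary value. The names u, printed, and same are hypothetical; print is the standard top-level function of type string -> unit.

```sml
val u : unit = ();               (* the single value of type unit, bound to a name *)
val printed = print "hello\n";   (* print returns (), so printed : unit *)
val same = (u = printed);        (* true: unit has only one value *)
```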

val bindings

But what is this val it = 3 business? In order to explain this, we must first examine bindings, which resemble what other languages call "variables". Bindings are declarations; the val declaration binds a value to a name. A bound name can then be used later to refer to the value that was bound to it:
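
For instance, a short sequence of val declarations might look like this (the names are chosen for illustration only):

```sml
val x = 3;                                        (* binds the name x to the value 3 *)
val y = x + 4;                                    (* x stands for 3, so y is bound to 7 *)
val message = "x plus 4 is " ^ Int.toString y;    (* later bindings may use earlier ones *)
```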

When you do not bind an expression to a name at the top-level interpreter prompt, it gets bound to the name it by default. This is not a feature of ML per se; it's just a helpful feature of the SML/NJ repl. If you want to prevent this, you can bind the value to the wildcard, _ (single underscore):

Notice that the interpreter does not print the val it = ... after the wildcard binding, and that it is unchanged afterwards. The wildcard _ is not a variable name; it's a placeholder that means, "evaluate this as if you were binding it to a name, but instead throw it away". We'll revisit wildcards in more depth when we discuss pattern matching.
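
The behavior of it and the wildcard can be sketched as follows, assuming the code is entered at the SML/NJ prompt or loaded with use. The name saved is hypothetical.

```sml
3 + 4;              (* an unnamed expression: its value, 7, is bound to it *)
val _ = 10 + 10;    (* evaluated and then discarded; it is not rebound *)
val saved = it;     (* it still refers to 7 *)
```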

Name bindings resemble variable declarations in a language like C or Java, with several important differences:

What's going on? Is the y binding getting modified? Well, actually, no --- the second declaration is shadowing the earlier declaration.
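
A small sketch of shadowing, using hypothetical bindings. The key observation is that z, which was computed from the first y, does not change when a second y is declared.

```sml
val y = 5;
val z = y + 1;       (* uses the first y, so z is 6 *)
val y = 10;          (* a new binding that shadows the first y *)
val total = y + z;   (* the newest y (10) plus the unchanged z (6): 16 *)
```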

Bindings in ML live in environments, and the "top-level" environment can be visualized conceptually as an ever-growing stack of bindings. Fig. 2 shows a diagram of the top-level environment resulting from the interactive ML session so far. There are several interesting things to note about this picture.

First, the second y and the second it binding are shadowed by later bindings: names in a given scope always refer to the most recent binding with a matching name; this binding hides any earlier bindings with the same name.

This may seem like it doesn't matter, but only because we've so far only been dealing with the top-level environment. The top-level environment corresponds, roughly, to the "global" scope in C-like languages. Bindings at top-level are available anywhere that they are not shadowed by some other binding. We'll discuss other environments shortly.

Second, x and the shadowed it share a pointer to the same 3 value. When a binding is assigned a value, conceptually the pointer to that value is copied to the new binding. All values in ML are implicitly by-reference.

Third, this picture only shows the logical picture of data in memory. The implementation may optimize how it represents values in various ways, provided the behavior is indistinguishable from the behavior in this picture. For example, it can discard unused or shadowed bindings, if it can prove that those bindings can never be accessed again. It may also have a special, more efficient representation for pointers-to-integers --- such as the integers themselves. (It is a useful thought exercise to consider why this representation optimization is safe. Remember that most ML values, including integers, are immutable.)

Aside: What about assignment?

OK, making a new val doesn't modify bindings; what about assignment? Suppose a Java programmer forgets for a moment that this is ML, and tries to assign a different value to y using =:

What's going on? Well, for one thing, = does not mean assignment in ML. Actually, you cannot perform assignment on ML bindings at all --- as previously noted, they are immutable. The expression y = 10 is a comparison, which evaluates to the boolean value false. This is why SML/NJ prints val it = false, and why y is unchanged.
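
The same point as a loadable sketch; the initial value 5 for y is an assumption made for this example.

```sml
val y = 5;
val comparison = (y = 10);   (* = tests equality, so this is false *)
val unchanged = y;           (* y is still bound to 5 *)
```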

Computation in ML, as in all functional languages, proceeds primarily by evaluating expressions. Assignment and other "side effects" of evaluation play a much smaller role in functional languages than in imperative languages. Code without side effects is said to be purely functional, or simply pure.

Most of the code we write in this class will be pure. One of the important lessons of functional programming is that side effects are rarely necessary. In fact, some languages, such as Haskell, are completely pure (side-effect free). Functional programming advocates claim that code that extensively employs side effects tends to be confusing and harder to reason about (both automatically and manually) than pure code. When you see a function call f(x), and you know that f is a pure function, then you don't have to worry about "hidden" consequences --- the only thing the call does is produce its return value. If f has side effects, then you must remember what those side effects are, and what order they happen relative to other side effects, etc.

You can apply this lesson even in non-functional languages: for example, in Java, make as many fields and variables final as you can.

A few words on type inference

If you're used to languages like Java, ML's val declarations should look slightly odd to you. In Java, you might write a declaration like int x = 5; where the type comes first.

Notice that the syntax of declarations requires that the programmer always explicitly specify the type. ML's syntax doesn't require this, because ML has a type inference system. Generally, ML will determine the types of names and values based on how you use them. You only need to declare the types of names explicitly in certain cases when the type inference algorithm doesn't have enough information to do it automatically. To write down a value's type explicitly is to ascribe the type to the value; in ML, the syntax for ascription is expr:type or name:type, e.g.:

Notice that you may ascribe the type after either the name or the initializing expression. Actually, type ascriptions can syntactically appear after (nearly) any value or declared name. ML's type inference algorithm "propagates" the ascribed type to other positions in the code that must have the same type.
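
The two ascription positions can be sketched like this (names and values are hypothetical); none of these ascriptions is required, since the types would be inferred anyway.

```sml
val count : int = 5;               (* ascription after the declared name *)
val ratio = 2.5 : real;            (* ascription after the initializing expression *)
val label : string = "n" ^ "=5";   (* the ascribed type must agree with the inferred one *)
```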

For simple values like the ones we've seen so far, ascription is never necessary, but we will eventually see examples where types must be explicitly ascribed (i.e., written down).

(Side note: In some cases, ML programmers ascribe types even where it's not necessary --- either for documentation, or to give a value a "more specific" type than the inference algorithm will infer by itself.)

Incorrect type ascriptions

Short answer: if the ascriptions cause the inference algorithm to assign an invalid type to an expression, then a type error results. We'll discuss this in more detail when we cover type inference and polymorphism.

Built-in compound data types

You should be familiar with these fundamental types from Java, but in ML all these built-in types are immutable. If you want to "alter" one of these compound values, you must create a new value that copies all the components except the field or position you want to change; that field/position should contain the updated value.

ML has special syntactic support for constructing and manipulating its built-in types. This is one of the reasons ML code is much more compact than C or Java code. Each family of built-in types has a constructor syntax that constructs a value of appropriate type from that family. (In ML, a constructor for a type t is a function that takes zero or more arguments and constructs a fresh value of t.)

Records

Records resemble structs in C, or method-less objects in Java; they are constructed by writing a list of one or more field assignments name = value in between two curly braces {}. Here are some examples:

As you can see, a record type (e.g., {x:int}) is written as a comma-separated list of one or more field declarations name:type in between curly braces. In general, the syntax of types in ML closely mirrors the syntax for constructing values of those types.
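
A few record values, with the type of each noted in a comment. The field names and values here are invented for illustration.

```sml
val simple = {x = 1};                     (* : {x:int} *)
val point = {x = 1.0, y = 2.2};           (* : {x:real, y:real} *)
val person = {name = "Ada", age = 36};    (* : {age:int, name:string} *)
```

The order in which fields are written does not matter; the type is determined by the field names and their types.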

Record types are equivalent if they have exactly the same field names and types. A record of one type cannot be assigned to a record of a different type:

- val aPoint:{x:int, y:int} = {x = 1.0, y = 2.2};
stdIn:1.1-50.20 Error: pattern and expression
    in val dec don't agree [tycon mismatch]
  pattern:    {x:int, y:int}
  expression:    {x:real, y:real}
  in declaration:
    aPoint : {x:int, y:int} = {x=1.0,y=2.2}

- val simpleRecord:{x:int} = {x = 1, y = 2};
stdIn:55.1-55.42 Error: pattern and expression
    in val dec don't agree [tycon mismatch]
  pattern:    {x:int}
  expression:    {x:int, y:int}
  in declaration:
    simpleRecord : {x:int} = {x=1,y=2}

Notice that, unlike objects in a language like Java, a record value cannot be "implicitly promoted" to a record with fewer fields. In other words, ML does not have subtype polymorphism.

Fields of a record value are accessed using the special function #fieldName applied to recordValue:

- val r = {x=1, y=2};
val r = {x=1,y=2} : {x:int, y:int}
- #x(r);
val it = 1 : int
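
Because records are immutable, "changing" a field means building a new record that copies the other fields. A minimal sketch, with the hypothetical names moved and originalX:

```sml
val r = {x = 1, y = 2};
val moved = {x = 10, y = #y r};   (* new record: x replaced, y copied from r *)
val originalX = #x r;             (* r itself is untouched: 1 *)
```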

Side note: What happens if you put zero fields in a record?

- {};
val it = () : unit

Oops. That doesn't look like a record type --- that's unit. In my opinion, this is a bug in ML. However, see below on the empty tuple.

Tuples

Tuples work a lot like records, except that the fields have an explicit order; and instead of using field names, you use positions to access the members.

Tuples are constructed simply by enclosing a comma-separated list of two or more values in round parentheses ():

- (1, 2);
val it = (1,2) : int * int
- ("foo", 25, #"b", false);
val it = ("foo",25,#"b",false) : string * int * char * bool

As you can see, tuple types are written as a *-separated sequence of types: type1 * type2 * ... * typeN.

The Kth element of an N-tuple can be accessed by the special accessor function #K, as follows:

- val x = (54, "hello");
val x = (54,"hello") : int * string
- val firstX = #1(x);
val firstX = 54 : int
- val secondX = #2(x);
val secondX = "hello" : string

Side note: What happens if you put one element in parens? Zero?

- (1);
val it = 1 : int
- ();
val it = () : unit

In my opinion, unlike the empty record case, these make sense. As in other languages, parentheses group terms that should be evaluated before other terms. Rather than constructing a 1-tuple, which is useless, (expr) evaluates expr before any surrounding expressions and returns it. Also, viewing unit as a "zero-tuple" makes more sense to me than viewing empty records as unit, though I can't justify this opinion with anything other than my arbitrary taste.

Lists

Linked lists are the bread and butter of functional programming. (Perhaps recursive, higher-order functions are the knife and fingers.) ML lists are homogeneous; that is, all elements must have the same type. The type of a list of elements of type t is written "t list", e.g. int list or string list. For any type t, a t list has two constructors:

nil, the empty list (also written [])
:: (pronounced "cons", terminology borrowed from Lisp), which is an infix operator that constructs a single list cell from its left and right arguments. The left argument must be of some type t, and the right argument must be of some type t list. Intuitively, this should be familiar; in a Java-like language, a node in a singly linked list whose elements have type T would usually be defined as follows:
```
class TListNode {
    T value;
    TListNode next;
}
```

Lists may also be constructed from a comma-separated list of values inside square brackets []. This is syntactic sugar for a sequence of conses; and, in fact, when you type a list of conses at the repl, SML/NJ will answer using this sugared syntax.

- val x = 1::nil;
val x = [1] : int list
- val y = 1::2::3::nil;
val y = [1,2,3] : int list
- val z = 4::x;
val z = [4,1] : int list

A picture of the data structures in memory that result from the above three declarations is shown in Fig. 3.

[ML top-level environment and heap: x = 1::nil, y = 1::2::3::nil, and z = 4::x.]

Figure 3: ML top-level environment and data structures in heap resulting from list construction.

Note the following:

Lists have finite length, so the last element must always be nil.
The value bound to z is well-typed because the 4 is an int, and x is an int list.
The list constructed for z uses the list value bound to x directly as its "tail". This is safe because lists are immutable.
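
These observations can be checked with a short sketch. Note that = on lists compares contents, not pointers, so the last line only confirms that the tail of z has the same elements as x; the sharing itself is not observable from ML code. The names sugared and tailIsX are hypothetical.

```sml
val x = 1 :: nil;
val y = 1 :: 2 :: 3 :: nil;
val z = 4 :: x;
val sugared = (y = [1, 2, 3]);   (* bracket syntax is sugar for conses: true *)
val tailIsX = (tl z = x);        (* the tail of z has the same contents as x: true *)
```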

The first element of a list can be obtained using the function hd ("head"), and the rest of a list can be obtained using tl ("tail"). Note that, in functional programming terminology, the tail is the entire rest of the list after the head, not the last element (think tadpoles, not dogs). Calling hd or tl on an empty list results in a runtime error (exception).

- hd([1,2,3]);
val it = 1 : int
- hd(tl([1,2,3]));
val it = 2 : int
- hd(tl(1::nil));

uncaught exception Empty
  raised at: boot/list.sml:36.38-36.43

Q: What is the type of a bare nil?

- nil;
val it = [] : 'a list

What is this 'a business? In ML, a type whose name begins with a single quote character is a type variable which means, roughly, "any type can be substituted here". Types with type variables are called polymorphic types. nil is actually a polymorphic value, i.e. it has polymorphic type; this must be so, because lists of all types share nil as the terminating value.
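
A brief sketch of nil being used at two different element types; the names empty, ints, and strs are hypothetical.

```sml
val empty = nil;           (* : 'a list *)
val ints = 1 :: empty;     (* 'a is taken to be int here: int list *)
val strs = "a" :: empty;   (* and string here: string list *)
```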

The polymorphism in ML's type system is actually one of its best features. We will describe this in more detail as the quarter goes on; for now, we'll work mostly with lists with some concrete element type.

Uniform reference data model

As depicted in the figures in the previous section, all ML values are accessed by reference, a.k.a. by pointer. When a value is bound to a name or stored in another data structure, the pointer to that value is copied to the appropriate location, not the value itself.

Uniformly accessing variables by reference greatly simplifies program understanding. In languages where values can be "inline" rather than by-reference, there are complex and confusing rules for how and when values are implicitly copied, and what happens when these implicit copies occur.

(If you're familiar with C++, consider the uses of copy constructors, or what happens when you copy a value of type T to a stack-allocated value belonging to one of T's superclasses.)

All values are first-class citizens

All ML's data values are first-class citizens, meaning that all values have "equal rights": they can all be passed to functions, returned from functions, bound to names, stored as elements of other values, etc.

One consequence is that in ML, as in most reasonable languages, compound types can be nested arbitrarily. You can have lists of tuples, tuples of lists, or records of lists of tuples of records of tuples, etc., because a compound type can be used anywhere an atomic type can be used. This is an example of ML's high degree of orthogonality:

- val a = [{x=1,y=2},{x=3,y=4}];
val a = [{x=1,y=2},{x=3,y=4}] : {x:int, y:int} list
- val b = ("hello", [#"w", #"o", #"r", #"l", #"d"], #"!");
val b = ("hello",[#"w",#"o",#"r",#"l",#"d"],#"!")
    : string * char list * char
- val c = {name=("Keunwoo", "Lee"), 
=          classes=["341","590dg","590l"],
=          age=26};
val c = {age=26,classes=["341","590dg","590l"],name=("Keunwoo","Lee")}
     : {age:int, classes:string list, name:string * string}

Exercise: try writing code in Java, or your favorite other programming language, that constructs objects that are roughly equivalent to the above three values. How many lines does it take?

Let-expressions and nested environments

In the above, we alluded to the fact that the top-level environment was not the only environment. Let expressions are one way to introduce local environments, which produce names that are visible only in a local scope.

Let expressions have the form let decls in expr end, where decls is a semicolon-separated sequence of declarations and expr is some expression that may optionally use the names bound in decls. Names bound in a let-expression are only visible to later bindings in the same let-expression, and inside the body expression. Outside the scope of the let-expression, the bindings are no longer visible. For example:

- let val x = 5 in x + x end;
val it = 10 : int
- let
=   val localA = "hello";
=   val localB = "+++++++";
=   val localB = ", ";
=   val localC = localB ^ "world"
= in
=   localA ^ localC                     (* XXX *)
= end;
val it = "hello, world" : string
- localA;
stdIn:88.1-88.9 Error: unbound variable or constructor: localA
- let
=   val earlierBinding = laterBinding + 1;
=   val laterBinding = 5
= in
=   earlierBinding + laterBinding
= end;
stdIn:120.24-120.36 Error: unbound variable or constructor: laterBinding

[ML local let environment with bindings (from top to bottom).]

Figure 4: Contents of local let-environment at point XXX.

Order of bindings matters:

Later bindings are not visible to earlier ones
Later bindings shadow earlier bindings with the same name.

These are really the same rules that apply in the top-level environment. All environments in ML work the same way. This is an example of ML's high degree of regularity: there are no special rules for top-level versus local environments.
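
One consequence worth seeing in code is that a let-bound name can shadow a top-level name, and the top-level binding is visible again once the let ends. A minimal sketch with hypothetical names:

```sml
val n = 1;
val inner =
  let
    val n = 100;     (* shadows the top-level n, inside this let only *)
    val m = n + 1    (* sees the local n: 101 *)
  in
    n + m            (* 100 + 101 = 201 *)
  end;
val outer = n;       (* the top-level n is unaffected: 1 *)
```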

A diagram of the local environment at the point marked XXX is given in Fig. 4.

Again: all values are first-class

All expressions are first-class, and let expressions are expressions. Therefore, let expressions can be nested, and more generally may appear anywhere other expressions may appear:

- val longLetExpr =
=   let
=     val aString = let val x = "hi, "; val y = "there" in x ^ y end;
=     val anInt = 17
=   in
=     (anInt, let val period = "." in aString ^ period end)
=   end;
val longLetExpr = (17,"hi, there.") : int * string
