[ ^ CSE 341, Winter 2004 home page | Lectures index ]

CSE 341: Mutable and optional data

In languages like Java (or C/C++), every object reference (or pointer) may be null, and by default can be mutated.

In ML, as we have seen, all references to data default to being immutable (unchangeable), and always point to some value; an int * string tuple must contain an integer value and a string value. This simplifies reasoning considerably:

When you extract parts of the value, you do not have to worry about whether you'll get a value out.
When you construct a value, you know that it will retain that value forevermore --- you cannot pass the value to some other part of the program, which will update it behind your back, thereby breaking some important invariant of your data structure.

However, sometimes you need mutation (the ability to update a value) or optional-ness, and ML's standard library provides data types for both. Optional data is fairly straightforward. Mutable data, however, represents a significant departure from what we've covered previously.

Optional data

Usually, when you have a data type that requires an "empty" case, you will define a customized constructor for that data type --- for example, our polymorphic tree:

datatype 'a Tree = Empty | Node of 'a * 'a Tree * 'a Tree

However, sometimes it's annoying to define a new data type whenever something is optional. What if you want to define a find function over lists that only optionally returns a value? You could define a new datatype:

datatype 'a FindResult = NotFound | Found of 'a

and then find could have type

(('a -> bool) * 'a list) -> 'a FindResult

But this is overkill; and you would have to do it for every function that might optionally return an empty value. So ML provides a standard polymorphic library datatype option:

datatype 'a option = NONE | SOME of 'a

This is used the same way that any other datatype is used:

- val v = SOME 5;
val v = SOME 5 : int option

- fun find f nil = NONE
  | find f (x::xs) =
    if f x then SOME x else find f xs;
val find = fn : ('a -> bool) -> 'a list -> 'a option

- case find (fn x => x > 0.0) [~2.5, 0.0, ~4.4, 30.0, ~15.0] of
      NONE => "No value"
    | SOME v => "Found: " ^ (Real.toString v);
val it = "Found: 30.0" : string

There's also a standard function valOf that is defined as follows:

fun valOf NONE = raise Option
  | valOf (SOME s) = s;

You, as a user, can choose whether to pattern-match over both cases or to call valOf and accept an Option exception in the NONE case. There's also a getOpt function that allows you to provide a default value to be returned in the NONE case:

- getOpt (NONE, ~1);
val it = ~1 : int
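
As a rough illustration of the three ways to consume an option described above, here is a small sketch. The function lookup, the function describe, and the list ages are hypothetical names made up for this example; only getOpt, valOf, and the Option exception come from the standard library.

```sml
(* lookup is a hypothetical helper: find the value paired with key. *)
fun lookup key [] = NONE
  | lookup key ((k, v) :: rest) =
      if k = key then SOME v else lookup key rest

(* Hypothetical sample data. *)
val ages = [("ann", 31), ("bob", 27)]

(* 1. Pattern matching handles both cases explicitly. *)
fun describe name =
    case lookup name ages of
        NONE => name ^ ": unknown"
      | SOME age => name ^ ": " ^ Int.toString age

(* 2. getOpt substitutes a default value for NONE. *)
val bobAge = getOpt (lookup "bob" ages, ~1)
val eveAge = getOpt (lookup "eve" ages, ~1)

(* 3. valOf raises Option on NONE; here we catch it and fall back to 0. *)
val eveAge' = valOf (lookup "eve" ages) handle Option => 0
```

Pattern matching makes the compiler check that both cases are covered, while valOf defers the failure to run time.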

Why not option everywhere?

Note that we could have used option instead of defining multiple cases for our tree data:

datatype 'a Tree = Node of ('a * 'a Tree * 'a Tree) option;

In this representation, the argument of Node is optional; an empty value is represented as follows:

- Node NONE;
val it = Node NONE : 'a Tree

A non-empty tree is represented using SOME:

- Node (SOME (10, Node NONE, Node NONE));
val it = Node (SOME (10,Node NONE,Node NONE)) : int Tree
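
To see how the two encodings compare when writing functions, here is a sketch of a node-counting function over each. The option-based type is renamed OptTree, with constructor ONode, purely so that both declarations can live in one file; those names, along with size and optSize, are assumptions of this example.

```sml
(* Two-constructor encoding, as defined earlier in these notes. *)
datatype 'a Tree = Empty | Node of 'a * 'a Tree * 'a Tree

(* Option-based encoding, renamed (hypothetically) to avoid a name clash. *)
datatype 'a OptTree = ONode of ('a * 'a OptTree * 'a OptTree) option

(* Count nodes in the two-constructor encoding. *)
fun size Empty = 0
  | size (Node (_, l, r)) = 1 + size l + size r

(* Count nodes in the option encoding: every pattern needs an extra layer. *)
fun optSize (ONode NONE) = 0
  | optSize (ONode (SOME (_, l, r))) = 1 + optSize l + optSize r

val t  = Node (10, Node (5, Empty, Empty), Empty)
val ot = ONode (SOME (10, ONode (SOME (5, ONode NONE, ONode NONE)), ONode NONE))
```

Both functions compute the same thing; the option version just wraps every pattern and every constructed value in an additional SOME or NONE.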

This is more cumbersome, obviously. But actually, this is how many languages --- e.g., Java and C --- typically encode data types with an "empty" case. This is because in such languages, all pointers can be null. Consider the Java tree node class:

public class Node {
    final Object val;
    final Node left, right;
    public Node(Object val, Node left, Node right) {
       this.val = val; this.left = left; this.right = right;
    }
}

What is an empty tree? It is an empty Node reference:

final Node n = null;

A tree with two empty children uses two null pointers:

final Node m = new Node("hi", null, null);

Therefore, in Java-like languages, every reference to a type T is really a reference to a type "T option". This means that the programmer always has to consider whether some value might be null and lead to a null pointer exception.

Mutable data

Mutable data is handled in ML primarily using the 'a ref polymorphic datatype, which has a single constructor, ref:

- ref;
val it = fn : 'a -> 'a ref
- val x = ref 5;
val x = ref 5 : int ref

ref allocates a fresh mutable (alterable/assignable) reference which can be read or changed (the value is sometimes called a ref cell). For any value v of type T ref, you can perform two operations:

Dereference the value with the operator ! (exclamation point), producing a value of type T:

- op !;
val it = fn : 'a ref -> 'a

- !x;
val it = 5 : int

- val i:int = x;
stdIn:18.1-18.14 Error: pattern and expression in val
    dec don't agree [tycon mismatch]
  pattern:    int
  expression:    int ref
  in declaration:
    i : int = x

Update the value using the assignment operator :=, as follows:

- op :=;
val it = fn : 'a ref * 'a -> unit

- x := 10;
val it = () : unit

- !x;
val it = 10 : int

[Diagrams of memory after ref allocation and update]

Fig. 1: Diagrams of memory after val x = ref 5 and x := 10.

Note that this does not alter the binding --- bindings are immutable. The binding continues to point to the same ref cell; it is only the contents of the cell that are updated.

Fig. 1 shows how allocation and updating work. The ref constructor allocates a cell and fixes an initial value. The := operation updates the value in the cell, making it point to a different integer.

The fact that x points to the same ref cell should become clear when we produce an alias to the same ref cell (another pointer that points to the same location):

- val y = x;
val y = ref 10 : int ref
- x := 20;
val it = () : unit
- y := 30;
val it = () : unit
- !x;
val it = 30 : int
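
For contrast, the following sketch shows the difference between aliasing a cell and copying its contents into a fresh cell. The names alias, copy, and the result bindings are made up for this example.

```sml
val x = ref 5
val alias = x            (* same cell as x *)
val copy  = ref (!x)     (* fresh cell, initialised with x's current contents *)

val () = x := 20

val aliasSees = !alias   (* 20: the alias observes the update *)
val copySees  = !copy    (* 5: the copy is a separate cell *)

(* Equality on refs compares cell identity, not contents. *)
val sameCell  = (x = alias)     (* true *)
val otherCell = (x = ref 20)    (* false, even though the contents match *)
```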

ref values are first class --- they can be parts of any value, in the usual way:

- val name = {first=ref "Keunwoo", last=ref "Lee"};
val name = {first=ref "Keunwoo",last=ref "Lee"}
  : {first:string ref, last:string ref}

- #last(name);
val it = ref "Lee" : string ref

- #last(name) := "Kim";
val it = () : unit

- name;
val it = {first=ref "Keunwoo",last=ref "Kim"}
  : {first:string ref, last:string ref}
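
Because refs are ordinary values, a ref can also be captured by a function. As a minimal sketch (makeCounter and the bindings below are hypothetical names, not library functions), each call to makeCounter allocates its own private cell:

```sml
(* makeCounter returns a function that increments and reads its own cell. *)
fun makeCounter () =
    let
        val count = ref 0
    in
        fn () => (count := !count + 1; !count)
    end

val next  = makeCounter ()
val other = makeCounter ()

val a = next ()    (* 1 *)
val b = next ()    (* 2: same cell as the previous call *)
val c = other ()   (* 1: a different counter with a different cell *)
```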

In languages like Java or C, essentially all bindings (including object fields, local variables, and class variables) are actually bound to refs, because they can be updated. In fact, in Java, all non-final object references are actually refs to options, because they are updatable locations whose contents may be null.

(Thought question: what is the difference between an int option ref and an int ref option?)

This is another example of ML's clean design and orthogonality --- you do not get "more than you asked for" in a type, but you can freely combine properties like mutability or optional-ness when you want them.

Iteration in ML

Suppose you wanted to write an iterative sumList function instead of a recursive one. Now that we have assignment, we can do so --- it looks like this:

fun sumList aList =
    let
        val sum = ref 0
        val current = ref aList
    in
        (while not (null(!current))
         do (sum := hd(!current) + !sum;
             current := tl(!current));
         !sum)
    end;

Note our use of the (expr; ... ;expr) expression sequence syntax. Even setting aside the clutter of the many dereferences that ML requires, I claim this is clearly uglier than the recursive version, even after the recursive version has been converted to tail-recursive form.
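
For comparison, here is one way the tail-recursive version might look. The names sumListRec and loop are chosen for this sketch; the accumulator argument plays the role of the sum ref cell, and the list argument plays the role of current.

```sml
(* Tail-recursive sum: loop carries the remaining list and the running total. *)
fun sumListRec aList =
    let
        fun loop ([], acc) = acc
          | loop (x :: xs, acc) = loop (xs, x + acc)
    in
        loop (aList, 0)
    end

val six = sumListRec [1, 2, 3]
```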

Suggested exercise: try to write map, filter, and foldl using iteration. Which do you prefer, the iterative or recursive formulations of these functions?

The polymorphic ref problem

Mutable data brings us to an interesting type system problem. Suppose we could have a value of type 'a list ref (note: the following is not legal ML code, for reasons we'll discuss shortly):

val x:'a list ref = ref [];

Seems to make perfect sense: [] has type 'a list (it's a polymorphic value), so we should be able to allocate a ref cell and assign that to a binding of type 'a list ref. But now suppose we have the following code:

fun f y = x := y;
f [17];

Since x has the type 'a list ref, the function f ought to have the type 'a list -> unit, and the body of f ought to typecheck --- we're updating the contents of 'a list ref with a value of type 'a list.

We should then be able to apply f to the value [17] by instantiating f's type to int list -> unit. Evaluation of f [17] results in the list value [17] becoming the target of x's ref cell.

Now, suppose we do this:

fun g () = !x;
val y:bool list = g();
if hd(y) then "hi" else "bye";

(Pretend you don't know about f and f [17], because the typechecker doesn't.) This code ought to typecheck as well! Consider the body of g: it dereferences x, which has type 'a list ref. Therefore, g should get type unit -> 'a list (the return type is the result type from dereferencing a 'a list ref).

Now, when we bind the result of g(), which has type 'a list, to a bool list binding, we simply instantiate 'a with bool, so that binding is well-typed.

Finally, we take the head of y and use it as a boolean value. But, supposing we executed f [17] as we did above, the head of y will not be a boolean value --- it will be an integer. We have just violated type safety. This is known as the "polymorphic ref problem" and comes up wherever we have mutation and polymorphism together.

Where did we go wrong?

ML's answer is that we should not allow the type 'a list ref for a val binding, because it could be instantiated later with two different types for 'a --- which, as we've shown, can lead to writing the ref cell at one type, and reading it at another.

More generally, ML strongly restricts the introduction of polymorphic types for val bindings. For a binding

val name = expr

name is given polymorphic type only if expr is a syntactic value. Recall that a value is an expression that is "done" evaluating --- a syntactic value is a syntactic representation of an immutable value. Syntactic values include only the following kinds of expressions:

Literal constants.
Anonymous function expressions (fn ... => ...).
Constructors of immutable types applied to expressions that are (recursively) syntactic values.

Note that function calls are not included. This rule is called the value restriction. It suffices to make sure that you're not creating mutable locations, either directly (by constructing a mutable location) or indirectly (e.g., by calling a function that constructs a ref cell).

When you get a polymorphic type from a non-syntactic-value expression, and attempt to bind it to a name, ML will instantiate the polymorphic type with a dummy type. This is why SML/NJ gives a warning when you write:

- val x = ref NONE;
stdIn:46.1-46.17 Warning: type vars not generalized because of
   value restriction are instantiated to dummy types (X1,X2,...)
val x = ref NONE : ?.X1 option ref

Recall that NONE has polymorphic type 'a option. ref NONE therefore, naively, has type 'a option ref; but this is not a syntactic value, so the 'a, rather than being "passed through" to the type of x, is instantiated with a fresh, non-polymorphic dummy type that SML/NJ prints as ?.X1.
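
In practice there are two common ways to stay clear of the value restriction, sketched below. The names cell and pairUp are invented for this example. The first fixes the type with an annotation so that nothing needs to be generalised; the second turns a partial application into a function declaration, which is a syntactic value.

```sml
(* 1. Annotate the contents type: the cell is monomorphic from the start. *)
val cell : int option ref = ref NONE
val () = cell := SOME 3

(* 2. A partial application such as
        val pairUp = map (fn x => (x, x))
      is a function call, not a syntactic value, so its type would not be
      generalised. Declaring it with an explicit argument restores polymorphism. *)
fun pairUp xs = map (fn x => (x, x)) xs

val ints = pairUp [1, 2]     (* used at int *)
val strs = pairUp ["a"]      (* and again at string *)
```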

Arrays

ML has other updatable data structures, including arrays, which work similarly to refs. Array functions are found in the Array structure (we haven't covered structures, but for now think of a structure as something like a Java package or a C++ namespace):

- Array.array;
val it = fn : int * 'a -> 'a array

- val a = Array.array(10, 0);
val a = [|0,0,0,0,0,0,0,0,0,0|] : int array

- val b = Array.fromList [1, 2, 3];
val b = [|1,2,3|] : int array

- Array.update(a, 0, 1);
val it = () : unit

- a;
val it = [|1,0,0,0,0,0,0,0,0,0|] : int array

- Array.sub(a, 0);
val it = 1 : int
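
As a slightly larger sketch of array mutation, the following reverses an array in place using Array.sub, Array.update, and a while loop. The function reverseInPlace is a hypothetical helper written for this example, not part of the Array structure.

```sml
(* Swap elements from both ends, moving inward until the indices meet. *)
fun reverseInPlace arr =
    let
        val i = ref 0
        val j = ref (Array.length arr - 1)
    in
        while !i < !j do
            let
                val tmp = Array.sub (arr, !i)
            in
                Array.update (arr, !i, Array.sub (arr, !j));
                Array.update (arr, !j, tmp);
                i := !i + 1;
                j := !j - 1
            end
    end

val nums = Array.fromList [1, 2, 3, 4]
val () = reverseInPlace nums
val asList = Array.foldr (op ::) [] nums   (* read the contents back as a list *)
```

Note that reverseInPlace returns unit: its only effect is on the contents of the array it is given.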

ML also has an immutable array type, called vector. You might wonder: if you have vector and ref, why do you need arrays? Couldn't you just use a vector of refs? You could:

- Vector.fromList [1, 2, 3];
val it = #[1,2,3] : int vector

- val c = Vector.fromList [ref 1, ref 2, ref 3];
val c = #[ref 1,ref 2,ref 3] : int ref vector

- Vector.sub(c, 0) := 4;
val it = () : unit

- !(Vector.sub (c, 0));
val it = 4 : int

[Diagrams of int ref vector and int array]

Fig. 2: Comparison of vector of int refs and int array.

The problem is that going through a ref cell has some overhead compared to using an ordinary value reference, and it is quite challenging to remove this overhead in the general case. The naive implementation of a vector of ref cells is shown in Fig. 2.

Because programs that use arrays (for example, numerical programs) typically require high time and space performance in array operations, this cost was considered prohibitive. ML chose to compromise its "purity" and offer an Array data type that stands for a direct array of mutable locations.