Module Systems and ML

Why module systems?

A module system is a mechanism for breaking a program into parts. There are roughly two reasons for module systems to exist:

  1. Separate reasoning: the ability to break down a system into parts, so that

    One hopes that understanding the recomposition is simpler than understanding all the parts' internals in combination at once.

    The complexity in a system with N parts is roughly O(N2) --- because there are N2 ways to connect N parts. Since modules are smaller than the whole program, and the number of modules is (hopefully) smaller than the number of finer-grained constructs (functions, data types, etc.), you can reduce the number of possible interconnections by using a module system judiciously.

  2. Separate compilation: the ability to break a system into parts, so that

    This process, adding up the costs of all individual module compilations and the link step, is typically more expensive than simply compiling the entire program in one go. However, separate compilation has the advantage that the entire program need not be recompiled when only one module changes. In practical large-scale development, separate compilation is extremely important.

    Separate compilation is a complex topic that is beyond the scope of a first course in programming languages. The one thing we will note here is that a necessary (but not sufficient) condition for separate compilation is separate typechecking: it must be possible to typecheck a module in isolation (delare it type-correct), without forcing the compiler to access the details of how other modules are implemented.

Key to Separate Reasoning: Encapsulation and interfaces

Encapsulation (a.k.a. information hiding) = A module only reveals what its clients need to know, and hides (or encapsulates, or abstracts) the rest.

The notion of hiding some parts of a module implies that some parts are exposed. The exposed part is usually called the interface of the module. An interface imposes constraints on how the client can use the module spoken of; how these constraints work is perhaps the key problem in module system design.

Note: some people seem to think encapsulation and information hiding are different. I have never seen anyone define a technical distinction between the two.

A bit of terminology

Language designers usually distinguish between the core language and the module language. The core language is what we've learned in ML prior to this lecture --- mechanisms for constructing and manipulating computations over ordinary values. The module language consists of mechanisms for arranging and combining core language constructs on a larger scale.

To some extent, features of module systems can be applied freely and independently to various core languages. For example, the notion of bundling core language declarations inside a module for namespace management is applicable in nearly every language. However, more complex module mechanisms typically depend on properties of the core language.

Why a separate module system?

Note our definition of module system: a mechanism for breaking a program into parts. This is pretty vague. Let's suppose that we want a mechanism for

Well, we already have a mechanism in the core language for putting a bunch of values into a bundle: records. Let's use a record to bundle up a bunch of functions for manipulating lists (ignore for the moment the fact that ML's syntax doesn't allow in-line definitions of recursive functions inside records; this is not a fundamental barrier, since we could just make up some syntax):

val List = {
    map     = fn ...,
    reverse = fn ...,
    ...
};

If this value were provided by default in the top-level environment, we could then refer to the map function as #map(List). Syntactically ugly, perhaps, but otherwise fine: it is "the map function in the List module".

This has at least two major drawbacks:

  1. Record types cannot hide fields. Recall that a record value's type must include the name and types of all its fields. So, for example, if the programmer of the List record wanted to define "private" function in the course of implementing several of these records, then the programmer would not be able to leave that function out of the record type.
  2. Record members can only be values, not types. You could not, for example, bundle up the list data type in the same record with its functions.

Both of these problems could be solved by extending the capabilities of the record data type, but they would lead to such fundamental changes, and such an increase in complexity, that you would lose the pleasing simplicity of the ML record type.

Similar drawbacks apply to using any other core language construct for modules.

ML modules

ML modules are sequences of declarations inside a struct expression; struct expressions must be bound to a name using a structure declaration:

- structure S = struct
    val x = 3
    type t = int * string
    datatype d = D of t
    fun f y = (y, Int.toString x)
end;
structure S :
  sig
    datatype d = D of t
    type t = int * string
    val f : 'a -> 'a * string
    val x : int
  end

The declared types, as well as the types of the structure's values, are reflected in the signature, which the SML/NJ interpreter echoed back between sig ... end. Note that values are erased in the signature; only types remain. A signature is "the type of a structure".

To refer to a structure member, use the syntax structExpr.name, e.g.:

- S.x;
val it = 3 : int
- S.f;
val it = fn : 'a -> 'a * string
- val (a:S.t) = (3, "hi");
val a = (3,"hi") : S.t
- S.D a;
val it = D (3,"hi") : S.d

This simplest use of modules allows namespace management: in other words, programmers can group related names together inside a module. This is a very weak (although indispensible) form of encapsulation --- it only prevents accidental name clashes (a name is only revealed to clients who ask for it by accessing it through the structure path).

Structures can be nested, with the obvious semantics:

- structure Outer = struct
  val x = 5
  structure Inner = struct
    val x = "hi"
  end
end;
structure Outer :
  sig
    structure Inner : 
    val x : int
  end

- val outerX = Outer.x;
val outerX = 5 : int
- val innerX = Outer.Inner.x;
val innerX = "hi" : string

Signature ascription for abstraction

When a structure is declared, ML automatically generates a principal signature for the structure --- the one that reflects all its bindings and types. However, the programmer may wish to hide parts of this signature, in order to abstract away (encapsulate) some details.

This is done using signature ascription, which is rather analogous to type ascription, except that there are two kinds of signature ascription: transparent and opaque. Both forms of ascription check that the principal signature matches the ascribed signature. The difference is in what each reveals about types:

Transparent signature ascription for modules is written with a :, like type ascription for values:

- structure STransparent:sig
    type t;
    type d;
    val f:int -> t
end = S;
structure STransparent :
  sig
    type t = int * string
    datatype d = ...
    val f : int -> t
  end

Note that the signature allows the definitions of t and d to "show through" the ascription.

Opaque ascription is written with a :>

- structure SOpaque :> sig
    type t;
    type d;
    val f:int -> t
end = S;
structure SOpaque :
  sig
    type t
    type d
    val f : int -> t
  end

Note that t and d are now fully abstract --- nothing is known about them.

Both transparent and opaque ascription hide any types or values that are not declared in the signature at all. In the example above, both ascribed structures omit the x value.

Signatures can be bound to names using the special signature declaration:

- signature Sig = sig
    type t
    type d
    val f:int -> t
end;
signature Sig =
  sig
    type t
    type d
    val f : int -> t
  end

Opening modules and local

Sometimes it is inconvenient to refer to module members by the full structure path. To save typing and visual clutter, ML provides an open declaration, which dumps a module's members into the scope in which the open declaration appears:

- open Outer.Inner;
opening Outer.Inner
  val x : string

- structure OpenDemo = struct
    open Outer;
end;
structure OpenDemo :
  sig
    structure Inner : 
    val x : int
  end

Generally, it is a bad idea to open a module, because it's difficult to keep track of what names will be dumped into the current environment.

If you must open a module, it is best to limit the extent of the resulting names. ML has a local construct for this purpose. local declarations have the syntax

local openDecls in bodyDecls end

The open statements in openDecls make the names in the opened structures visible for the declarations in bodyDecls. Then, bodyDecls are added to the scope containing the local declaration.

- structure ToOpen = struct val a = "foo" end;
structure ToOpen : sig val a : string end

- local
    open ToOpen
  in
    val b = a
  end;
val b = "foo" : string

- b;
val it = "foo" : string

local declarations are rather like a "declaration-level let" (Notice that let-expressions have no mechanism for adding declarations to the surrounding environment.)

Functors: Parameterized modules (or, functions over modules)

At the expression level, functions abstract over expressions. Likewise, at the module level, we would like some mechanism to abstract over modules --- if we have a common "module pattern" for more than one combinations of types and values, then we would like to write that pattern once, and apply it to several modules.

In ML, a functor is a "function over structures". Functor declarations have the form

functor name(params):optionalSig = structBody

name is the name of the functor, params are the the formal functor parameters, which will typically be structures; structure parameters must have ascribed signatures, because parameters' signatures are not inferred at the module level. As with structures, the user may optionally provide a signature for the functor, optionalSig. Finally, structBody is the body of the functor.

In core language functions, the body is an expression, and the parameter names are available as expressions in the body. In module language, the body is a structure; and the functor parameter names are available in the functor signature and body as as modules.

Ullman's textbook defines the BTreeFun functor shown in abbreviated form below along with the helper signature TOTALORDER:

signature TOTALORDER = sig
  type element
  val lt : element * element -> bool
end;

functor MakeBST(Lt:TOTALORDER):
  sig
    type 'a btree
    val create : Lt.element btree;
    val lookup : Lt.element * Lt.element btree -> bool
    val insert : Lt.element * Lt.element btree
                   -> Lt.element btree
  end
  =
  struct
    open Lt

    datatype 'a btree =
             Empty
           | Node of 'a * 'a btree * 'a btree;

    val create = Empty

    fun lookup(x, Empty) = false
      | lookup(x, Node(v, left, right)) =
        if Lt.lt(x, v)      then lookup(x, left)
        else if Lt.lt(v, x) then lookup(x, right)
        else true

    fun insert(x, Empty) = Node(x, Empty, Empty)
      | insert(x, n as Node(v, left, right)) =
        if Lt.lt(x, v)    then Node(v, insert(x, left), right)
        else if Lt.lt(v, x) then Node(v, right, insert(x, right))
        else n
  end;

Leaving aside signatures for the moment, this works much like you'd expect a function over structures to work:

This is analogous to having a function at the expression level which takes a record, and then accesses record members inside the function body. Since structures (unlike records) can contain types, it is also useful for the parameters to be accessible

To instantiate this functor into a structure, you need a structure satisfying signature TOTALORDER:

- structure StringLt : TOTALORDER =
struct
  type element = string
  val lt:(string * string -> bool) = op <;
end;
structure StringLt : TOTALORDER

- structure StringBST = MakeBST(StringLt);
structure StringBST :
  sig
    datatype 'a btree = ...
    val create : Lt.element btree
    val lookup : Lt.element * Lt.element btree -> bool
    val insert : Lt.element * Lt.element btree -> Lt.element btree
  end

Functors vs. records of functions

Previously, we achieved results very similar to the above by defining a datatype that held a record of functions:

datatype 'a BTNode = Empty | Node of 'a * 'a BTNode * 'a BTNode;
datatype 'a BTree = Tree of {lt:'a * 'a -> bool, root:'a BTNode}

This was simpler than the functor-based approach above. Furthermore, in both approaches, the programmer has the flexibility to write search trees with any data type, and even provide different function(s) that compare values. Why would we ever want to use functors?

This question becomes even more pressing when we consider the complexity of functor semantics. We have only scratched the surface in the example above --- remembering the full semantics of functors exactly is challenging even for Ph.D.'s in programming languages. (In fact, functors are a rather controversial topic in the programming languages community.)

Nevertheless, there are at least two answers to why we'd want to use functors:

Modules are not first class

Module values and module expressions are not first-class. In order to bind a module to a name, you must use the special structure binding rather than a val binding. You cannot have a "conditional module" expression:

structure S = if someComplicatedExpression
              then struct type T = ... end
              else struct type T = ... end;

This is not an accident. There are proposals to make modules first-class, but there are subtle issues with typechecking (which are beyond the scope of this course) and there is some disagreement within the languages community as to whether it's even desirable to make modules first-class.