CSE341 Notes for Friday, 4/16/10

We continued our discussion of type safety. One final example is that C and C++ don't check the bounds for simple arrays. You might have an array that is declared to have 100 elements, but C and C++ allow you to ask for element 5000 or element -250. The program simply computes a location in memory and interprets the bits it finds there as being of the element type of the array. That's one of the reasons that it is so common for C and C++ programs to crash with a "segmentation fault": the language allows you to make potentially dangerous references to random parts of memory.

We ran this short program to explore that:

        #include <iostream>
        using namespace std;
        
        int main() {
          int data[100];
          int n;
          cout << "n? ";
          cin >> n;
        
          cout << data[n] << endl;
          data[n] = 3;
          cout << "have a nice day" << endl;
        
          return 0;
        }

This program printed different values on different executions for all values of n, because the array is never initialized, and when n was outside the bounds of the array, we often got a segmentation fault.

These are examples of ways in which C and C++ are not type safe. You should not be able to refer to variables that were never initialized, you should not be able to arbitrarily reinterpret bits in a different way, and you shouldn't be able to reach into random parts of memory and treat what you find there as any kind of data you like.
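For contrast, here is a minimal sketch of the same kind of lookup using the bounds-checked accessor std::array::at, which throws std::out_of_range instead of reading arbitrary memory. The helper name try_get, the demo function, and the zero-initialized array are assumptions made for illustration; they were not part of the program we ran in lecture.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not from lecture): copies data[n] into out and returns
// true when n is a legal index; returns false instead of touching memory
// outside the array.
bool try_get(const std::array<int, 100>& data, long n, int& out) {
    if (n < 0) return false;  // at() takes an unsigned index, so reject negatives first
    try {
        out = data.at(static_cast<std::size_t>(n));  // at() checks n against the size
        return true;
    } catch (const std::out_of_range&) {
        return false;  // thrown by at() when n >= 100
    }
}

// Small self-check using the indexes mentioned in the notes.
void demo_try_get() {
    std::array<int, 100> data{};  // value-initialized, so every element is 0
    int value = -1;
    assert(try_get(data, 5, value) && value == 0);
    assert(!try_get(data, 5000, value));
    assert(!try_get(data, -250, value));
}
```

The check is opt-in: plain data[n] on the same std::array is still unchecked, which is the behavior the notes describe for simple arrays.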

These concerns became far more important when the world wide web came along because all of a sudden we wanted to include applets in our web pages. To convince a user to run an arbitrary bit of code on their computer, you have to give them some kind of assurance that your program is well behaved. That was possible to do with Java because the designers of the language took type safety very seriously. They were able to make a convincing case that Java programs ran in a "sandbox" that severely limited the potential damage they could do. It's almost impossible to give similar guarantees about code written in languages like C and C++.

It was interesting to see Microsoft deal with this issue in the 1990s. At first they were experimenting with Java, but when they started making changes to the libraries, Sun sued them. Instead, Microsoft designed a new language called C# that ended up looking a lot like Java. C# can make similar guarantees of type safety with one notable exception. C# has a special keyword "unsafe" that can be attached to methods and classes. This allows C# programs to work with legacy code written in unsafe languages, but programmers are encouraged to avoid unsafe code whenever possible.

Then I mentioned that I wanted to discuss an issue of polymorphism in ML. I asked people to consider the following function that should behave like the built-in length function:

        fun len(lst) =
            if lst = [] then 0
            else 1 + len(tl(lst));

When I loaded this in the ML interpreter, I got a warning and an odd type notation:

        fri.sml:2.12 Warning: calling polyEqual
        val len = fn : ''a list -> int

The warning is generated by line 2 (in fact, character 12 of line 2 is what the "2.12" means). That happens because we have written this function in such a way that it depends on recognizing the equality of two different expressions. Many types in ML can be compared for equality, but not all. For example, we got an error when we went into the interpreter and asked:

        - 3.8 = 3.8;
        stdIn:1.1-1.10 Error: operator and operand don't agree [equality type required]
          operator domain: ''Z * ''Z
          operand:         real * real
          in expression:
            3.8 = 3.8

ML does not allow you to compare values of type real for equality. The reasoning is that floating point numbers are stored as approximations, not as exact representations, so you shouldn't use a strict equality operation.
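The same caution applies in C++, which does allow == on floating-point values. The sketch below shows an exact comparison failing on a sum that looks like it should equal 0.3, alongside a tolerance-based comparison. The helper name nearly_equal, the demo function, and the tolerance 1e-9 are assumptions chosen for illustration; nothing here was shown in lecture or is prescribed by ML.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: treats two doubles as equal when they differ by less
// than eps. The default tolerance is an arbitrary choice for this sketch.
bool nearly_equal(double a, double b, double eps = 1e-9) {
    return std::fabs(a - b) < eps;
}

void demo_real_equality() {
    double sum = 0.1 + 0.2;          // each operand is only an approximation
    assert(sum != 0.3);              // so strict equality reports "not equal"
    assert(nearly_equal(sum, 0.3));  // a tolerance-based comparison succeeds
}
```

A fixed absolute tolerance like this one is only reasonable when the values being compared are of moderate size; it is meant to illustrate the idea, not to serve as a general-purpose comparison.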

So the warning is letting us know that we have written the function in such a way that we can apply it only to lists of equality types. We would not be able to use it on a list of real values. ML indicates that with the double apostrophe on the generic type. Instead of 'a, ML describes it in terms of ''a.

In general, you want to write your functions so that they don't have this limitation. There is no reason that you can't write the length function in such a way that it will be general. But sometimes you'll be writing a more specific kind of function where this limitation isn't a problem. In fact, in some cases you won't be able to avoid it because part of the work of the function is to compare values for equality.

I then talked about how to implement a function called member that would return true or false depending upon whether a particular value is a member of a list. I asked what kind of lists would make it easy to answer this question and someone said an empty list, in which case the answer is false, so we began with:

        fun member(x, []) = false
        ...

I asked people whether the variable x is used in this case and everyone said no. In ML, when you're not using the value of a variable, it is customary to use an anonymous variable instead, which we indicate with an underscore:

        fun member(_, []) = false
        ...

I then asked if any other cases would be easy. Someone said that if the list begins with the value you're looking for, then we'd know it's a member, so we tried saying:

        fun member(_, []) = false
        |   member(x, x::xs) = true

And what if it doesn't occur at the beginning of the list? Then we search the rest of the list for it:

        fun member(_, []) = false
        |   member(x, x::xs) = true
        |   member(x, y::ys) = member(x, ys);

When I loaded this definition into ML, we got an error message:

        Error: duplicate variable in pattern(s): x

Pattern matching is limited in what it can handle. In particular, it can't figure out this kind of match where the same variable appears twice in a single pattern. But we can do the same kind of thing ourselves with a boolean expression:

        fun member(_, []) = false
        |   member(x, y::ys) = (x = y) orelse member(x, ys);

This is a correct implementation of the function. When we loaded it into ML we got the polyEqual warning:

        wed.sml:7.11 Warning: calling polyEqual
        val member = fn : ''a * ''a list -> bool

The warning is okay in this case because it is implicit in the nature of member that it has to perform an equals comparison.

Then I said that I wanted to discuss an efficiency issue. I began with the concept of tail recursion. I said to consider a simple counting function:

        fun f1(n) =
            if n = 0 then 0
            else 2 + f1(n - 1);

This is a silly function to write because it just computes 2 * n, but it will allow us to perform an experiment. I then asked people to think about how we might write something like this with a loop. Someone said that we'd use some kind of sum variable, so it might look like this:

        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += 2;
        }

I said that for convenience, I wanted to rewrite this to go backwards:

        int sum = 0;
        for (int i = n; i > 0; i--) {
            sum += 2;
        }

Several times I've tried to make the point that you can turn this kind of loop code into a functional equivalent. If it was useful for the loop to have an extra variable for storing the current sum, then we can do the same thing with a helper function. We can have a 2-argument function that keeps track of the current sum in addition to the value of i. Using that idea, I wrote the following variation of f1:

        fun f2(n) =
            let fun helper(0, sum) = sum
                |   helper(i, sum) = helper(i - 1, sum + 2)
            in helper(n, 0)
            end;

They both compute 2 * n in a similar manner, but they have very different behavior in the interpreter. The f1 function ran noticeably slower than f2, especially when we used very large input values like f1(5000000) vs f2(5000000). Why would that be? Think about what happens when we compute f1(5):

        f1(5) =
        2 + f1(4) =
        2 + 2 + f1(3) =
        2 + 2 + 2 + f1(2) =
        2 + 2 + 2 + 2 + f1(1) =
        2 + 2 + 2 + 2 + 2 + f1(0) =
        2 + 2 + 2 + 2 + 2 + 0 = 10

Notice how the computation expands as we make recursive calls. After we reach the base case, we'll have a lot of computing left to do on the way back out. But notice the pattern for f2:

        f2(5) =
        helper(5, 0) =
        helper(4, 2) =
        helper(3, 4) =
        helper(2, 6) =
        helper(1, 8) =
        helper(0, 10) = 10

There is no expansion to the computation. The key thing to notice is that once we reach the base case, we have the overall answer. There is no computation left as we come back out of the recursive calls. This is a classic example of tail recursion. By definition, a tail recursive function is one in which the recursive call is the last thing the function does, so no additional computation remains after the base case is reached.

It is well known that tail recursive functions are easily written as a loop. Functional languages like Scheme and ML optimize tail recursive calls by internally executing them as if they were loops (which avoids generating a deep stack of function calls).
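To make the connection to loops concrete, here is a sketch in C++ of the three shapes discussed above: the non-tail-recursive f1, the accumulator-passing helper behind f2, and the loop that the helper is equivalent to. The names f1, f2 and helper mirror the ML code; the function f2_loop, the use of long, and the non-negativity check are assumptions added for illustration. Unlike Scheme and ML, the C++ standard does not promise that tail calls are optimized, so the recursive versions here are meant for reading, and only the loop should be trusted with very large n.

```cpp
#include <cassert>

// Not tail recursive: the "2 +" still has to run after the recursive call returns.
long f1(long n) {
    return n == 0 ? 0 : 2 + f1(n - 1);
}

// Tail recursive: the recursive call is the last thing helper does, and the
// running total travels along in the sum parameter.
long helper(long i, long sum) {
    return i == 0 ? sum : helper(i - 1, sum + 2);
}

long f2(long n) {
    return helper(n, 0);
}

// The loop that helper corresponds to: each recursive call becomes one
// iteration that updates i and sum in place.
long f2_loop(long n) {
    assert(n >= 0);  // like the ML versions, this assumes a non-negative n
    long i = n;
    long sum = 0;
    while (i != 0) {
        i = i - 1;
        sum = sum + 2;
    }
    return sum;
}
```

For example, f1(5), f2(5) and f2_loop(5) all produce 10, matching the traces above, and f2_loop(5000000) runs in constant stack space.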

I also mentioned that the versions of map, filter and reduce that I've shown and that appear in the Ullman book are not tail-recursive. The standard operators like List.map, List.filter, List.foldl and List.foldr are written in a tail-recursive manner to make them more efficient.

Stuart Reges

Last modified: Fri Apr 16 17:02:00 PDT 2010