CSE 341: Unsafe languages (C)

Until now, all the languages we've studied in this class have been type safe: it is not possible to corrupt the heap by using a piece of memory as an incorrect type.

However, a large fraction of the world's programming is done in C and C++, languages that are not type safe, for various reasons. Sometimes programmers need features that most high-level languages do not currently provide:

Often, programmers who believe they must program in unsafe languages are actually mistaken. For example, it is often the sufficient to implement certain performance-critical routines in C, and write the rest of the application in a high-level language.

But one might ask: why is it bad to program in C and C++, anyway? What makes these languages unsafe, and what does that mean? In these notes, we explore a few of C's unsafe features, and why programmers use them.

A brief tour of C and its type system

C has the roughly the same types of statements and control structures (if, while, for, etc.) as Java (minus objects). We'll focus on the important differences between C and Java, beginning with the type system.

C's base types, whose sizes vary depending on the target machine, include the following:

C's compound types include (but are not limited to) the following:

Unlike every other language we've studied so far, C uses explicit indirection --- rather than referring to all values by pointer, C distinguishes between direct values and indirect values:

For any value, the address (the location where the value is stored) can be obtained by applying the & operator:

struct point_t q = { 5, 6 };  /* Direct point_t value */
struct point_t *p = &q;   /* Pointer to q */

For any pointer value, *expr evaluates to the value pointed-to. expr->fieldName is a sugar for evaluating fields of pointed-to structures:

int x_coord = (*p).x;  /* Dereference/access of p */
int y_coord = p->y;    /* Sugar for field access through struct pointers */

On the left-hand side of an assignment, *expr evaluates to a location, which will be stored into:

struct point_t r = { 7, 8 };
(*p) = r;   /* Assignment through pointer; updates p */

Safety risks in C

In safe languages, values are managed "from the cradle to the grave":

  1. Objects are created and initialized in a type-safe way.
  2. An object cannot be corrupted during its life time: its representation is in accordance with its type.
  3. Objects are destroyed, and their memory reclaimed, in a type-safe way.

C does not have any of these protections:

  1. C heap values are created in a type-unsafe way.
  2. C casts, unchecked array accesses, and unsafe deallocation can corrupt memory during its lifetime.
  3. C deallocation is unsafe, and can lead to dangling pointers.

Casts

Type-safe languages use some combination of static and dynamic checks to ensure that types cannot be violated. No program can write an integer into a string location. Statically typed languages like ML rule out such accesses by carefully designing the type system to prove that no such operation can occur. Dynamically typed languages rule out such accesses by tagging values with their types, and then checking operations at runtime to make sure they apply to type-correct arguments.

C is not a type-safe language. In C, any type can be viewed at any other type simply by using a cast:

(type)expr

This operation causes expr to be treated as type, regardless of the type of expr.

Unlike Java casts, C casts are not dynamically checked for "truthfulness" --- if the values are incompatible, the compiler will "do the best it can", usually simply reinterpreting the in-memory bit pattern of expr as if it belonged to type.

For example, a pointer to a character can simply be cast to a pointer to a double:

void f(char* char_ptr) {
    double* d_ptr = (double*)char_ptr;
    (*d_ptr) = 3.5;
}

A standard C-code "exploit" (malicious attack) involves writing some "bad" instructions into some location in memory that will later be executed. These instructions might, for example, store the payload of a virus. Casts provide a trivial way to do store arbitrary instructions into code addresses:

/* pretend 0x010000 is a hexadecimal address of code */
int i = 0x010000;

/* treat this integer as a pointer, using a cast */
int * i_ptr = (int*)i;

/* store through this pointer --- placing bad data at address */
(*i_ptr) = 0xBADCAT;

Later, when the code jumps to 0x010000, the code will execute the instruction 0xBADCAT instead of whatever instruction was there before.

As we shall see, in practice this particular "cast attack" is not how most C programs get exploited by malicious programmers. Malicious programmers are more likely to exploit array bounds errors. However, there are idiomatic uses of C code for which bad casts often cause inadvertent bugs (as opposed to malicious ones).

Manual memory management

All the languages we've studied previously have made some strong, implicit guarantees about dynamic memory allocation and deallocation:

C provides none of these guarantees:

Allocation is type-unsafe: malloc

In C, dynamic memory allocation is not built into the language; rather, it is provided by a library routine named malloc, which has the following profile:

void * malloc(size_t size)

This profile bears some explaining:

So, malloc takes a memory size, and returns a chunk of memory of the requested size. Since malloc returns void*, the client programmer will typically cast its result to some useful type:

struct point_t * fresh_point =
    (struct point_t *)malloc(sizeof(struct point_t));

This newly allocated memory is uninitialized --- it is not even zero-filled --- so the programmer must initialize it:

fresh_point->x = 0;
fresh_point->y = 1;

Additionally, the argument to malloc is not checked against the type to which its result is cast, so it is possible to get it wrong:

struct point_t * pseudo_point = (struct point_t *)malloc(sizeof(char));

Deallocation is unsafe: free, stack allocation

In garbage-collected languages, deallocation is implicit. The programmer never explicitly tells the program to free the memory used by some object. This is (usually) good, because memory should only be reclaimed when it can never be used again, and programmers are not very good at figuring out by hand when memory can never be used again. Garbage collectors are better at this than humans: they only reclaim memory of objects they can prove will never be used.

C is not garbage collected. It has a free function:

void free(void* mem_ptr)

free deallocates the memory to which it points. This has obvious bad implications; for example:

int *i = (int*)malloc(sizeof(int));
int j = 0;
*i = 4;
free(i);
/* ... some code that might allocate memory, then: */
*i = j;  /* Use deallocated memory. */

Here, *i is referring to memory that has been deallocated. When you deallocate memory, it goes back into the pool from which malloc draws, so the memory that i points to could have been reallocated and currently in use for some other purpose.

This is called the dangling pointer problem: the pointer can point to invalid memory.

Stack allocation and dangling pointers

C does have one type of implicitly deallocated data. Because C has explicit indirection --- not all values are pointers --- C can store local data on the stack. In the C declaration:

struct point_t * f() {
    struct point_t p = {0, 1};
    struct point_t * q =
        (struct point_t *)malloc(sizeof(struct point_t));
    q->x = 2;
    q->y = 3;
    ...
    return &p;
}
[activation record of f; p has x=0, y=1 struct point_t
      inline, whereas q is a pointer to a heap-allocated value {2,
      3}]
Fig. 1: Activation record of f, showing distinction between p and q.

the value p is stored in the activation record for f, not as a pointer to a value in the heap. By constrast, the value q is a pointer to a dynamically allocated heap value, akin to an ML record {x=2, y=3}. See Fig. 1.

When a C function returns, its activation record is deallocated, just as in ML or Java*. However, the ability to store values directly on the heap, combined with C's & (address-of) operator, makes this form of automatic deallocation as unsafe as malloc/free.

Consider f, above: it returns the address of p, which will no longer be valid when the client accesses it. Hence, any function that saves the return value of f will have a dangling pointer:

struct point_t * dangling = f();

* Or, indeed, any language without first-class continuations.

Allocation is not necessarily private

Deallocation is unsafe, and C has other features (like casts) that allow you to corrupt arbitrary locations in memory. Therefore, when you receive a pointer to "fresh" memory from malloc, you actually have no guarantee that there are no other pointers into this data.

When you program in an unsafe language, no memory in the current process is ever truly private.

Why should allocation and deallocation be unsafe?

The short answer: there's no particularly good reason for allocation to be unsafe. In fact, C++, which aspires to be no more "safe" than C in most respects, replaces malloc with the new operator --- which takes a type instead of a raw size, and returns an appropriately typed pointer.

That leaves us with deallocation. As our free dangling pointer example shows, a language must be garbage collected in order to have safe deallocation.* If a programmer can reclaim arbitrary heap data manually, then the programmer can reclaim data that will be used again.

For many years, garbage collection was popularly perceived to be too impractical and slow for real applications. And, to some extent, this perception was true: machines were slow and primitive, and high-performance implementations of garbage collection were not widely available for systems that people wanted to use.

However, sometime on or around 1994, the world changed. For various reasons, the introduction of Java succeeded in convincing programmers and managers that garbage-collected languages had acceptabe performance costs. One of the major reasons is that programmer productivity, reliability, and security are now much more important than raw speed for most applications.

Manual memory management (using free or its equivalents) is still needed for certain niche, low-level applications. For example, for various reasons operating system kernels generally cannot be written in purely garbage-collected languages. However, only a small fraction of programmers write such software.

* This is true for arbitrary heap-allocated data --- we'll see on Wednesday that, provided you can live with some restrictions, you can regain type safety without full garbage collection.) For many years, garbage collection was popularly perceived to be too impractical and slow for real applications.

Arrays

We've only briefly discussed arrays and vectors in this class. However, all the languages we've discussed so far have safe arrays: you cannot read or write beyond the end of an array. This is enforced using dynamic checks: when you write the Smalltalk expression

anArray at: 123

Smalltalk will verify that the array has at least 123 elements before returning the 123rd. This feature is called array bounds checking.

C does not have automatic array bounds checking. The programmer must therefore check manually. that an index is within bounds before accessing array elements.

Of all the things we'll discuss today, this is probably the most important w.r.t. code exploits: most C program security holes are "buffer overruns", i.e. cases when a program forgets to check some the bounds for some buffer (array) and overruns the end.

In C, the following code will not raise a dynamic error:

int i_arr[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
i_arr[20] = 0xBADCAT;

In fact, even the following code will not raise an error:

i_arr[-8] = 0xBADCAT;

Some compilers may warn about a negative index when it's a direct constant, but this is a dumb check that can be circumvented trivially:

int j = readSomeUserInput();
i_arr[j] = 0xBADCAT;

In reality, all memory inside a process is one big linear array. Allowing code to read or write out-of-bounds array offsets amounts to allowing it to read or write any memory location, with corrupting results similar to those for casts.

This is especially bad for locally allocated arrays, because these arrays are stored into the local activation record, which contains the return address of the current activation. The return address points to the code that called the current function. When the current function completes execution, it will jump to the return address. By storing into a selected offset from a local variable, you can overwrite the return address, and make the program jump to arbitrary code.

Null-termination

The above problem with arrays is exacerbated by the fact that C has a peculiar (but useful) way of representing strings. In most high-level languages, arrays are represented in memory by a length field, followed by an array of characters of appropriate size, as in Fig. 2(a).

In C, strings (arrays of bytes) are often represented by an array of characters terminated by the null character '\0', which is the byte with the bit-pattern consisting of only zeros. See Fig. 2(b).

[high-level language array: [27,'h','e',...,'\n']; C
      null-terminated array: ['h','e',...,'\n','\0',?,...?] (with
      garbage characters following null
Fig. 2: Representation of immutable string in a high-level language; representation of null-terminated string in C.

Many C library routines for manipulating strings, like strcpy "string copy") take null-terminated arrays as arguments. The problem is that these functions do not take maximum lengths as arguments. If strcpy is given arrays of mismatched length, it will happily copy its data onto the shorter array, overwriting the end of the array:

/* An array of five uninitialized characters */
char arr[5];

/* Literal strings have an implicit null-terminator. */
char arr_2[] = "world..............long data!";

/* strcpy copies from its second argument onto its first
   until it hits the null-terminator at the end of the second. */
strcpy(arr, arr2);

Most array-bounds attacks are based on feeding some carefully crafted oversize string as input to a program (e.g., a web server), which leads to an inadvertent buffer overrun when that string gets copied somewhere else. The string then overwrites some "good" code with malicious code. Once that code gets executed, the attack is complete.

There are "safer" alternatives to the unsafe C functions (e.g., strncpy, which takes an additional parameter indicating the maximum length of the target string) but programmers still use the old functions; and new functions dealing with null-terminators are written all the time. Fundamentally, the problem is that the string does not carry the length of its buffer, so the bounds cannot be checked, and the programmer must check them manually.

Why else do C programmers use casts?

C has no parametric polymorphism

In ML, if you were to write a function that sorts a list, it would probably have the type:

val sort : 'a list -> ('a * 'a -> bool) -> 'a list

That is, it would take a list of some type, and a function that compared the elements of the list for ordering (i.e., a "greater than" function), and return a fresh list.

This type relies on parametric polymorphism --- it has the type variable 'a. This type tells ML's type system to verify that the arguments of the comparison function have the same type as the elements of the list.

C has no parametric polymorphism. But C does have a generic sorting function for arrays, which has the following profile:

void qsort(void * base,
                int nmemb,
                size_t size,
                int (*compar) (void *, void *));

The syntax is rather cryptic (and I've actually cleaned this up slightly for presentation). Let's go through each argument in turn:

Here's an example of using this function:

int double_compar(double * d1, double * d2) {
    if (*d1 > *d2) {
        return 1;
    } else if (*d1 < *d2) {
        return -1;
    } else {
        return 0;
    }
}

void test() {
    double[10] d_arr = ...;

    /* Sorts elements of d_arr in place.
       In this call, d_arr is implicitly cast to void*,
       and double_compar is implicitly cast to
       int (*) (void*,void*). */
    qsort(d_arr, 10, sizeof(double), double_compar);
}

As the comments state, this call relies on implicit casts to void*. There is nothing preventing us from passing in mismatched functions, or incorrect element sizes (or, for that matter, incorrect array bounds):

int char_compar(char* c1, char* c2) {
    ...
}

void test2() {
    double d[10] d_arr = ...;

    /* This call is wrong for at least three reasons. */
    qsort(d_arr, 20, sizeof(int), char_compar);
}

C has no subtype polymorphism

C structs do not support subtyping of any kind. Therefore, the following two structs are not assignable:

struct point_t {
    int x;
    int y;
};

struct point3d_t {
    int x;
    int y;
    int z;
};

In order to simulate subtyping in C, programmers use casts on the pointer types:

point3d_t p = { 0, 1, 2 };
point_t *q = (point_t *)&p;
int qx = q->x;
int qy = q->y;

This works because C guarantees that structs with identical prefixes of fields will place their fields at the same offset from the beginning of the struct value.

Lessons

Big lessons of C:

  1. Unrestricted manual memory management is unsafe.
    (Corollary: Garbage collection is a big win.)
  2. The ability to allocate data directly on the stack, and then take its address, is unsafe.
    (Corollary: The uniform reference model of data is a big win.)
  3. Unchecked array accesses are unsafe.
    (Corollary: Array bounds checking is a big win.)
  4. Polymorphic types are not merely an academic feature --- if you don't have subtyping and parametric polymorphism, you will need either code duplication or some form of casts. Both duplication and casts lead to bugs, and casts are unsafe.

On Wednesday, we'll learn that sophisticated language technology can complicate the first two of the above points. Cyclone, a research language for "C-level programming", combines type safety with the ability to perform low-level memory management and use stack-allocated data.