[ ^ CSE 341, Winter 2004 home page | Lectures index ]

CSE 341: Unsafe languages (C)

Until now, all the languages we've studied in this class have been type safe: it is not possible to corrupt the heap by using a piece of memory as an incorrect type.

However, a large fraction of the world's programming is done in C and C++, languages that are not type safe, for various reasons. Sometimes programmers need features that most high-level languages do not currently provide:

explicit control over low-level issues like the exact layout of objects in memory
extremely high-performance, which may be difficult to achieve with high-level languages (for example, array-bounds checks may be unacceptably expensive for some scientific computations)
the ability to implement runtime systems for high-level languages (it's hard to implement a garbage collector in a garbage-collected language)
interfaces to code that can only interoperate conveniently and efficiently with a low-level language (for example, it may be expensive or inconvenient to pass objects from a garbage collected language into a library that performs manual memory management)

Often, programmers who believe they must program in unsafe languages are actually mistaken. For example, it is often the sufficient to implement certain performance-critical routines in C, and write the rest of the application in a high-level language.

But one might ask: why is it bad to program in C and C++, anyway? What makes these languages unsafe, and what does that mean? In these notes, we explore a few of C's unsafe features, and why programmers use them.

A brief tour of C and its type system

C has the roughly the same types of statements and control structures (if, while, for, etc.) as Java (minus objects). We'll focus on the important differences between C and Java, beginning with the type system.

C's base types, whose sizes vary depending on the target machine, include the following:

char: one byte; 8 bits on all modern machine architectures. The name is somewhat deceptive: char is often used for raw machine bytes, not characters per se, and conversely modern software often uses 16 bits for application-level characters (due to the needs of multilingual software).
```
char c1 = 'a';
char c2 = 127;   /* max signed character value on a standard machine */
```
int: an integer, whose size depends on the native word size on the target machine; on a 32-bit machine, this is 32 bits (4 bytes), whereas on a 64-bit machine this is usually 64 bits (8 bytes)
float and double: the low- and high-precision floating point sizes (on a 32-bit machine, usually at least 4 and 8 words respectively, but this varies, as with int)
Various other sizes of integers: short, long, and others, which may or may not be different from int depending on the target machine

C's compound types include (but are not limited to) the following:

Arrays: for any type T, T[] is an array of T. For example, int[] is an array of integers. Unlike Java, the declaration syntax puts the brackets after the variable name, not on the type:

int x[10];         /* array of 10 integers */
char buf[MAX_BUF]; /* array of MAX_BUF characters */

/* A function that takes a char array of unknown size.
   Unspecified-size arrays cannot be declared as locals; see below. */
void f(char x[], int len);

Records, called "structs", which are declared with the struct keyword and a list of field declarations in curly braces:
```
struct StructTag {
    type1 fieldName1;
    type2 fieldName2;
    ...
    typeN fieldNameN;
}
```
For example:
```
struct point_t {
    int x;
    int y;
};
```
C uses nominal (name-based) type equivalence --- two different struct declarations are incompatible, even if they are identical. Two different struct declarations may not have the same type name.

Structure initialization statements take a list of initialization expressions for the fields between braces:
```
struct point_t origin = { 1, 0 };  /* initialize x to 1, and y to 0 */
```
Order of fields matters; the initialization expressions are assigned to the fields of the struct in turn.

Notice that we write struct point_t above, not just point_t. Due to a quirk of C syntax, declarations of struct values have the form struct StructTag, not StructTag.

Unlike every other language we've studied so far, C uses explicit indirection --- rather than referring to all values by pointer, C distinguishes between direct values and indirect values:

For any T, T* is a pointer to a value of type T:

struct point_t *p_ptr = NULL;  /* pointer to a point */
int ** int_ptr_ptr = NULL;     /* pointer to a pointer to an int */

For any value, the address (the location where the value is stored) can be obtained by applying the & operator:

struct point_t q = { 5, 6 };  /* Direct point_t value */
struct point_t *p = &q;   /* Pointer to q */

For any pointer value, *expr evaluates to the value pointed-to. expr->fieldName is a sugar for evaluating fields of pointed-to structures:

int x_coord = (*p).x;  /* Dereference/access of p */
int y_coord = p->y;    /* Sugar for field access through struct pointers */

On the left-hand side of an assignment, *expr evaluates to a location, which will be stored into:

struct point_t r = { 7, 8 };
(*p) = r;   /* Assignment through pointer; updates p */

Safety risks in C

In safe languages, values are managed "from the cradle to the grave":

Objects are created and initialized in a type-safe way.
An object cannot be corrupted during its life time: its representation is in accordance with its type.
Objects are destroyed, and their memory reclaimed, in a type-safe way.

C does not have any of these protections:

C heap values are created in a type-unsafe way.
C casts, unchecked array accesses, and unsafe deallocation can corrupt memory during its lifetime.
C deallocation is unsafe, and can lead to dangling pointers.

Casts

Type-safe languages use some combination of static and dynamic checks to ensure that types cannot be violated. No program can write an integer into a string location. Statically typed languages like ML rule out such accesses by carefully designing the type system to prove that no such operation can occur. Dynamically typed languages rule out such accesses by tagging values with their types, and then checking operations at runtime to make sure they apply to type-correct arguments.

C is not a type-safe language. In C, any type can be viewed at any other type simply by using a cast:

(type)expr

This operation causes expr to be treated as type, regardless of the type of expr.

Unlike Java casts, C casts are not dynamically checked for "truthfulness" --- if the values are incompatible, the compiler will "do the best it can", usually simply reinterpreting the in-memory bit pattern of expr as if it belonged to type.

For example, a pointer to a character can simply be cast to a pointer to a double:

void f(char* char_ptr) {
    double* d_ptr = (double*)char_ptr;
    (*d_ptr) = 3.5;
}

A standard C-code "exploit" (malicious attack) involves writing some "bad" instructions into some location in memory that will later be executed. These instructions might, for example, store the payload of a virus. Casts provide a trivial way to do store arbitrary instructions into code addresses:

/* pretend 0x010000 is a hexadecimal address of code */
int i = 0x010000;

/* treat this integer as a pointer, using a cast */
int * i_ptr = (int*)i;

/* store through this pointer --- placing bad data at address */
(*i_ptr) = 0xBADCAT;

Later, when the code jumps to 0x010000, the code will execute the instruction 0xBADCAT instead of whatever instruction was there before.

As we shall see, in practice this particular "cast attack" is not how most C programs get exploited by malicious programmers. Malicious programmers are more likely to exploit array bounds errors. However, there are idiomatic uses of C code for which bad casts often cause inadvertent bugs (as opposed to malicious ones).

Manual memory management

All the languages we've studied previously have made some strong, implicit guarantees about dynamic memory allocation and deallocation:

Allocation is type correct: One allocates memory of some given type; this memory is guaranteed to be of suitable size for its intended use.
Allocation is private: This is a consequence of the above two points: No other part of the program possesses a pointer into freshly allocated data except the procedure that requested the allocation.
Deallocation is safe: The garbage collector will not deallocate any data that may be used again --- it only collects unreachable data.

C provides none of these guarantees:

Allocation is type-unsafe: `malloc`

In C, dynamic memory allocation is not built into the language; rather, it is provided by a library routine named malloc, which has the following profile:

void * malloc(size_t size)

This profile bears some explaining:

void* is C's type meaning "a pointer of unknown type" --- one might call it "void-star-polymorphism". void* resembles Object, except that there are no values of type void* (it is purely a placeholder type) and casts from void* to some other type are not checked dynamically.
size_t is C's type for the size of some memory object. One usually obtains values of this type by applying the special "function" sizeof, which determines the size of a given type at compile time.

So, malloc takes a memory size, and returns a chunk of memory of the requested size. Since malloc returns void*, the client programmer will typically cast its result to some useful type:

struct point_t * fresh_point =
    (struct point_t *)malloc(sizeof(struct point_t));

This newly allocated memory is uninitialized --- it is not even zero-filled --- so the programmer must initialize it:

fresh_point->x = 0;
fresh_point->y = 1;

Additionally, the argument to malloc is not checked against the type to which its result is cast, so it is possible to get it wrong:

struct point_t * pseudo_point = (struct point_t *)malloc(sizeof(char));

Deallocation is unsafe: `free`, stack allocation

In garbage-collected languages, deallocation is implicit. The programmer never explicitly tells the program to free the memory used by some object. This is (usually) good, because memory should only be reclaimed when it can never be used again, and programmers are not very good at figuring out by hand when memory can never be used again. Garbage collectors are better at this than humans: they only reclaim memory of objects they can prove will never be used.

C is not garbage collected. It has a free function:

void free(void* mem_ptr)

free deallocates the memory to which it points. This has obvious bad implications; for example:

int *i = (int*)malloc(sizeof(int));
int j = 0;
*i = 4;
free(i);
/* ... some code that might allocate memory, then: */
*i = j;  /* Use deallocated memory. */

Here, *i is referring to memory that has been deallocated. When you deallocate memory, it goes back into the pool from which malloc draws, so the memory that i points to could have been reallocated and currently in use for some other purpose.

This is called the dangling pointer problem: the pointer can point to invalid memory.

Stack allocation and dangling pointers

C does have one type of implicitly deallocated data. Because C has explicit indirection --- not all values are pointers --- C can store local data on the stack. In the C declaration:

struct point_t * f() {
    struct point_t p = {0, 1};
    struct point_t * q =
        (struct point_t *)malloc(sizeof(struct point_t));
    q->x = 2;
    q->y = 3;
    ...
    return &p;
}

$[activation record of f; p has x=0, y=1 struct point_t inline, whereas q is a pointer to a heap-allocated value {2, 3}]$

Fig. 1: Activation record of f, showing distinction between p and q.

the value p is stored in the activation record for f, not as a pointer to a value in the heap. By constrast, the value q is a pointer to a dynamically allocated heap value, akin to an ML record {x=2, y=3}. See Fig. 1.

When a C function returns, its activation record is deallocated, just as in ML or Java*. However, the ability to store values directly on the heap, combined with C's & (address-of) operator, makes this form of automatic deallocation as unsafe as malloc/free.

Consider f, above: it returns the address of p, which will no longer be valid when the client accesses it. Hence, any function that saves the return value of f will have a dangling pointer:

struct point_t * dangling = f();

* Or, indeed, any language without first-class continuations.

Allocation is not necessarily private

Deallocation is unsafe, and C has other features (like casts) that allow you to corrupt arbitrary locations in memory. Therefore, when you receive a pointer to "fresh" memory from malloc, you actually have no guarantee that there are no other pointers into this data.

When you program in an unsafe language, no memory in the current process is ever truly private.

Why should allocation and deallocation be unsafe?

The short answer: there's no particularly good reason for allocation to be unsafe. In fact, C++, which aspires to be no more "safe" than C in most respects, replaces malloc with the new operator --- which takes a type instead of a raw size, and returns an appropriately typed pointer.

That leaves us with deallocation. As our free dangling pointer example shows, a language must be garbage collected in order to have safe deallocation.* If a programmer can reclaim arbitrary heap data manually, then the programmer can reclaim data that will be used again.

For many years, garbage collection was popularly perceived to be too impractical and slow for real applications. And, to some extent, this perception was true: machines were slow and primitive, and high-performance implementations of garbage collection were not widely available for systems that people wanted to use.

However, sometime on or around 1994, the world changed. For various reasons, the introduction of Java succeeded in convincing programmers and managers that garbage-collected languages had acceptabe performance costs. One of the major reasons is that programmer productivity, reliability, and security are now much more important than raw speed for most applications.

Manual memory management (using free or its equivalents) is still needed for certain niche, low-level applications. For example, for various reasons operating system kernels generally cannot be written in purely garbage-collected languages. However, only a small fraction of programmers write such software.

* This is true for arbitrary heap-allocated data --- we'll see on Wednesday that, provided you can live with some restrictions, you can regain type safety without full garbage collection.) For many years, garbage collection was popularly perceived to be too impractical and slow for real applications.

Arrays

We've only briefly discussed arrays and vectors in this class. However, all the languages we've discussed so far have safe arrays: you cannot read or write beyond the end of an array. This is enforced using dynamic checks: when you write the Smalltalk expression

anArray at: 123

Smalltalk will verify that the array has at least 123 elements before returning the 123rd. This feature is called array bounds checking.

C does not have automatic array bounds checking. The programmer must therefore check manually. that an index is within bounds before accessing array elements.

Of all the things we'll discuss today, this is probably the most important w.r.t. code exploits: most C program security holes are "buffer overruns", i.e. cases when a program forgets to check some the bounds for some buffer (array) and overruns the end.

In C, the following code will not raise a dynamic error:

int i_arr[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
i_arr[20] = 0xBADCAT;

In fact, even the following code will not raise an error:

i_arr[-8] = 0xBADCAT;

Some compilers may warn about a negative index when it's a direct constant, but this is a dumb check that can be circumvented trivially:

int j = readSomeUserInput();
i_arr[j] = 0xBADCAT;

In reality, all memory inside a process is one big linear array. Allowing code to read or write out-of-bounds array offsets amounts to allowing it to read or write any memory location, with corrupting results similar to those for casts.

This is especially bad for locally allocated arrays, because these arrays are stored into the local activation record, which contains the return address of the current activation. The return address points to the code that called the current function. When the current function completes execution, it will jump to the return address. By storing into a selected offset from a local variable, you can overwrite the return address, and make the program jump to arbitrary code.

Null-termination

The above problem with arrays is exacerbated by the fact that C has a peculiar (but useful) way of representing strings. In most high-level languages, arrays are represented in memory by a length field, followed by an array of characters of appropriate size, as in Fig. 2(a).

In C, strings (arrays of bytes) are often represented by an array of characters terminated by the null character '\0', which is the byte with the bit-pattern consisting of only zeros. See Fig. 2(b).

$[high-level language array: [27,'h','e',...,'\n']; C null-terminated array: ['h','e',...,'\n','\0',?,...?] (with garbage characters following null$

Fig. 2: Representation of immutable string in a high-level language; representation of null-terminated string in C.

Many C library routines for manipulating strings, like strcpy "string copy") take null-terminated arrays as arguments. The problem is that these functions do not take maximum lengths as arguments. If strcpy is given arrays of mismatched length, it will happily copy its data onto the shorter array, overwriting the end of the array:

/* An array of five uninitialized characters */
char arr[5];

/* Literal strings have an implicit null-terminator. */
char arr_2[] = "world..............long data!";

/* strcpy copies from its second argument onto its first
   until it hits the null-terminator at the end of the second. */
strcpy(arr, arr2);

Most array-bounds attacks are based on feeding some carefully crafted oversize string as input to a program (e.g., a web server), which leads to an inadvertent buffer overrun when that string gets copied somewhere else. The string then overwrites some "good" code with malicious code. Once that code gets executed, the attack is complete.

There are "safer" alternatives to the unsafe C functions (e.g., strncpy, which takes an additional parameter indicating the maximum length of the target string) but programmers still use the old functions; and new functions dealing with null-terminators are written all the time. Fundamentally, the problem is that the string does not carry the length of its buffer, so the bounds cannot be checked, and the programmer must check them manually.

Why else do C programmers use casts?

C has no parametric polymorphism

In ML, if you were to write a function that sorts a list, it would probably have the type:

val sort : 'a list -> ('a * 'a -> bool) -> 'a list

That is, it would take a list of some type, and a function that compared the elements of the list for ordering (i.e., a "greater than" function), and return a fresh list.

This type relies on parametric polymorphism --- it has the type variable 'a. This type tells ML's type system to verify that the arguments of the comparison function have the same type as the elements of the list.

C has no parametric polymorphism. But C does have a generic sorting function for arrays, which has the following profile:

void qsort(void * base,
                int nmemb,
                size_t size,
                int (*compar) (void *, void *));

The syntax is rather cryptic (and I've actually cleaned this up slightly for presentation). Let's go through each argument in turn:

void *base: this is a pointer to an array of items. Since qsort is generic over many types of lists, we don't have a specific array type, only void*.
nmemb is a count of members in the array --- i.e., the array size.
size is the size of each element in the array. qsort needs this because it will use this size to determine what offsets to pass to the comparison function.
compar is a function that, given two pointers into the array that starts at address base, will return an integer code indicating what ordering to assume over these array elements.

Here's an example of using this function:

int double_compar(double * d1, double * d2) {
    if (*d1 > *d2) {
        return 1;
    } else if (*d1 < *d2) {
        return -1;
    } else {
        return 0;
    }
}

void test() {
    double[10] d_arr = ...;

    /* Sorts elements of d_arr in place.
       In this call, d_arr is implicitly cast to void*,
       and double_compar is implicitly cast to
       int (*) (void*,void*). */
    qsort(d_arr, 10, sizeof(double), double_compar);
}

As the comments state, this call relies on implicit casts to void*. There is nothing preventing us from passing in mismatched functions, or incorrect element sizes (or, for that matter, incorrect array bounds):

int char_compar(char* c1, char* c2) {
    ...
}

void test2() {
    double d[10] d_arr = ...;

    /* This call is wrong for at least three reasons. */
    qsort(d_arr, 20, sizeof(int), char_compar);
}

C has no subtype polymorphism

C structs do not support subtyping of any kind. Therefore, the following two structs are not assignable:

struct point_t {
    int x;
    int y;
};

struct point3d_t {
    int x;
    int y;
    int z;
};

In order to simulate subtyping in C, programmers use casts on the pointer types:

point3d_t p = { 0, 1, 2 };
point_t *q = (point_t *)&p;
int qx = q->x;
int qy = q->y;

This works because C guarantees that structs with identical prefixes of fields will place their fields at the same offset from the beginning of the struct value.

Lessons

Big lessons of C:

Unrestricted manual memory management is unsafe.
(Corollary: Garbage collection is a big win.)
The ability to allocate data directly on the stack, and then take its address, is unsafe.
(Corollary: The uniform reference model of data is a big win.)
Unchecked array accesses are unsafe.
(Corollary: Array bounds checking is a big win.)
Polymorphic types are not merely an academic feature --- if you don't have subtyping and parametric polymorphism, you will need either code duplication or some form of casts. Both duplication and casts lead to bugs, and casts are unsafe.

On Wednesday, we'll learn that sophisticated language technology can complicate the first two of the above points. Cyclone, a research language for "C-level programming", combines type safety with the ability to perform low-level memory management and use stack-allocated data.