Until now, all the languages we've studied in this class have been type safe: it is not possible to corrupt the heap by using a piece of memory as an incorrect type.
However, a large fraction of the world's programming is done in C and C++, languages that are not type safe, for various reasons. Sometimes programmers need features that most high-level languages do not currently provide:
Often, programmers who believe they must program in unsafe languages are actually mistaken. For example, it is often the sufficient to implement certain performance-critical routines in C, and write the rest of the application in a high-level language.
But one might ask: why is it bad to program in C and C++, anyway? What makes these languages unsafe, and what does that mean? In these notes, we explore a few of C's unsafe features, and why programmers use them.
C has the roughly the same types of statements and control
structures (if
, while
, for
,
etc.) as Java (minus objects). We'll focus on the important
differences between C and Java, beginning with the type
system.
C's base types, whose sizes vary depending on the target machine, include the following:
char
: one byte; 8 bits on all modern machine
architectures. The name is somewhat deceptive:
char
is often used for raw machine bytes, not
characters per se, and conversely modern software often uses 16
bits for application-level characters (due to the needs of
multilingual software).
char c1 = 'a'; char c2 = 127; /* max signed character value on a standard machine */
int
: an integer, whose size depends on the
native word size on the target machine; on a 32-bit machine,
this is 32 bits (4 bytes), whereas on a 64-bit machine this is
usually 64 bits (8 bytes)float
and double
: the low- and
high-precision floating point sizes (on a 32-bit machine,
usually at least 4 and 8 words respectively, but this varies, as
with int
)short
,
long
, and others, which may or may not be different
from int
depending on the target machineC's compound types include (but are not limited to) the following:
Arrays: for any type T
, T[]
is an
array of T
. For example, int[]
is
an array of integers. Unlike Java, the declaration syntax
puts the brackets after the variable name, not on the
type:
int x[10]; /* array of 10 integers */ char buf[MAX_BUF]; /* array of MAX_BUF characters */ /* A function that takes a char array of unknown size. Unspecified-size arrays cannot be declared as locals; see below. */ void f(char x[], int len);
Records, called "structs", which are declared with the
struct
keyword and a list of field declarations
in curly braces:
struct StructTag { type1 fieldName1; type2 fieldName2; ... typeN fieldNameN; }
For example:
struct point_t { int x; int y; };
C uses nominal (name-based) type equivalence --- two different struct declarations are incompatible, even if they are identical. Two different struct declarations may not have the same type name.
Structure initialization statements take a list of initialization expressions for the fields between braces:
struct point_t origin = { 1, 0 }; /* initialize x to 1, and y to 0 */
Order of fields matters; the initialization expressions are assigned to the fields of the struct in turn.
Notice that we write struct point_t
above, not
just point_t
. Due to a quirk of C syntax,
declarations of struct values have the form struct
StructTag
, not StructTag
.
Unlike every other language we've studied so far, C uses explicit indirection --- rather than referring to all values by pointer, C distinguishes between direct values and indirect values:
For any T
, T*
is a pointer to a
value of type T
:
struct point_t *p_ptr = NULL; /* pointer to a point */ int ** int_ptr_ptr = NULL; /* pointer to a pointer to an int */
For any value, the address (the location where
the value is stored) can be obtained by applying the
&
operator:
struct point_t q = { 5, 6 }; /* Direct point_t value */ struct point_t *p = &q; /* Pointer to q */
For any pointer value, *expr
evaluates to the
value pointed-to. expr->fieldName
is a sugar for
evaluating fields of pointed-to structures:
int x_coord = (*p).x; /* Dereference/access of p */ int y_coord = p->y; /* Sugar for field access through struct pointers */
On the left-hand side of an assignment, *expr
evaluates to a location, which will be stored into:
struct point_t r = { 7, 8 }; (*p) = r; /* Assignment through pointer; updates p */
In safe languages, values are managed "from the cradle to the grave":
C does not have any of these protections:
Type-safe languages use some combination of static and dynamic checks to ensure that types cannot be violated. No program can write an integer into a string location. Statically typed languages like ML rule out such accesses by carefully designing the type system to prove that no such operation can occur. Dynamically typed languages rule out such accesses by tagging values with their types, and then checking operations at runtime to make sure they apply to type-correct arguments.
C is not a type-safe language. In C, any type can be viewed at any other type simply by using a cast:
(type)expr
This operation causes expr
to be treated as
type
, regardless of the type of
expr
.
Unlike Java casts, C casts are not dynamically checked
for "truthfulness" --- if the values are incompatible, the
compiler will "do the best it can", usually simply reinterpreting
the in-memory bit pattern of expr
as if it
belonged to type
.
For example, a pointer to a character can simply be cast to a pointer to a double:
void f(char* char_ptr) { double* d_ptr = (double*)char_ptr; (*d_ptr) = 3.5; }
A standard C-code "exploit" (malicious attack) involves writing some "bad" instructions into some location in memory that will later be executed. These instructions might, for example, store the payload of a virus. Casts provide a trivial way to do store arbitrary instructions into code addresses:
/* pretend 0x010000 is a hexadecimal address of code */ int i = 0x010000; /* treat this integer as a pointer, using a cast */ int * i_ptr = (int*)i; /* store through this pointer --- placing bad data at address */ (*i_ptr) = 0xBADCAT;
Later, when the code jumps to 0x010000
, the code
will execute the instruction 0xBADCAT
instead of
whatever instruction was there before.
As we shall see, in practice this particular "cast attack" is not how most C programs get exploited by malicious programmers. Malicious programmers are more likely to exploit array bounds errors. However, there are idiomatic uses of C code for which bad casts often cause inadvertent bugs (as opposed to malicious ones).
All the languages we've studied previously have made some strong, implicit guarantees about dynamic memory allocation and deallocation:
C provides none of these guarantees:
malloc
In C, dynamic memory allocation is not built into the language;
rather, it is provided by a library routine named
malloc
, which has the following profile:
void * malloc(size_t size)
This profile bears some explaining:
void*
is C's type meaning "a pointer of unknown
type" --- one might call it "void-star-polymorphism".
void*
resembles Object
, except that
there are no values of type void*
(it is purely a
placeholder type) and casts from void*
to some
other type are not checked dynamically.size_t
is C's type for the size of some memory
object. One usually obtains values of this type by applying the
special "function" sizeof
, which determines the
size of a given type at compile time.So, malloc
takes a memory size, and returns a
chunk of memory of the requested size. Since malloc
returns void*
, the client programmer will typically
cast its result to some useful type:
struct point_t * fresh_point = (struct point_t *)malloc(sizeof(struct point_t));
This newly allocated memory is uninitialized --- it is not even zero-filled --- so the programmer must initialize it:
fresh_point->x = 0; fresh_point->y = 1;
Additionally, the argument to malloc
is not
checked against the type to which its result is cast, so it is
possible to get it wrong:
struct point_t * pseudo_point = (struct point_t *)malloc(sizeof(char));
free
, stack allocationIn garbage-collected languages, deallocation is implicit. The programmer never explicitly tells the program to free the memory used by some object. This is (usually) good, because memory should only be reclaimed when it can never be used again, and programmers are not very good at figuring out by hand when memory can never be used again. Garbage collectors are better at this than humans: they only reclaim memory of objects they can prove will never be used.
C is not garbage collected. It has a free
function:
void free(void* mem_ptr)
free
deallocates the memory to which it
points. This has obvious bad implications; for example:
int *i = (int*)malloc(sizeof(int)); int j = 0; *i = 4; free(i); /* ... some code that might allocate memory, then: */ *i = j; /* Use deallocated memory. */
Here, *i
is referring to memory that has been
deallocated. When you deallocate memory, it goes back into the
pool from which malloc
draws, so the memory that
i
points to could have been reallocated and currently
in use for some other purpose.
This is called the dangling pointer problem: the pointer can point to invalid memory.
C does have one type of implicitly deallocated data. Because C has explicit indirection --- not all values are pointers --- C can store local data on the stack. In the C declaration:
struct point_t * f() { struct point_t p = {0, 1}; struct point_t * q = (struct point_t *)malloc(sizeof(struct point_t)); q->x = 2; q->y = 3; ... return &p; }
the value p
is stored in the activation record
for f
, not as a pointer to a value in the heap.
By constrast, the value q
is a pointer to a
dynamically allocated heap value, akin to an ML record {x=2,
y=3}
. See Fig. 1.
When a C function returns, its activation record is
deallocated, just as in ML or Java*. However, the ability to
store values directly on the heap, combined with C's &
(address-of) operator, makes this form of automatic deallocation
as unsafe as malloc
/free
.
Consider f
, above: it returns the address of
p
, which will no longer be valid when the client
accesses it. Hence, any function that saves the return value of
f
will have a dangling pointer:
struct point_t * dangling = f();
* Or, indeed, any language without first-class continuations.
Deallocation is unsafe, and C has other features (like casts)
that allow you to corrupt arbitrary locations in memory.
Therefore, when you receive a pointer to "fresh" memory from
malloc
, you actually have no guarantee that there are
no other pointers into this data.
When you program in an unsafe language, no memory in the current process is ever truly private.
The short answer: there's no particularly good reason for
allocation to be unsafe. In fact, C++, which aspires to be no
more "safe" than C in most respects, replaces malloc
with the new
operator --- which takes a type instead
of a raw size, and returns an appropriately typed pointer.
That leaves us with deallocation. As our free
dangling pointer example shows, a language must be garbage
collected in order to have safe deallocation.* If a programmer can
reclaim arbitrary heap data manually, then the programmer can
reclaim data that will be used again.
For many years, garbage collection was popularly perceived to be too impractical and slow for real applications. And, to some extent, this perception was true: machines were slow and primitive, and high-performance implementations of garbage collection were not widely available for systems that people wanted to use.
However, sometime on or around 1994, the world changed. For various reasons, the introduction of Java succeeded in convincing programmers and managers that garbage-collected languages had acceptabe performance costs. One of the major reasons is that programmer productivity, reliability, and security are now much more important than raw speed for most applications.
Manual memory management (using free
or its
equivalents) is still needed for certain niche, low-level
applications. For example, for various reasons operating system
kernels generally cannot be written in purely garbage-collected
languages. However, only a small fraction of programmers write
such software.
* This is true for arbitrary heap-allocated data --- we'll see on Wednesday that, provided you can live with some restrictions, you can regain type safety without full garbage collection.) For many years, garbage collection was popularly perceived to be too impractical and slow for real applications.
We've only briefly discussed arrays and vectors in this class. However, all the languages we've discussed so far have safe arrays: you cannot read or write beyond the end of an array. This is enforced using dynamic checks: when you write the Smalltalk expression
anArray at: 123
Smalltalk will verify that the array has at least 123 elements before returning the 123rd. This feature is called array bounds checking.
C does not have automatic array bounds checking. The programmer must therefore check manually. that an index is within bounds before accessing array elements.
Of all the things we'll discuss today, this is probably the most important w.r.t. code exploits: most C program security holes are "buffer overruns", i.e. cases when a program forgets to check some the bounds for some buffer (array) and overruns the end.
In C, the following code will not raise a dynamic error:
int i_arr[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; i_arr[20] = 0xBADCAT;
In fact, even the following code will not raise an error:
i_arr[-8] = 0xBADCAT;
Some compilers may warn about a negative index when it's a direct constant, but this is a dumb check that can be circumvented trivially:
int j = readSomeUserInput(); i_arr[j] = 0xBADCAT;
In reality, all memory inside a process is one big linear array. Allowing code to read or write out-of-bounds array offsets amounts to allowing it to read or write any memory location, with corrupting results similar to those for casts.
This is especially bad for locally allocated arrays, because these arrays are stored into the local activation record, which contains the return address of the current activation. The return address points to the code that called the current function. When the current function completes execution, it will jump to the return address. By storing into a selected offset from a local variable, you can overwrite the return address, and make the program jump to arbitrary code.
The above problem with arrays is exacerbated by the fact that C has a peculiar (but useful) way of representing strings. In most high-level languages, arrays are represented in memory by a length field, followed by an array of characters of appropriate size, as in Fig. 2(a).
In C, strings (arrays of bytes) are often represented by an
array of characters terminated by the null
character '\0'
, which is the byte with the
bit-pattern consisting of only zeros. See Fig. 2(b).
Many C library routines for manipulating strings, like
strcpy
"string copy") take null-terminated arrays as
arguments. The problem is that these functions do not take
maximum lengths as arguments. If strcpy
is given
arrays of mismatched length, it will happily copy its data
onto the shorter array, overwriting the end of the array:
/* An array of five uninitialized characters */ char arr[5]; /* Literal strings have an implicit null-terminator. */ char arr_2[] = "world..............long data!"; /* strcpy copies from its second argument onto its first until it hits the null-terminator at the end of the second. */ strcpy(arr, arr2);
Most array-bounds attacks are based on feeding some carefully crafted oversize string as input to a program (e.g., a web server), which leads to an inadvertent buffer overrun when that string gets copied somewhere else. The string then overwrites some "good" code with malicious code. Once that code gets executed, the attack is complete.
There are "safer" alternatives to the unsafe C functions (e.g.,
strncpy
, which takes an additional parameter
indicating the maximum length of the target string) but
programmers still use the old functions; and new functions dealing
with null-terminators are written all the time. Fundamentally,
the problem is that the string does not carry the length of its
buffer, so the bounds cannot be checked, and the programmer must
check them manually.
In ML, if you were to write a function that sorts a list, it would probably have the type:
val sort : 'a list -> ('a * 'a -> bool) -> 'a list
That is, it would take a list of some type, and a function that compared the elements of the list for ordering (i.e., a "greater than" function), and return a fresh list.
This type relies on parametric polymorphism --- it has the type
variable 'a
. This type tells ML's type system to
verify that the arguments of the comparison function have the same
type as the elements of the list.
C has no parametric polymorphism. But C does have a generic sorting function for arrays, which has the following profile:
void qsort(void * base, int nmemb, size_t size, int (*compar) (void *, void *));
The syntax is rather cryptic (and I've actually cleaned this up slightly for presentation). Let's go through each argument in turn:
void *base
: this is a pointer to an array of
items. Since qsort
is generic over many types of
lists, we don't have a specific array type, only
void*
.nmemb
is a count of members in the array ---
i.e., the array size.size
is the size of each element in
the array. qsort
needs this because it will use
this size to determine what offsets to pass to the comparison
function.compar
is a function that, given two pointers
into the array that starts at address base
, will
return an integer code indicating what ordering to assume over
these array elements.Here's an example of using this function:
int double_compar(double * d1, double * d2) { if (*d1 > *d2) { return 1; } else if (*d1 < *d2) { return -1; } else { return 0; } } void test() { double[10] d_arr = ...; /* Sorts elements of d_arr in place. In this call, d_arr is implicitly cast to void*, and double_compar is implicitly cast to int (*) (void*,void*). */ qsort(d_arr, 10, sizeof(double), double_compar); }
As the comments state, this call relies on implicit casts to
void*
. There is nothing preventing us from passing
in mismatched functions, or incorrect element sizes (or, for that
matter, incorrect array bounds):
int char_compar(char* c1, char* c2) { ... } void test2() { double d[10] d_arr = ...; /* This call is wrong for at least three reasons. */ qsort(d_arr, 20, sizeof(int), char_compar); }
C structs do not support subtyping of any kind. Therefore, the following two structs are not assignable:
struct point_t { int x; int y; }; struct point3d_t { int x; int y; int z; };
In order to simulate subtyping in C, programmers use casts on the pointer types:
point3d_t p = { 0, 1, 2 }; point_t *q = (point_t *)&p; int qx = q->x; int qy = q->y;
This works because C guarantees that structs with identical prefixes of fields will place their fields at the same offset from the beginning of the struct value.
Big lessons of C:
On Wednesday, we'll learn that sophisticated language technology can complicate the first two of the above points. Cyclone, a research language for "C-level programming", combines type safety with the ability to perform low-level memory management and use stack-allocated data.