CSE 451 Autumn 2004: Linux Memory Management

Some details about Linux memory management

On this web page, I'll try to give you enough of the truth about Linux memory management to get your projects done, without diving into every single detail (that would take several chapters of a book; Understanding the Linux Kernel does exactly that).

The x86 memory management architecture uses both segmentation and paging; we will cover both of these concepts in class in several weeks. Very roughly speaking, a segment is a partition of a process's address space that has its own protection policy. So, in the x86 architecture, it is possible to split the range of memory addresses that a process sees into multiple contiguous segments, and assign different protection modes to each. Paging is a technique for mapping small (usually 4KB) regions of a process's address space to chunks of real, physical memory. Paging thus controls how regions inside a segment are mapped onto physical RAM.
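
To make the paging idea a bit more concrete, here is a minimal sketch of how a 32-bit address splits into a page number and an offset within that page, assuming 4 KB pages. The names PAGE_SHIFT, OFFSET_MASK, page_number(), and page_offset() are mine, chosen for illustration; how the hardware actually looks up the page number in its page tables is a topic for class.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KB pages, so the low 12 bits of an address are the offset. */
#define PAGE_SHIFT  12u
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define OFFSET_MASK (PAGE_SIZE - 1u)

static_assert(PAGE_SIZE == 4096u, "sketch assumes 4 KB pages");

/* Index of the page that contains a 32-bit virtual address. */
uint32_t page_number(uint32_t vaddr)
{
    return vaddr >> PAGE_SHIFT;
}

/* Byte offset of the address inside its page. */
uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & OFFSET_MASK;
}
```

For example, address 0xC0001234 falls in page 0xC0001 at offset 0x234. Paging maps each such page, as a unit, onto a chunk of physical memory.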

In Linux, the OS designers decided to carve up the 32-bit address space of each process into two pieces: the lower 3 GB belong to the process itself, and the upper 1 GB belongs to the kernel. All processes therefore have two segments (there are some details about additional segments that I'm hiding, but that's OK for now):

one segment (addresses 0x00000000 through 0xBFFFFFFF) for user-level, process-specific data such as the program's code, static data, heap, and stack. Every process has its own, independent user segment.
one segment (addresses 0xC0000000 through 0xFFFFFFFF) for kernel-specific data such as the kernel's instructions, its data, and some stacks on which kernel code can execute. More interestingly, a region of this segment is directly mapped to physical memory, so that the kernel can access physical memory locations without having to worry about address translation. (We'll talk about address translation later on in class.) The same kernel segment is mapped into every process, but a process can access it only while executing in protected kernel mode.
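
The two-segment layout above can be sketched in a few lines of C. This is only a model that runs as an ordinary user program: KERNEL_START, is_user_address(), is_kernel_address(), and direct_map_to_phys() are names I made up for illustration. The last function also assumes that the directly mapped region begins at the very start of the kernel segment with a constant offset, which is how I understand 32-bit x86 Linux to lay it out; treat that as an assumption rather than a guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Boundary between the user and kernel segments, as quoted in the text. */
#define KERNEL_START 0xC0000000u

/* The user segment covers exactly the lower 3 GB of the 4 GB address space. */
static_assert(KERNEL_START == 3u * (1u << 30), "user segment spans 3 GB");

/* True for addresses in the per-process user segment. */
bool is_user_address(uint32_t addr)
{
    return addr < KERNEL_START;
}

/* True for addresses in the kernel segment shared by all processes. */
bool is_kernel_address(uint32_t addr)
{
    return addr >= KERNEL_START;
}

/*
 * Assumption: the direct map starts at KERNEL_START, so a kernel virtual
 * address in that region differs from its physical address by a constant.
 * Only meaningful for addresses inside the directly mapped region.
 */
uint32_t direct_map_to_phys(uint32_t kernel_vaddr)
{
    return kernel_vaddr - KERNEL_START;
}
```

Under these assumptions, kernel virtual address 0xC0100000 would correspond to physical address 0x00100000, with no page-table walk needed to figure that out.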

So, in user mode (i.e., when a process is executing its program's instructions), the process may only access addresses less than 0xC0000000; any access to an address at or above this boundary results in a fault. However, when a user-mode process begins executing in the kernel (for instance, after having made a system call), the protection bit in the CPU is changed to supervisor mode (and some segmentation registers are changed), meaning that the process is then able to access addresses at or above 0xC0000000.

Because kernel-critical data structures and the mapping to physical memory are contained in the kernel segment, it is imperative that the user-level process can't cause the kernel to unwittingly read or write memory locations in this segment. Therefore, whenever the user-level process passes an address (e.g., a C pointer) into the kernel through a system call, that address needs to be carefully checked. In particular, the kernel needs to make sure that all such addresses are below 0xC0000000. (There is one more check that must be made, namely that the address references a piece of the process's address space that has actually been mapped or allocated, but ignore this for now, since the kernel uses some clever tricks to do this check transparently to your system call implementation.)
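
Here is a rough sketch of what such a boundary check has to look like when the process hands the kernel a buffer (an address plus a length) rather than a single address. It is a user-space model, not kernel code: USER_LIMIT and user_range_ok() are hypothetical names, and the hard-coded boundary is exactly the assumption the text warns about. The one subtlety worth seeing is that the check must not compute addr + len directly, because that sum can wrap around past the top of the 32-bit address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First address that is NOT part of the user segment (assumed boundary). */
#define USER_LIMIT 0xC0000000u

static_assert(USER_LIMIT == 3u * (1u << 30), "sketch assumes a 3 GB user segment");

/*
 * True if the len bytes starting at addr all lie below USER_LIMIT.
 * The comparison is against the room remaining below the limit, so a huge
 * len cannot make the sum wrap around and appear to be a small address.
 */
bool user_range_ok(uint32_t addr, uint32_t len)
{
    if (addr > USER_LIMIT)
        return false;
    return len <= USER_LIMIT - addr;
}
```

With this version, a 1-byte buffer at 0xBFFFFFFF passes, a 2-byte buffer at the same address fails, and a buffer at 0x1000 with length 0xFFFFFFFF fails instead of wrapping around.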

How do you make sure that the address is legal? You could write the code to do the check for the boundary 0xC0000000 yourself, but then your system call implementation will contain assumptions that might change across versions of the Linux kernel; if later versions of the Linux kernel decide to add a third segment, or simply change the boundary between user and kernel segments by a few bytes, your code will suddenly become buggy. (In fact, this web page may already be out of date with respect to the current kernel, but the essence of the model is still correct.)

Instead, take a look at what other system call implementations (such as sys_gettimeofday()) do: they make use of various convenience routines to copy bytes to/from the user-level. These convenience routines do all of the checking on your behalf, using another convenience routine called access_ok(). Thus, if the kernel implementation changes, the kernel designers will modify these convenience routines to be correct, and your code will continue to work.
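
To show the shape of that pattern, here is a user-space simulation of a system call that hands a result back through a checked copy routine. In the real kernel the routines are copy_to_user() and copy_from_user(), and, as far as I know, they return the number of bytes that could not be copied, with callers typically returning -EFAULT on a nonzero result. Everything prefixed sim_ below is a hypothetical stand-in I invented so the example can run as an ordinary program: the "user memory" is a single fake page at a made-up address, and the explicit bounds test on that page stands in for the kernel's transparent handling of unmapped addresses.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define USER_LIMIT 0xC0000000u   /* assumed user/kernel boundary */
#define SIM_BASE   0x08048000u   /* made-up address of the one mapped page */
#define SIM_SIZE   4096u

static_assert(SIM_BASE + SIM_SIZE <= USER_LIMIT, "fake page must be in user segment");

/* Backing store for the single simulated user page. */
unsigned char sim_user_mem[SIM_SIZE];

/* Stand-in for access_ok(): the whole range must lie below USER_LIMIT. */
int sim_access_ok(uint32_t addr, uint32_t len)
{
    return addr <= USER_LIMIT && len <= USER_LIMIT - addr;
}

/* Stand-in for copy_to_user(): returns the number of bytes NOT copied. */
uint32_t sim_copy_to_user(uint32_t to, const void *from, uint32_t n)
{
    if (!sim_access_ok(to, n))
        return n;
    /* Models the "is it actually mapped?" check the real kernel does for you. */
    if (to < SIM_BASE || n > SIM_SIZE || to - SIM_BASE > SIM_SIZE - n)
        return n;
    memcpy(&sim_user_mem[to - SIM_BASE], from, n);
    return 0;
}

struct sim_timeval { int32_t tv_sec; int32_t tv_usec; };

/* Hypothetical system call body, loosely in the style of sys_gettimeofday(). */
int sim_sys_gettime(uint32_t user_tv)
{
    struct sim_timeval now = { 1100000000, 0 };   /* arbitrary placeholder time */

    if (sim_copy_to_user(user_tv, &now, (uint32_t)sizeof now) != 0)
        return -EFAULT;
    return 0;
}
```

Notice that sim_sys_gettime() never mentions 0xC0000000 itself. All knowledge of the boundary lives in the copy routine, which is the property you want your own system call implementations to have.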