Linking and Loading: Lecture Highlights

The Problem

had machine instructions that encoded "absolute addresses" - the location in memory of operands and/or branch targets were written directly into instructions.
assumed while assembling that the program would be the only thing in memory, and that it would be loaded starting at address 0.

The second made it possible to live with the first - we could figure out at assembly time what address an operand was at, say, simply by pretending to load the entire program into memory and seeing where the operand ended up.
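To make the "pretend to load the program" idea concrete, here is a minimal sketch in Python. The toy program, its one-word-per-instruction layout, and the function names are all assumptions made up for illustration; this is not the SSI-x instruction format.

```python
# Hypothetical toy program: (label or None, instruction text).
# Assumption: every instruction or data item occupies exactly one memory word.
PROGRAM = [
    (None,   "load x"),
    ("loop", "add one"),
    (None,   "jump loop"),
    ("x",    ".word 5"),
    ("one",  ".word 1"),
]

def assign_addresses(program, load_address=0):
    """Pretend to load the program at load_address; record where each label lands."""
    addresses = {}
    for offset, (label, _text) in enumerate(program):
        if label is not None:
            addresses[label] = load_address + offset
    return addresses

def assemble(program, load_address=0):
    """Replace each symbolic operand with the absolute address of its label."""
    addresses = assign_addresses(program, load_address)
    assembled = []
    for _label, text in program:
        op, operand = text.split()
        # Operands that are not labels (e.g. the literal in ".word 5") pass through.
        assembled.append((op, addresses.get(operand, operand)))
    return assembled

ADDRESSES = assign_addresses(PROGRAM)
ASSEMBLED = assemble(PROGRAM)
```

With the default load address of 0, the label x lands at word 3, so the first instruction becomes a load from absolute address 3. That number is only right if the program really is loaded at 0.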

In a real system, a program isn't the only thing in memory - the OS is there, plus probably many other programs that are running concurrently. So, real programs must be relocatable - able to run correctly no matter what address range in memory they end up being loaded into.
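A small sketch of what goes wrong when a program with absolute addresses is not loaded where the assembler assumed. The memory model, the image contents, and the helper names are hypothetical and chosen only to show the effect.

```python
def load(memory, image, base):
    """Copy the assembled image into memory starting at base, unchanged."""
    memory[base:base + len(image)] = image

# Hypothetical image assembled under the "loaded at address 0" assumption:
# word 0 is an instruction whose operand lives at absolute address 2.
IMAGE = [("load", 2), ("halt", None), 42]

def run_first_load(memory, base):
    """Execute only the first instruction found at base; return the value it fetches."""
    op, address = memory[base]
    assert op == "load"
    return memory[address]  # the absolute address ignores where we were loaded

memory = [0] * 16
load(memory, IMAGE, base=0)
at_zero = run_first_load(memory, base=0)    # reads memory[2], the real operand

memory = [0] * 16
load(memory, IMAGE, base=8)
at_eight = run_first_load(memory, base=8)   # still reads memory[2], now unrelated data
```

Loaded at 0, the instruction finds its operand. Loaded at 8, the same instruction still reads address 2, which now belongs to whatever else is in memory there.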

The problem: How can we make that happen?
Solution overview: A cooperation among the assembler, linker, and loader, plus some help from the ISA.

The Portions of the Solution Talked About Today

We'll use an example in Cebollita. Real systems have more options, and so are more complicated, but the basic ideas are the same.

Here's what happens. (The file names in this image are links...)

Compiling transforms a program written in C to a functionally equivalent assembler program. Names in the C program just become names in the assembler program. (Compare extfunc.c and extfunc.s, for instance.)

Assembling transforms the symbolic instructions into hex. In Cebollita, all names are left unresolved, however. (Have a look at any of the branch instructions in main.s -- they don't have targets in main.o.) As well as containing the hex for the instructions, the .o ("object") files contain two important tables mapping names to addresses. The first, called SymTab in Cebollita, gives the address of each symbol defined in the file that produced it (e.g., main.s) - that is, it gives the offset within that file. (In fact, each file produces two segments: text contains instructions, and data contains data. The value of a symbol is its offset within the segment in which it is allocated - that's what SymTab shows.)
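As a rough illustration of what a SymTab-like table holds, the sketch below walks a simplified source listing and records each label's offset within its own segment. The listing, the symbol names, and the assumption that every item is 4 bytes are invented for the example; Cebollita's actual file layout is not reproduced here.

```python
WORD = 4  # assumption: every instruction and every data item is 4 bytes

# Hypothetical, simplified source: one entry per item, as (segment, label or None).
SOURCE = [
    ("text", "main"),
    ("text", None),
    ("text", "helper"),
    ("data", "counter"),
    ("data", "buffer"),
]

def build_symtab(source):
    """Map each label to (segment, offset within that segment)."""
    next_offset = {"text": 0, "data": 0}
    symtab = {}
    for segment, label in source:
        if label is not None:
            symtab[label] = (segment, next_offset[segment])
        next_offset[segment] += WORD
    return symtab

SYMTAB = build_symtab(SOURCE)
```

Note that counter gets offset 0 even though it appears after three text items: offsets restart in each segment, which is why a symbol's value only makes sense together with its segment.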

The other table, RefTab, is a list of all the places where symbols were used. Those places need to be "fixed up" by having appropriate addresses inserted into them sometime before execution can occur. For example, in main.o the RefTab says that there is a use of symbol localfunc at offset 144 (decimal, and in the text segment, although that information isn't actually displayed). That's offset 0x90, which in main.o contains

00000090 0x0c000000: jal 0x0

extfunc
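The unresolved instruction above can be picked apart with a few lines of Python. This sketch assumes a MIPS-style jump layout (6-bit opcode in the top bits, 26-bit target field below it), which is consistent with the hex shown; the table representations are hypothetical.

```python
TARGET_MASK = 0x03FFFFFF  # low 26 bits: the jump target field (assumed layout)

def decode_jump(word):
    """Split a 32-bit jump instruction into (opcode, target field)."""
    return word >> 26, word & TARGET_MASK

# A RefTab-like entry as described in the text: symbol name and byte offset of its use.
reftab = [("localfunc", 144)]

# The one word of main.o's text segment quoted above, keyed by byte offset.
text_words = {0x90: 0x0C000000}

symbol, offset = reftab[0]
opcode, target = decode_jump(text_words[offset])
```

The target field comes out as 0: the assembler left a blank for the linker to fill in, and the RefTab entry is the note saying where the blank is and which symbol belongs in it.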

Next comes the linker. It is provided with all the .o files needed to create the executable. It (a) decides on an order in which to place the .o's, then (b) using the SymTabs in each .o, computes new offsets for each symbol within this aggregation of the .o's, and then (c) using the RefTabs, goes back and fixes up all the instructions that were waiting to have references resolved.
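Steps (a) through (c) can be sketched as follows. This is a simplified model, not Cebollita's linker: it handles only a text segment, assumes every reference is a MIPS-style jump whose 26-bit target field counts 4-byte words, and uses made-up object files and placeholder instruction words.

```python
TARGET_MASK = 0x03FFFFFF

def link(objects):
    """Link in the order given; return (image as a list of words, symbol addresses)."""
    # (a) Order: this sketch simply keeps the order of the argument list.
    # (b) Give each object a base address and relocate its symbols by that base.
    symbols = {}
    base = 0
    for obj in objects:
        for name, offset in obj["symtab"].items():
            symbols[name] = base + offset
        base += 4 * len(obj["text"])
    # (c) Patch every recorded use with the final address of its symbol.
    image = []
    for obj in objects:
        words = list(obj["text"])
        for name, offset in obj["reftab"]:
            index = offset // 4
            target = (symbols[name] // 4) & TARGET_MASK
            words[index] = (words[index] & ~TARGET_MASK) | target
        image.extend(words)
    return image, symbols

# Hypothetical objects; 0x11 and 0x22 are placeholder instruction words.
A_O = {"text": [0x11, 0x22], "symtab": {"f": 0}, "reftab": []}
B_O = {
    "text": [0x0C000000, 0x0C000000, 0x0],
    "symtab": {"main": 0, "g": 8},
    "reftab": [("g", 0), ("f", 4)],
}

IMAGE, SYMBOLS = link([A_O, B_O])
```

Because a.o is placed first and is two words long, everything in b.o shifts up by 8 bytes: g moves from offset 8 to address 16, so the jump that refers to it gets target field 4.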

In a.out, extfunc.o has been loaded ahead of main.o. The instruction that is at offset 0x90 in main.o is now at offset 0xb4. The instruction there, after having been fixed (also called "patched") by the linker is

000000b4 0x0c000009: jal 0x9

localfunc
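The numbers in this example can be checked with a little arithmetic. The sketch assumes, as on MIPS, that the jal target field counts 4-byte words, and that the shift from 0x90 to 0xb4 is entirely due to text placed ahead of main.o's text; neither assumption is stated in the listing itself.

```python
OFFSET_IN_MAIN_O = 0x90   # where the jal sits inside main.o's text
OFFSET_IN_A_OUT = 0xB4    # where the same jal sits inside a.out
PATCHED_WORD = 0x0C000009

# How far main.o's text was shifted by what the linker placed ahead of it.
main_text_base = OFFSET_IN_A_OUT - OFFSET_IN_MAIN_O

# Byte address the patched jal refers to (target field is in words, by assumption).
target_word = PATCHED_WORD & 0x03FFFFFF
target_byte = target_word * 4
```

Both values come out to 0x24. Under the stated assumptions, that means the jal now points at the very start of main.o's text within a.out.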

In Cebollita, all instructions are patched when the linker completes. That's because, as with SSI-x, it assumes the a.out file will be loaded into memory starting at address 0. If that weren't the case (as in the diagram above), some instructions would still need patching - the jal above, for instance, couldn't be resolved by the linker, because the actual address in memory of localfunc wouldn't be known until load time. The figure above shows this case.
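A sketch of what the remaining load-time patching could look like if the executable were not loaded at address 0. This is not something Cebollita does. The fixup list, the load address 0x1000, and the function name are hypothetical, and the sketch again assumes a MIPS-style 26-bit word-addressed jump target and ignores the limit on how far such a jump can reach.

```python
TARGET_MASK = 0x03FFFFFF

def relocate(image, fixups, load_base):
    """Return a copy of image with each listed jump adjusted for load_base.

    image     -- list of 32-bit words, linked as if loaded at address 0
    fixups    -- byte offsets of jump instructions that hold absolute targets
    load_base -- byte address where the image is actually loaded (multiple of 4)
    """
    patched = list(image)
    for offset in fixups:
        index = offset // 4
        word = patched[index]
        target = ((word & TARGET_MASK) + load_base // 4) & TARGET_MASK
        patched[index] = (word & ~TARGET_MASK) | target
    return patched

# Hypothetical image: all zero words except the jal from the example at offset 0xb4.
image = [0] * 46
image[0xB4 // 4] = 0x0C000009

loaded = relocate(image, fixups=[0xB4], load_base=0x1000)
```

After relocation the target field is 0x409, i.e. byte address 0x1024, which is the old target 0x24 moved up by the load address. The loader can only do this if the executable still carries a table saying which words need adjusting.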

What We Haven't Talked About Yet

The role of the ISA in all this
Local vs. global variables (in the C sense) - where are they allocated?
How procedure calls are done

Want More?

Real systems must satisfy a wide range of demands, and so are more complicated. Here's a description of a real .o and .exe file format, Microsoft's Common Object File Format, in case you want to see all the details. (The keyword in Unix systems is "elf"; I haven't found a clear description of that format, though.)

zahorjan@cs.washington.edu