CSE 413 Spring 2000

Notes on Code Generation for D

Overview

This is a sketch of the assembly language code that needs to be generated for the different kinds of statements and expressions in D. You'll need to work out the details, but, with luck, this makes the overall organization of code generation in the compiler reasonably clear.

Here are a few general suggestions to keep in mind while implementing code generation.

Watch for things that need to be done repeatedly during code generation and define methods to do those things. One that is particularly useful is a method that writes a new line of generated code to the assembly language file:
```
   // write code string s to the assembly language output file
   void gen(String s) { ... }
```
This method probably just calls an appropriate println() function to write s. While you could call println() directly throughout the compiler, isolating the actual output statement(s) in a separate method isolates the rest of the compiler from details about the output stream.
Generate code as you parse the D program. Don't save up multiple lines of assembler output code in a huge string.
Don't worry about generating instructions that reload a just-stored variable, or do other stupid things. We're trying to get running, correct code.

Stack Frames and Register Usage

Use the regular x86 Win32 C calling and register conventions (see notes on x86 assembly language), with one major exception. It will simplify things if function arguments are pushed on the stack from left to right, not right to left as in the standard C calling conventions.

To simplify the handling of expressions, use the following rule: the code generated for any expression or subexpression (including a function call) must leave the result in register eax. A real compiler would have to make much more sophisticated use of the available registers, but this simplification will help keep code generation for this project from getting too complex.

Declaration Processing

Functions (Global Symbol Table)

There really isn't much processing to do here, because we will let the assembler and loader handle details of resolving labels in the code to actual memory locations. The main reason for the global symbol table is to allow checks for things like multiple definitions of functions, inconsistent number of parameters in different calls to the same function, etc. (to the extent that you want to do this).

Parameters and Local Variables (Local Symbol Table)

The key thing needed is to determine offsets in the function's stack frame for each parameter and local variable. This information should be stored in the local symbol table for the current function. When a variable or parameter is encountered in a D statement or expression, use the offset found in the symbol table entry to create the address needed to reference the variable or parameter [ebp+offset].

Expressions and Assignment

Expressions

The code for a binary arithmetic expression needs to evaluate both operands, then calculate the result and leave it in eax. Since evaluation of each argument changes eax, overwriting whatever was there previously, we need to temporarily store the left argument while the right one is evaluated. Pushing it onto the stack is the most straightforward way to do this. Once the right argument is in eax, pop the left argument (don't leave it on the stack), and compute the result.

Example: Code generation in term() for a multiplication operation will look something like this:

   factor();             // compile left operand; result in eax 
   gen(push eax);        // push onto stack
   factor();             // compile right operand; result in eax 
   gen(pop ecx);         // pop left operand from stack 
   gen(imul eax,ecx);    // final product in eax

Gotchas: Watch out for non-commutative arithmetic operands. When compiling a-b, be sure to generate code that calculates a-b, not b-a. Also, be sure that everything that is pushed onto the stack is popped eventually. Extra junk left on the stack will create problems when the compiled code runs.

Factor

There are several kinds of factors. Regardless of which one is being compiled, the generated code should leave the result in eax.

The trivial case is when the factor is a parenthesized expression. The recursive call to exp() that compiles the enclosed expression will generate all of the necessary code.

An integer constant is almost as simple. Generate a mov instruction to load the constant into eax.

The interesting case is how to handle a factor that begins with an identifier. Answer: If the identifier is a name in the local symbol table, meaning it is a parameter or local variable, generate code to load it into eax. Otherwise, assume it is a function call. Generate code to evaluate and push arguments, then call it (details below).

Assignment Statement id = exp;

   exp();                     // compile exp; result in eax 
   gen(mov [ebp+offset],eax); // copy result into variable or parameter

Function Definitions (including main)

Code generated for a function definition consists of a prologue to set up the stack frame, followed by the code for the statements in the function body. The routine that compiles a function-def does not need to generate code to return from the function; that's handled in the routine that compiles return statements.

Code for the beginning of a function definition

   gen(function_label:);            // generate label for function body
   gen(push ebp);                   // save ebp and set up frame pointer
   gen(mov ebp,esp); 
   gen(sub esp,#_bytes_for_locals); // allocate space for locals

   <generated code for function body here>

Function Labels

For simplicity, it would be nice if we could use the name of the function in the D source program to label the corresponding assembly language code. But there are two problems with this. First, MASM does not allow a machine instruction opcode to be used as a label. So if the D program contains functions named, for example, push, call, and mov, we can't use those names as labels in the generated code. Second, the D function main is not the real starting point for execution of the compiled code. Function main in the D program will actually be called by a C program that contains the real main (starting point), so the generated code can't be labeled main:.

For these reasons, we need to pick different names to label functions. We strongly suggest that you use the name of the original function concatenated to "d$" for the label name. So the generated code for D functions named max, pop, and glorp should be labeled d$max:, d$pop:, and d$glorp: in the MASM code. The generated code for D function main must be labeled d$main:.

Return Statement

Code generation for a return statement has two parts. First, compile the expression following keyword return by calling the exp() method in the compiler. When that code is executed at runtime, the value of the return expression will wind up in eax, which is where it needs to be. Then we can generate the actual function return. This code needs to restore ebp and esp, followed by a ret instruction. Note that the caller is responsible for popping the arguments, so we don't need to know how many arguments the function has when we compile return.

   exp();              // compile expression, leaves result in eax
   gen(mov esp,ebp);   // free local vars 
   gen(pop ebp);       // restore frame pointer 
   gen(ret);           // return to caller

Function Call

The code to call a function should push the arguments, then jump to the label at the beginning of the function's code. After the function returns, pop the arguments from the stack.

   // repeat the following for each function argument:
   exp();               // compile expression, leaves result in eax
   gen(push eax);

   // generate the actual function call 
   gen(call function_label);              // call function 
   gen(add esp,#_bytes_of_arguments);     // pop arguments

Standard Functions

The get and put functions are called like any others. Insert get and put into the global (function) symbol table by hand when initializing the parser. Once that's done, calls to these functions should compile exactly like any ordinary function. You should use get and put as the function names in the generated call instructions (not d$get and d$put). These are regular C functions defined elsewhere, not in your compiled code (details below).

Control Flow

New Label Generator

To implement the control-flow statements (if and while), we need to generate combinations of conditional and unconditional jumps to labels in the assembly code. Since a label can only appear once in a MASM program, each time we compile one of these statements we need a new set of label names.

We suggest that you include in your compiler a function that yields a new, distinct label name each time it is called.

   // yield a different label every time this method is evaluated
   String newLabel() { ... }

The idea is that each time newLabel() is executed, it should return a new string, say "L1", "L2", "L3", etc. The compiler routines for control flow can call it as needed to get unique labels.

while

To compile while (bool-exp) stmt, create two new labels, say L1 and L2, then generate code as follows:

   gen(L1:);               // generate label at top of loop  
   boolExp(...);           // compile condition  
   jmp if false to L2 
   stmt();                 // compile loop body 
   gen(jmp L1);            // branch to top of loop
   gen(L2:);               // generate label following loop

if

To compile if (bool-exp) stmt1 else stmt2, generate two new labels, say L3 and L4, then generate the following

   boolExp(...);          // compile condition 
   jmp if false to L3 
   stmt();                // compile stmt1 
   gen(jmp L4);		  // skip past stmt2
   gen(L3:);              // generate label at start of stmt2
   stmt();                // compile stmt2 
   gen(L4:);              // generate label following if statement

If there's no else part, the code for stmt2, L4, and the jmp to L4 should be omitted.

Conditional Expressions

The last piece of the puzzle is compiling boolean expressions. We'll sketch out here how this can be done for while loops. if statements can be handled similarly.

The basic idea is that the compiler methods for bool-exp and rel-exp need an additional parameter - the target label for the conditional jump. These routines then generate code to evaluate the conditional expression and generate a jump to the target label if the condition is false. Suppose we are translating

   while (x == y) stmt

If the label at the end of the loop is, say, "L123", then relExp("L123") should do something like this to compile the condition:

   exp();             // compile left operand (x); result in eax
   gen(push eax);     // save left operand on stack
   exp();             // compile right operand (y); result in eax
   gen(pop ecx);      // pop left operand into ecx
   gen(cmp ecx,eax);  // compare operands
   gen(jne L123);     // jump if condition false

That handles simple conditions. The next step is handling the logical not (!) operator. Suppose the condition in the loop is

   while (!(x == y)) stmt

and the target label at the bottom of the loop is again "L123". We need to generate exactly the same code as above, except that the sense of the test is reversed. Instead of jne to exit the loop if the expressions (x and y) have different values, we need to generate je to exit if the values are equal.

So relExp("L123") needs to generate a "jump if false" if the loop condition is simple, and a "jump if true" if the loop condition contains a logical not (!). We can get this flexibility by adding an extra parameter to relExp to control the sense of the test. Here is one possibility:

   // Compile rel-exp ::= exp==exp | exp>exp
   // If invert is false, generate a "jump false" to label
   // If invert is true, generate a "jump true" to label
   void relExp(String label, boolean invert) { ... }

The compiler function that processes bool-exp can then call relExp() with the second parameter value false if the condition is simple, and with the second parameter true if the bool-exp contains a logical not (!).