CSE 413 Autumn 2007

Notes on Code Generation for D

Overview

This is a sketch of the assembly language code that needs to be generated for the different kinds of statements and expressions in D. You'll need to work out the details, but, with luck, this makes the overall organization of code generation in the compiler reasonably clear.

Most of this discussion uses an approximation of the standard Intel assembler syntax. In particular, generic instruction names like mov are used instead of gcc names that include the data length like movl, and two-operand instructions show the destination on the left instead of the right. Information about the format needed by the GNU assembler as, which is used by gcc, appear in a section towards the end.

Here are a few general suggestions to keep in mind while implementing code generation.

Watch for things that need to be done repeatedly during code generation and define functions to do them. One useful idea is functions that write a new line of generated code to the assembly language file:
```
   /* write code string s and a comment to the assembly language output file */
   void gen(char* s, char* comment) { ... }

   /* write an instruction with operation op and operands src and dst followed by a comment to the output */
   void gen2opinstr(char* op, char* src, char* dst, char* comment) { ... }
```
These functions probably just call an appropriate printf() function to write the data. While you could call printf() directly throughout the compiler, putting the actual output statement(s) in separate functions isolates the rest of the compiler from details of the output stream, and makes it easier to make changes if necessary. These are just suggestions, write what works best for you. (You are not required to use the routines from io.c, and you are also free to modify the routines in io.c.)
It may also be useful to create more functions than you had in your parser originally. For example specialized functions to generate code for function calls or definitions, variable assignments or uses, etc.
Generate code as you parse the D program. Don't save up multiple lines of assembler output code in a huge string.
Don't worry about generating instructions that reload a just-stored variable, push something that was just popped, or do other fancy things. The goal here is to generate correct code, not optimize it.
Also, don't worry if some of the instructions for a particular line of source code are generated after the next source line has been read and printed in the assembly language code as a comment.

Stack Frames and Register Usage

Use the regular x86 Win32 C calling and register conventions (see the notes on x86 assembly language), with one major exception. It will simplify the implementation if function arguments are pushed on the stack from left to right, not right to left as in the standard C calling conventions.

To simplify the handling of expressions, use the following invariant: when it is executed at runtime, the code generated for any expression or subexpression (including a function call) must leave the result in register eax. A real compiler would make much more sophisticated use of the available registers, but this rule will greatly simplify code generation for this project.

Declaration Processing

Functions (Global Symbol Table)

There really isn't much processing to do here, because we will let the assembler and loader handle details of resolving labels in the code to actual memory locations. The main reason for the global symbol table is to allow checks for things like multiple definitions of functions, inconsistent number of parameters in different calls to the same function, etc. (to the extent that you want to do this).

Parameters and Local Variables (Local Symbol Table)

The key thing needed is to determine offsets in the function's stack frame for each parameter and local variable. Either the actual offset or information you will need to calculate the offset should be stored in the local symbol table for the current function. When a variable or parameter is encountered in a D statement or expression, use the offset found in the symbol table entry to generate the address needed to reference the variable or parameter, i.e., ebp+offset.

Expressions and Assignment

Expressions

The code for a binary arithmetic expression needs to evaluate both operands, then calculate the result and leave it in eax. Since evaluation of each operand changes eax, overwriting whatever was there previously, we need to temporarily store the left operand while the right one is evaluated. Pushing it onto the stack is the most straightforward way to do this. Once the right operand is in eax, pop the left operand (don't leave it on the stack), and compute the result.

Example: Code generation in term() for a multiplication operation will look something like this:

   factor();             // compile left operand; result in eax 
   gen(push eax);        // push onto stack
   factor();             // compile right operand; result in eax 
   gen(pop ecx);         // pop left operand from stack 
   gen(imul eax,ecx);    // final product in eax

Gotchas: Watch out for non-commutative arithmetic operations. When compiling a-b, be sure to generate code that calculates a-b, not b-a. Also, be sure that everything that is pushed onto the stack is popped eventually. Extra junk left on the stack will create havoc when the compiled code is executed.

Factor

There are several kinds of factors. Regardless of which one is being compiled, the generated code should leave the result in eax.

The trivial case is when the factor is a parenthesized expression. The recursive call to exp() that compiles the enclosed expression will generate all of the necessary code.

An integer constant is almost as simple. Generate a mov instruction to load the constant into eax.

The interesting case is how to handle a factor that begins with an identifier. Answer: If the identifier is a name in the local symbol table, meaning it is a parameter or local variable, generate code to load it into eax. Otherwise, assume it is a function call. Generate code to evaluate and push arguments, then call it (details below). In our case you can also peek ahead to see if a LPAREN is coming next.

Assignment Statement id = exp;

   exp();                     // compile exp; result in eax 
   gen(mov [ebp+offset],eax); // copy result into variable or parameter

Function Definitions (including main)

Code generated for a function definition consists of a prologue to set up the stack frame, followed by the code for the statements in the function body. The function that compiles a function-def should not generate code to return from the function; that should be handled in the function that compiles return statements.

Code for the beginning of a function definition

   gen(function_label:);            // generate label for function body
   gen(push ebp);                   // save ebp 
   gen(mov ebp,esp);                // set up frame pointer
   gen(sub esp,#_bytes_for_locals); // allocate space for locals

   <generated code for function body follows>

Function Labels

For the most part, we can use the name of the function as a label in the assembly language code. However, the D function main is not the real starting point for execution of the compiled code. Function main in the D program will actually be called by a C program that contains the real main (starting point), so the generated code for the D main function can't use that name. Instead, the main function should begin with the label _d$main: so it matches the name we specified in the bootstrap code in druntime.c. The prepended underscore character on _d$main is required for functions we will call from our C bootstrap program when using cygwin (which is only _d$main). Otherwise you can name the labels for functions other than main anything you would like (probably the name of the function is a good choice: foo, bar, etc. ).

Return Statement

Code generation for a return statement has two parts. First, compile the expression following keyword return by calling the exp() function in the compiler. When that code is executed at runtime, the value of the return expression will wind up in eax, which is where it needs to be. Then we can generate the actual function return. This code needs to restore ebp and esp, then execute a ret instruction. Note that the caller is responsible for popping the arguments, so we don't need to know how many arguments the function has when we compile a return statement.

   exp();              // compile expression, leaves result in eax
   gen(mov esp,ebp);   // free local vars (restore esp)
   gen(pop ebp);       // restore frame pointer (ebp)
   gen(ret);           // return to caller

Function Call

The code to call a function should push the arguments, then jump to the label at the beginning of the function's code with a call instruction. After the function returns, pop the arguments from the stack. In general you should call the function_label exactly as you generated it (eg. If for function foo you generated the label: foo:, you should call it as "call foo"). One exception to this rule occurs when using cygwin and calling the functions get and put defined in druntime.c. To call these you should prepend an underscore to the label: call _get or call _put.

   // repeat the following for each function argument:
   exp();               // compile expression, leaves result in eax
   gen(push eax);

   // generate the actual function call 
   gen(call function_label);              // call function 
   gen(add esp,#_bytes_of_arguments);     // pop arguments

Standard Functions

The get and put functions are called like any others (with the exception of the prepended underscore noted above). Insert get and put into the global (function) symbol table by hand when initializing the parser module, before starting to parse the program. Once that's done, calls to these functions should compile exactly like any other function. You should use _get and _put as the function names in the generated call instructions. These are regular C functions defined in the bootstrap program, not in your compiled code, and will be found when the program is linked. It is probably easiest to put the names get and put into the symbol table as this is what will be found in a D program trying to call these functions. Just remember the special case of prepending the underscore when you generate code.

Control Flow

New Label Generator

To implement the control-flow statements (if and while), we need to generate combinations of conditional and unconditional jumps to labels in the assembly code. Since a label can only appear once in an assembler program, each time we compile one of these statements we need a new set of label names.

We suggest that you include in your compiler a function that yields a new, distinct label name each time it is called.

   /* store in lbl a different label every time this function is called */
   void newLabel(char* lbl) { ... }

The idea is that each time newLabel() is executed, it should store in lbl a new string, say "L1", "L2", "L3", etc. The compiler routines for control flow can call this function as needed to create unique labels.

while

To compile while (bool-exp) stmt, create two new labels, say L1 and L2, then generate code as follows:

   gen(L1:);               // generate label at top of loop  
   boolExp(...);           // compile condition  
   jmp if false to L2 
   stmt();                 // compile loop body 
   gen(jmp L1);            // branch to top of loop
   gen(L2:);               // generate label following loop

if

To compile if (bool-exp) stmt1 else stmt2, generate two new labels, say L3 and L4, then generate the following

   boolExp(...);          // compile condition 
   jmp if false to L3 
   stmt();                // compile stmt1 
   gen(jmp L4);		  // skip past stmt2
   gen(L3:);              // generate label at start of stmt2
   stmt();                // compile stmt2 
   gen(L4:);              // generate label following if statement

If there's no else part, the code for stmt2, L4, and the jmp to L4 should be omitted.

Conditional Expressions

The last piece of the puzzle is compiling boolean expressions. We'll sketch out here how this can be done for while loops. The conditions in if statements can be handled similarly.

The basic idea is that the compiler functions for bool-exp and rel-exp need an additional parameter - the target label for the conditional jump. These routines then generate code to 1) evaluate the conditional expression and 2) generate a jump to the target label if the condition is *false*. Note that for both while loops and if statements the code that is required after evaluating the conditional expression is jumping if the condition is FALSE. All other jumps needed for those statements are unconditional jmp instructions. Suppose we are translating

 while (x == y) stmt

If the label at the end of the loop is, say, "L123", then relExp("L123") should do something like this to compile the condition:

   exp();             // compile left operand (x); result in eax
   gen(push eax);     // save left operand on stack
   exp();             // compile right operand (y); result in eax
   gen(pop ecx);      // pop left operand into ecx
   gen(cmp ecx,eax);  // compare operands
   gen(jne L123);     // jump if condition false

That handles simple conditions. The next step is handling the logical not (!) operator. Suppose the condition in the loop is

   while (!(x == y)) stmt

and the target label at the bottom of the loop is again "L123". We need to generate exactly the same code as above, except that the sense of the test is reversed. Instead of jne to exit the loop if the expressions (x and y) have different values, we need to generate je to exit if the values are equal.

So relExp("L123") needs to generate a "jump if false" if the loop condition is simple, and a "jump if true" if the loop condition contains a logical not (!). We can get this flexibility by adding an extra parameter to relExp to control the sense of the test. Here is one possibility:

   // Compile rel-exp ::= exp==exp | exp>exp
   // If invert is false, generate a "jump false" to label
   // If invert is true, generate a "jump true" to label
   void relExp(char* label, int invert) { ... }

The compiler function that processes bool-exp can then call relExp() with the second parameter value false if the condition is simple, and with the second parameter true if the bool-exp contains a logical not (!).

GNU Assembler (as)

The gcc compiler generates assembly language code that is then translated to machine language by the assembler as. We can also use as to translate hand-written or, in our case, compiler-written text files containing assembly language code.

The format of an assembly language file, for the purposes of this project, is quite simple. The file should begin with the following two lines:

      .text
      .globl _d$main

Then somewhere later on in the file (not necessarily as the first function in your file) you will have the code for the main function from your D program which should begin with the label _d$main:. The underscore is required in these two places if you will be running your code with cgywin.

The first line specifies that the following lines contain code, not data, and the second line specifies that the function label _d$main should be made available globally as an external function, so it can be called by the C bootstrap routine. The rest of the code generated by your compiler should appear after these two lines.

Intel vs. gcc/as Syntax

For historical reasons, as uses a syntax for x86 assembly language that is inherited from AT&T Unix and is not the same as the standard syntax in all of the Intel and Microsoft documentation. The changes needed are slightly annoying, but not too difficult. Here is a summary of the ones we need for this project.

	Intel/Microsoft	AT&T/as
order of operands	op a,b means a = a op b (dst first)	op a,b means b = a op b (dst last)
memory addresses	[baseregister+offset]	offset(baseregister)
instruction names	`mov`, `add`, `push`, etc.	include size as a suffix; `movl`, `addl`, `pushl`, etc. for 32-bit longwords
register names	`eax`, `ebx`, `ebp`, `esp`, etc.	`%eax`, `%ebx`, `%ebp`, `%esp`, etc.
constants	`17`, `42`	`$17`, `$42`

Comments

The assembly language file can contain comments, which for this project should include (at least) the source program lines written by the io.c routines. Comments in as source files can be formatted using either the regular C /* ... */ syntax, or by placing a # character at the start of the comment. The io.c routines as given print input lines with a prefix of "# " at the beginning, which means they will be interpreted as comments in the assembly file (what we want). Feel free to add other comments to the individual assembly instructions you generate to make it easier to follow.