Notes on Code Generation for D Hal Perkins 5/99 This is a sketch of the assembly language code that needs to be generated for the different kinds of statements and expressions in D. You'll need to work out the details, but I hope this makes the general organization reasonably clear. There are some suggestions at the end about how to get started and what to do first. Useful Methods for Code Generation You'll find it handy to add a couple of methods to your parser/code generator to take care of some details needed throughout the compiler. Suggestions: // write code string c to the assembly language output file void gen(String c) { ... } // yield a different label every time this method is evaluated String newLabel() { ... } To implement the control-flow statements (if, while), we need to generate conditional or unconditional jumps to different labels for each new loop or conditional statement. The idea behind newLabel is that each time it is called, it should return a new string, say "L1", "L2", "L3", etc., that can be used in the assembled code. Rule about Expression Results To simplify things, generated code should always ensure the following: The result of any expression or subexpression, or the value returned by a function, should wind up in eax after the code for that expression, subexpression, or function call has finished execution. Stack Frames and Register Usage Use the regular x86 Win32 C calling and register conventions (see notes on x86 assembly language), with one major exception. If it simplifies things, feel free to push function arguments onto the stack from left to right as you compile them, instead of the standard right to left. Declaration Processing Functions (Global Symbol Table) There really isn't much processing to do here, because we will let the assembler/loader handle issues of resolving labels in the code to actual memory locations. The main reason for the global symbol table is to enable checks for things like multiple definitions of functions, inconsistent number of parameters in different calls to the same function, etc., to the extent that you want to do these. Parameters and Local Variables (Local Symbol Table) The key thing needed is to determine offsets in the function's stack frame for each parameter and local variable. This information should be stored in the local symbol table for the current function. When a variable is encountered in a D statement or expression, use the offset found in the symbol table to generate the address needed to reference the variable [ebp+offset]. Function Definitions (including main) Code generated for a function definition consists of a prologue to set up the stack frame, followed by the code for the statements in the function body. One particular note, the routine that compiles a function-def does not need to worry about the code generated to return from the function; that's handled in the return statement. Code generated for a function definition: gen(function_label); // label first instruction gen(push ebp) // save and set up frame ptr gen(mov ebp,esp) gen(sub esp,#_bytes_for_locals) // allocate space for locals Return Statement The code for a return statement needs to restore ebp and esp, then emit a return instruction. Note that the caller is responsible for popping the arguments, so we don't need to know how many arguments the function has to compile return. Code generated for the return expression should leave the result in eax, where we want it, so the actual code generated by the return statement doesn't need to do anything additional to handle the result. generate code for return exp // leave return value in eax gen(mov esp,ebp); // free local vars gen(pop ebp); // restore frame pointer gen(ret); Function Call The code to call a function needs to push the arguments, then jump to the label at the beginning of the code generated for the function. After the function returns, pop the arguments from the stack. evaluate function arguments (exps) and push them; gen(call function_label); // call function gen(add esp,#_bytes_of_arguments); // pop arguments Function Labels One gotcha here. In MASM, the name of a legal opcode cannot be used as a label in the assembly language program. That restriction is used by MASM to decide if the first thing it encounters on an input line is a label or an opcode without a preceding label. That means we can't generate function labels that have the same name as an opcode. An easy way to avoid any conflict is to create an assembly language label by combining the D function name with some unique suffix. A suitable label for a function named glorp might be glorp$D. Exception: The label generated at the beginning of function main must be D_main, not main, or main$D, or anything else. Standard Functions The get and put functions are called like any others. Add get and put to the global (function) symbol table by hand when initializing the parser. Once that's done, calls to these functions should compile exactly like any ordinary function. Assignment Statement id = exp; exp(); // compile exp; result in eax gen(mov [ebp+offset],eax // copy result into id location Expressions The code for a binary expression needs to evaluate both operands, then calculate the result and leave it in eax. Since evaluation of each argument will leave a value in eax, overwriting whatever was there previously, we need to temporarily store the left argument while the right one is evaluated. Pushing it onto the stack is the easiest way to handle this Once the right argument is in eax, pop the left argument (don't leave it on the stack), and compute the result. Example: Compile factor*factor: factor(); // calculate left operand in eax gen(push eax); factor() // right operand in eax gen(pop ecx); // remove left operand from stack gen(imul eax,ecx); // generate product in eax Gotcha: Watch out for non-commutative arithmetic operands i.e., a-b, and be sure to calculate a-b, not b-a. Factor The tricky case here is how should the compiler handle an id? Answer: If it's a name in the local symbol table, meaning it is a parameter or local variable, generate code to load it into eax. Otherwise, assume it is a function name and generate code to evaluate and push its arguments, then call it. Code generated for an integer constant should just load the constant into eax (use mov). Control Flow This is where we need new, unique labels in the code generated for each while or if. While For the statement while (cond) stmt, create two new labels L1 and L2, then produce the following: L1: evaluate cond jmp if false to L2 jmp L1 L2: If For the statement if (cond) stmt1 else stmt2, generate two new labels La and Lb, and generate the following evaluate cond jmp if false to La jmp Lb La: Lb: If there's no else part, the code for stmt2, Lb, and the jmp to Lb should be omitted. Conditional Expressions The last piece of the puzzle is how to handle conditional expressions, which appear in if and while statements. The basic idea is that to compile either bool-exp and rel-exp, we need two pieces of information: the label that is the target of the conditional jump, and a boolean argument, when, that controls the sense of the jump. If when is false, the compiled jump should be taken if the condition evaluates to false. That's the basic case. However, if when is true, the jump should be taken if the condition evaluates to true. How could when every be true? For a simple condition, it isn't. If we have while(a=b, i.e., when would be false here, indicating jump if the condition is false. Now look at while(!a