This is a sketch of the assembly language code that needs to be generated for the different kinds of statements and expressions in D. You'll need to work out the details, but, with luck, this makes the overall organization of code generation in the compiler reasonably clear.
Here are a few general suggestions to keep in mind while implementing code generation.
// write code string s to the assembly language output file void gen(String s) { ... }This method probably just calls an appropriate
println()
function to write s
. While you could call println()
directly throughout the compiler, isolating the actual output statement(s) in a
separate method isolates the rest of the compiler from details about the output
stream.Use the regular x86 Win32 C calling and register conventions (see notes on x86 assembly language), with one major exception. It will simplify things if function arguments are pushed on the stack from left to right, not right to left as in the standard C calling conventions.
To simplify the handling of expressions, use the following rule: the code
generated for any expression or subexpression (including a function call) must
leave the result in register eax
. A real compiler would
have to make much more sophisticated use of the available registers, but this
simplification will help keep code generation for this project from getting too complex.
There really isn't much processing to do here, because we will let the assembler and loader handle details of resolving labels in the code to actual memory locations. The main reason for the global symbol table is to allow checks for things like multiple definitions of functions, inconsistent number of parameters in different calls to the same function, etc. (to the extent that you want to do this).
The key thing needed is to determine offsets in the function's stack frame
for each parameter and local variable. This information should be stored in the
local symbol table for the current function. When a variable or parameter is encountered in a
D statement or expression, use the offset found in the symbol table entry to
create the address needed to reference the variable or parameter [ebp+offset]
.
The code for a binary arithmetic expression needs to evaluate both operands, then
calculate the result and leave it in eax
. Since evaluation of each argument changes eax
, overwriting whatever was there previously, we need to
temporarily store the left argument while the right one is evaluated. Pushing it
onto the stack is the most straightforward way to do this. Once the right argument is in
eax
, pop the left argument (don't leave it on the stack), and compute the
result.
Example: Code generation in term()
for a multiplication
operation will look something like this:
factor(); // compile left operand; result in eax gen(push eax); // push onto stack factor(); // compile right operand; result in eax gen(pop ecx); // pop left operand from stack gen(imul eax,ecx); // final product in eax
Gotchas: Watch out for non-commutative arithmetic operands. When
compiling a-b
, be
sure to generate code that calculates a-b
, not b-a
. Also, be sure that
everything that is pushed onto the stack is popped eventually. Extra junk
left on the stack will create problems when the compiled code runs.
There are several kinds of factors.
Regardless of which one is being compiled, the generated code should leave the result in eax
.
The trivial case is when the factor is a parenthesized expression. The
recursive call to exp()
that compiles the enclosed expression will generate all of the necessary
code.
An integer constant is almost as simple. Generate a mov
instruction to load the constant into eax
.
The interesting case is how to handle a factor that begins with an
identifier. Answer: If the identifier is
a name in the local symbol table, meaning it is a parameter or local variable,
generate code to load it into eax
. Otherwise, assume it is a function
call. Generate code to evaluate and push arguments, then call it (details
below).
exp(); // compile exp; result in eax gen(mov [ebp+offset],eax); // copy result into variable or parameter
Code generated for a function definition consists of a prologue to set up the
stack frame, followed by the code for the statements in the function body. The routine that compiles a
function-def does not need to generate code to return from the function; that's handled in the
routine that compiles return
statements.
gen(function_label:); // generate label for function body gen(push ebp); // save ebp and set up frame pointer gen(mov ebp,esp); gen(sub esp,#_bytes_for_locals); // allocate space for locals <generated code for function body here>
For simplicity, it would be nice if we could use the name of the function in
the D source program to label the corresponding assembly language code. But
there are two problems with this. First, MASM does not allow a machine
instruction opcode to be used as a label. So if the D program
contains functions named, for example, push
, call
, and
mov
, we can't use those names as labels in the generated
code. Second, the D function main
is not the real starting
point for execution of the compiled code. Function main
in the
D program will actually be called by a C program that contains the real main
(starting
point),
so the generated code can't be labeled main:
.
For these reasons, we need to pick different names to label functions. We strongly suggest that you use the name of the original
function concatenated to "d$" for the label name. So the
generated code for D functions named max
, pop
, and glorp
should be labeled d$max:
, d$pop:
, and d$glorp:
in the MASM code. The generated code for D function main
must
be labeled d$main:
.
Code generation for a return
statement has two parts. First, compile the expression following keyword return
by calling the exp()
method in the compiler. When that
code is executed at runtime, the value of the return expression will wind up in eax
, which is
where it needs to be. Then we can generate the actual function
return. This code needs to restore ebp
and esp
,
followed by a ret
instruction. Note that the caller is responsible for popping the
arguments, so we don't need to know how many arguments the function has when we compile
return
.
exp(); // compile expression, leaves result in eax gen(mov esp,ebp); // free local vars gen(pop ebp); // restore frame pointer gen(ret); // return to caller
The code to call a function should push the arguments, then jump to the label at the beginning of the function's code. After the function returns, pop the arguments from the stack.
// repeat the following for each function argument: exp(); // compile expression, leaves result in eax gen(push eax); // generate the actual function call gen(call function_label); // call function gen(add esp,#_bytes_of_arguments); // pop arguments
The get
and put
functions are called like any others.
Insert get
and put
into the
global (function) symbol table by hand when initializing the parser. Once that's
done, calls to these functions should compile exactly like any ordinary
function. You should use get
and put
as the
function names in the generated call
instructions (not d$get
and d$put
). These are regular C functions defined elsewhere,
not in your compiled code (details below).
To implement the control-flow statements (if
and while
), we need to generate
combinations of conditional and unconditional jumps to labels in the assembly
code. Since a label can only appear once in a MASM program, each time we
compile one of these statements we need a new set of label names.
We suggest that you include in your compiler a function that yields a new, distinct label name each time it is called.
// yield a different label every time this method is evaluated String newLabel() { ... }
The idea is that each time newLabel()
is executed, it should return a new string, say "L1
", "L2
",
"L3
", etc. The compiler routines for control flow
can call it as needed to get unique labels.
To compile while (bool-exp) stmt
, create two new labels, say L1
and
L2
, then generate code as follows:
gen(L1:); // generate label at top of loop boolExp(...); // compile condition jmp if false to L2 stmt(); // compile loop body gen(jmp L1); // branch to top of loop gen(L2:); // generate label following loop
To compile if (bool-exp) stmt1 else stmt2
, generate two new labels,
say L3
and L4
, then generate the following
boolExp(...); // compile condition jmp if false to L3 stmt(); // compile stmt1 gen(jmp L4); // skip past stmt2 gen(L3:); // generate label at start of stmt2 stmt(); // compile stmt2 gen(L4:); // generate label following if statement
If there's no else
part, the code for stmt2
, L4
, and the jmp to
L4
should be
omitted.
The last piece of the puzzle is compiling boolean expressions. We'll sketch out here how this can be done for while
loops. if
statements can be handled similarly.
The basic idea is that the compiler methods for bool-exp and rel-exp need an additional parameter - the target label for the conditional jump. These routines then generate code to evaluate the conditional expression and generate a jump to the target label if the condition is false. Suppose we are translating
while (x == y) stmtIf the label at the end of the loop is, say,
"L123"
, then relExp("L123")
should do something like this to compile the condition:exp(); // compile left operand (x); result in eax gen(push eax); // save left operand on stack exp(); // compile right operand (y); result in eax gen(pop ecx); // pop left operand into ecx gen(cmp ecx,eax); // compare operands gen(jne L123); // jump if condition false
That handles simple conditions. The next step is handling
the logical not (!
) operator. Suppose the condition in the
loop is
while (!(x == y)) stmtand the target label at the bottom of the loop is again
"L123"
. We need to generate
exactly the same code as above, except that the sense of the test is
reversed. Instead of jne
to exit the loop if the expressions
(x
and y
) have different values, we need to generate je
to exit if the
values are equal.So relExp("L123")
needs to generate a "jump if
false" if the loop condition is simple, and a "jump if true" if
the loop condition contains a logical not (!
). We can get
this flexibility by adding an extra parameter to relExp
to control
the sense of the test. Here is one possibility:
// Compile rel-exp ::= exp==exp | exp>exp // If invert is false, generate a "jump false" to label // If invert is true, generate a "jump true" to label void relExp(String label, boolean invert) { ... }
The compiler function that processes bool-exp can then call relExp()
with the second parameter value false
if the condition is simple,
and with the second parameter true
if the bool-exp contains
a logical not (!
).