This is a sketch of the assembly language code that needs to be generated for the different kinds of statements and expressions in D. You'll need to work out the details, but, with luck, this makes the overall organization of code generation in the compiler reasonably clear.
Most of this discussion uses an approximation of the standard Intel assembler
syntax. In particular, generic instruction names like mov
are
used instead of gcc
names
that include the data length like movl
, and two-operand instructions
show the destination on the left instead of the right. Information about the
format needed by the
GNU assembler as
, which is used by gcc
, appear in
a section towards the end.
Here are a few general suggestions to keep in mind while implementing code generation.
/* write code string s to the assembly language output file */ void gen(char* s) { ... } /* write an instruction with operation op and operands src and dst to the output */ void geninstr(char* op, char* src, char* dst) { ... }This functions probably just calls an appropriate
printf()
function to write the data. While you could call printf()
directly throughout the compiler, putting the actual output statement(s) in
separate functions isolates the rest of the compiler from details of the output
stream, and makes it easier to make changes if necessary.Use the regular x86 Win32 C calling and register conventions (see the notes on x86 assembly language), with one major exception. It will simplify the implementation if function arguments are pushed on the stack from left to right, not right to left as in the standard C calling conventions.
To simplify the handling of expressions, use the following invariant: when it
is executed at runtime, the code
generated for any expression or subexpression (including a function call) must
leave the result in register eax
. A real compiler would make much more sophisticated use of the available registers, but this
rule will greatly simplify code generation for this project.
There really isn't much processing to do here, because we will let the assembler and loader handle details of resolving labels in the code to actual memory locations. The main reason for the global symbol table is to allow checks for things like multiple definitions of functions, inconsistent number of parameters in different calls to the same function, etc. (to the extent that you want to do this).
The key thing needed is to determine offsets in the function's stack frame
for each parameter and local variable. This information should
be stored in the local symbol table for the current function. When a variable
or parameter
is
encountered in a D statement or expression, use the offset found in the symbol
table entry to
generate the address needed to reference the variable or parameter, i.e., ebp+offset
.
The code for a binary arithmetic expression needs to evaluate both operands,
then
calculate the result and leave it in eax
. Since evaluation of each
operand changes eax
, overwriting whatever was there previously,
we
need to temporarily store the left operand while the right one is evaluated.
Pushing
it
onto the stack is the most straightforward way to do this. Once the right operand
is in
eax
, pop the left operand (don't leave it on the stack), and compute
the
result.
Example: Code generation in term()
for a multiplication
operation will look something like this:
factor(); // compile left operand; result in eax gen(push eax); // push onto stack factor(); // compile right operand; result in eax gen(pop ecx); // pop left operand from stack gen(imul eax,ecx); // final product in eax
Gotchas: Watch out for non-commutative arithmetic operations. When
compiling a-b
, be
sure to generate code that calculates a-b
, not b-a
. Also, be sure that
everything that is pushed onto the stack is popped eventually. Extra junk
left on the stack will create havoc when the compiled code is executed.
There are several kinds of factors.
Regardless of which one is being compiled, the generated code should leave the result in eax
.
The trivial case is when the factor is a parenthesized expression. The
recursive call to exp()
that compiles the enclosed expression will generate all of the necessary
code.
An integer constant is almost as simple. Generate a mov
instruction to load the constant into eax
.
The interesting case is how to handle a factor that begins with an
identifier. Answer: If the identifier is
a name in the local symbol table, meaning it is a parameter or local variable,
generate code to load it into eax
. Otherwise, assume it is a function
call. Generate code to evaluate and push arguments, then call it (details
below).
exp(); // compile exp; result in eax gen(mov [ebp+offset],eax); // copy result into variable or parameter
Code generated for a function definition consists of a prologue to set up the
stack frame, followed by the code for the statements in the function body. The
function that compiles a
function-def should not generate code to return from the function;
that should be handled in the
function that compiles return
statements.
gen(function_label:); // generate label for function body gen(push ebp); // save ebp and set up frame pointer gen(mov ebp,esp); gen(sub esp,#_bytes_for_locals); // allocate space for locals <generated code for function body follows>
For the most part, we can use the name of the function as a label in the assembly
language code. However, the D function main
is
not the real starting point for execution of the compiled code. Function main
in
the D program will actually be called by a C program that contains the real main
(starting
point),
so the generated code for the D main function can't use that name. Instead,
the main function should begin with the label d$main:
so
it will link properly with the bootstrap code.
Code generation for a return
statement has two parts. First, compile the expression following keyword return
by calling the exp()
function in the compiler. When that
code is executed at runtime, the value of the return expression will wind up in eax
, which is
where it needs to be. Then we can generate the actual function
return. This code needs to restore ebp
and esp
,
then execute a ret
instruction. Note that the caller is responsible for popping the
arguments, so we don't need to know how many arguments the function has when we compile
a return
statement.
exp(); // compile expression, leaves result in eax gen(mov esp,ebp); // free local vars (restore esp) gen(pop ebp); // restore frame pointer (ebp) gen(ret); // return to caller
The code to call a function should push the arguments, then jump to the label at the beginning of the function's code with a call instruction. After the function returns, pop the arguments from the stack.
// repeat the following for each function argument: exp(); // compile expression, leaves result in eax gen(push eax); // generate the actual function call gen(call function_label); // call function gen(add esp,#_bytes_of_arguments); // pop arguments
The get
and put
functions are called like any
others.
Insert get
and put
into the
global (function) symbol table by hand when initializing the parser module,
before starting to parse the program. Once that's
done, calls to these functions should compile exactly like any other
function. You should use get
and put
as the
function names in the generated call
instructions. These are
regular C functions defined in the bootstrap program,
not in your compiled code, and will be found when the program is linked.
To implement the control-flow statements (if
and while
),
we need to generate combinations of conditional and unconditional jumps to
labels in the assembly
code. Since a label can only appear once in an assembler program, each
time
we
compile one of these statements we need a new set of label names.
We suggest that you include in your compiler a function that yields a new, distinct label name each time it is called.
/* store in lbl a different label every time this function is called */ void newLabel(char* lbl) { ... }
The idea is that each time newLabel()
is executed, it should
store in lbl
a new string, say "L1
", "L2
",
"L3
", etc. The compiler routines for control flow
can call this function as needed to create unique labels.
To compile while (bool-exp) stmt
, create two new labels, say L1
and
L2
, then generate code as follows:
gen(L1:); // generate label at top of loop boolExp(...); // compile condition jmp if false to L2 stmt(); // compile loop body gen(jmp L1); // branch to top of loop gen(L2:); // generate label following loop
To compile if (bool-exp) stmt1 else stmt2
, generate two new labels,
say L3
and L4
, then generate the following
boolExp(...); // compile condition jmp if false to L3 stmt(); // compile stmt1 gen(jmp L4); // skip past stmt2 gen(L3:); // generate label at start of stmt2 stmt(); // compile stmt2 gen(L4:); // generate label following if statement
If there's no else
part, the code for stmt2
, L4
, and the jmp to
L4
should be
omitted.
The last piece of the puzzle is compiling boolean expressions. We'll sketch out here how this can be done for while
loops. The conditions in if
statements can be handled similarly.
The basic idea is that the compiler functions for bool-exp and rel-exp need an additional parameter - the target label for the conditional jump. These routines then generate code to evaluate the conditional expression and generate a jump to the target label if the condition is false. Suppose we are translating
while (x == y) stmtIf the label at the end of the loop is, say,
"L123"
, then relExp("L123")
should do something like this to compile the condition:exp(); // compile left operand (x); result in eax gen(push eax); // save left operand on stack exp(); // compile right operand (y); result in eax gen(pop ecx); // pop left operand into ecx gen(cmp ecx,eax); // compare operands gen(jne L123); // jump if condition false
That handles simple conditions. The next step is handling
the logical not (!
) operator. Suppose the condition in the
loop is
while (!(x == y)) stmtand the target label at the bottom of the loop is again
"L123"
. We need to generate
exactly the same code as above, except that the sense of the test is
reversed. Instead of jne
to exit the loop if the expressions
(x
and y
) have different values, we need to generate je
to exit if the
values are equal.So relExp("L123")
needs to generate a "jump if
false" if the loop condition is simple, and a "jump if true" if
the loop condition contains a logical not (!
). We can get
this flexibility by adding an extra parameter to relExp
to control
the sense of the test. Here is one possibility:
// Compile rel-exp ::= exp==exp | exp>exp // If invert is false, generate a "jump false" to label // If invert is true, generate a "jump true" to label void relExp(char* label, int invert) { ... }
The compiler function that processes bool-exp can then call relExp()
with the second parameter value false
if the condition is simple,
and with the second parameter true
if the bool-exp contains
a logical not (!
).
The gcc
compiler generates assembly language code that is then translated
to machine language by the assembler as
. We can also use as
to
translate hand-written or, in our case, compiler-written text files containing
assembly language code.
The format of an assembly language file, for the purposes of this project, is quite simple. The file should begin with the following two lines
.text .globl d$main
The first line specifies that the following lines contain code, not data,
and the second line
specifies that the function label d$main
should be made available
globally as
an external function, so it can be called by the C bootstrap routine.
The rest of the code generated by your compiler should appear after these two
lines.
For historical reasons, as
uses a syntax for x86 assembly language
that is inherited from AT&T Unix and is not the same as the standard syntax
in all of the Intel and Microsoft documentation. The changes needed are slightly
annoying, but not too difficult. Here is a summary of the ones we need for
this project.
Intel/Microsoft |
AT&T/as |
|
order of operands |
op a,b means a = a op b (dst first) |
op a,b means b = a op b (dst last) |
memory addresses |
[baseregister+offset] |
offset(baseregister) |
instruction names |
mov , add , push , etc. |
include size as a suffix; movl , addl ,
pushl , etc. for 32-bit longwords |
register names |
eax , ebx , ebp , esp , etc. |
%eax , %ebx , %ebp , %esp , etc. |
constants |
17 , 42 |
$17 , $42 |
The assembly language file can contain comments, which for this project
should include the source program lines written by the io.c
routines.
There is a change you need to make. Comments in as source files
can be formatted using either the regular C /* ... */
syntax,
or by placing a # character at the start of the comment. As originally
distributed, the io.c routines print input lines with a ";" at
the beginning. You should change the initial value of the string prefix
in io.c
to "# "
instead of "; ".