This is a sketch of the assembly language code that needs to be generated for the different kinds of statements and expressions in D. You'll need to work out the details, but, with luck, this makes the overall organization of code generation in the compiler reasonably clear.
Most of this discussion uses an approximation of the standard Intel
assembler syntax. In particular, generic instruction names
like mov
are used instead of gcc
names
that include the data length like movl
, and two-operand
instructions show the destination on the left instead of the
right. Information about the format needed by the GNU
assembler as
, which is used by gcc
, appear
in a section towards the end.
Here are a few general suggestions to keep in mind while implementing code generation.
/* write code string s and a comment to the assembly language output file */ void gen(char* s, char* comment) { ... } /* write an instruction with operation op and operands src and dst followed by a comment to the output */ void gen2opinstr(char* op, char* src, char* dst, char* comment) { ... }These functions probably just call an appropriate
printf()
function to write the data. While
you could call printf()
directly throughout the
compiler, putting the actual output statement(s) in separate
functions isolates the rest of the compiler from details of the
output stream, and makes it easier to make changes if necessary.
These are just suggestions, write what works best for you. (You are
not required to use the routines from io.c
, and you
are also free to modify the routines in io.c
.)Use the regular x86 Win32 C calling and register conventions (see the notes on x86 assembly language), with one major exception. It will simplify the implementation if function arguments are pushed on the stack from left to right, not right to left as in the standard C calling conventions.
To simplify the handling of expressions, use the following
invariant: when it is executed at runtime, the code generated for
any expression or subexpression (including a function
call) must leave the result in register eax
.
A real compiler would make much more sophisticated use of the
available registers, but this rule will greatly simplify code
generation for this project.
There really isn't much processing to do here, because we will let the assembler and loader handle details of resolving labels in the code to actual memory locations. The main reason for the global symbol table is to allow checks for things like multiple definitions of functions, inconsistent number of parameters in different calls to the same function, etc. (to the extent that you want to do this).
The key thing needed is to determine offsets in the function's
stack frame for each parameter and local variable. Either the actual
offset or information you will need to calculate the offset should be
stored in the local symbol table for the current function. When a
variable or parameter is encountered in a D statement or expression,
use the offset found in the symbol table entry to generate the address
needed to reference the variable or parameter,
i.e., ebp+offset
.
The code for a binary arithmetic expression needs to evaluate both
operands, then calculate the result and leave it
in eax
. Since evaluation of each operand
changes eax
, overwriting whatever was there previously,
we need to temporarily store the left operand while the right one is
evaluated. Pushing it onto the stack is the most straightforward
way to do this. Once the right operand is in
eax
, pop the left operand (don't leave it on the stack), and compute
the
result.
Example: Code generation in term()
for a multiplication
operation will look something like this:
factor(); // compile left operand; result in eax gen(push eax); // push onto stack factor(); // compile right operand; result in eax gen(pop ecx); // pop left operand from stack gen(imul eax,ecx); // final product in eax
Gotchas: Watch out for non-commutative arithmetic operations.
When compiling a-b
, be sure to generate code that
calculates a-b
, not b-a
. Also, be sure that
everything that is pushed onto the stack is popped eventually.
Extra junk left on the stack will create havoc when the compiled code
is executed.
There are several kinds of factors. Regardless of
which one is being compiled, the generated code should leave the
result in eax
.
The trivial case is when the factor is a parenthesized
expression. The recursive call to exp()
that
compiles the enclosed expression will generate all of the necessary
code.
An integer constant is almost as simple. Generate
a mov
instruction to load the constant
into eax
.
The interesting case is how to handle a factor that begins
with an identifier. Answer: If the identifier is a name in the
local symbol table, meaning it is a parameter or local variable,
generate code to load it into eax
. Otherwise, assume it
is a function call. Generate code to evaluate and push arguments, then
call it (details below). In our case you can also peek ahead to see
if a LPAREN is coming next.
exp(); // compile exp; result in eax gen(mov [ebp+offset],eax); // copy result into variable or parameter
Code generated for a function definition consists of a prologue to
set up the stack frame, followed by the code for the statements in the
function body. The function that compiles a
function-def should not generate code to return from
the function; that should be handled in the function that
compiles return
statements.
gen(function_label:); // generate label for function body gen(push ebp); // save ebp gen(mov ebp,esp); // set up frame pointer gen(sub esp,#_bytes_for_locals); // allocate space for locals <generated code for function body follows>
For the most part, we can use the name of the function as a label
in the assembly language code. However, the D
function main
is not the real starting point for
execution of the compiled code. Function main
in
the D program will actually be called by a C program that contains
the real main
(starting point), so the generated code
for the D main function can't use that name. Instead, the main
function should begin with the label _d$main:
so it
matches the name we specified in the bootstrap code
in druntime.c
. The prepended underscore character
on _d$main
is required for functions we will call from
our C bootstrap program when using cygwin (which is
only _d$main
). Otherwise you can name the labels for
functions other than main anything you would like (probably the name
of the function is a good choice: foo, bar, etc. ).
Code generation for a return
statement has two
parts. First, compile the expression following
keyword return
by calling the exp()
function in the compiler. When that code is executed at
runtime, the value of the return expression will wind up
in eax
, which is where it needs to be. Then we can
generate the actual function return. This code needs to
restore ebp
and esp
, then execute
a ret
instruction. Note that the caller is responsible
for popping the arguments, so we don't need to know how many
arguments the function has when we compile a return
statement.
exp(); // compile expression, leaves result in eax gen(mov esp,ebp); // free local vars (restore esp) gen(pop ebp); // restore frame pointer (ebp) gen(ret); // return to caller
The code to call a function should push the arguments, then jump to
the label at the beginning of the function's code with
a call
instruction. After the function returns, pop the
arguments from the stack. In general you should call the
function_label exactly as you generated it (eg. If for
function foo
you generated the label: foo:
,
you should call it as "call foo"
). One exception to this
rule occurs when using cygwin and calling the functions get
and put
defined in druntime.c
. To call
these you should prepend an underscore to the label: call
_get
or call _put
.
// repeat the following for each function argument: exp(); // compile expression, leaves result in eax gen(push eax); // generate the actual function call gen(call function_label); // call function gen(add esp,#_bytes_of_arguments); // pop arguments
The get
and put
functions are called
like any others (with the exception of the prepended underscore
noted above). Insert get
and put
into
the global (function) symbol table by hand when initializing the
parser module, before starting to parse the program. Once that's
done, calls to these functions should compile exactly like any other
function. You should use _get
and _put
as the function names in the
generated call
instructions. These are regular C
functions defined in the bootstrap program, not in your compiled
code, and will be found when the program is linked. It is probably
easiest to put the names get
and put
into
the symbol table as this is what will be found in a D program trying
to call these functions. Just remember the special case of
prepending the underscore when you generate code.
To implement the control-flow statements (if
and while
), we need to generate combinations of
conditional and unconditional jumps to labels in the assembly
code. Since a label can only appear once in an assembler
program, each time we compile one of these statements we need a new
set of label names.
We suggest that you include in your compiler a function that yields a new, distinct label name each time it is called.
/* store in lbl a different label every time this function is called */ void newLabel(char* lbl) { ... }
The idea is that each time newLabel()
is executed, it should
store in lbl
a new string, say "L1
", "L2
",
"L3
", etc. The compiler routines for control flow
can call this function as needed to create unique labels.
To compile while (bool-exp) stmt
, create two new
labels, say L1
and
L2
, then generate code as follows:
gen(L1:); // generate label at top of loop boolExp(...); // compile condition jmp if false to L2 stmt(); // compile loop body gen(jmp L1); // branch to top of loop gen(L2:); // generate label following loop
To compile if (bool-exp) stmt1 else stmt2
, generate
two new labels, say L3
and L4
, then generate
the following
boolExp(...); // compile condition jmp if false to L3 stmt(); // compile stmt1 gen(jmp L4); // skip past stmt2 gen(L3:); // generate label at start of stmt2 stmt(); // compile stmt2 gen(L4:); // generate label following if statement
If there's no else
part, the code
for stmt2
, L4
, and the jmp to
L4
should be
omitted.
The last piece of the puzzle is compiling boolean
expressions. We'll sketch out here how this can be done
for while
loops. The conditions
in if
statements can be handled similarly.
The basic idea is that the compiler functions for bool-exp
and rel-exp need an additional parameter - the target label for
the conditional jump. These routines then generate code to 1)
evaluate the conditional expression and 2) generate a jump to the
target label if the condition is *false*. Note that for
both while
loops and if
statements the code
that is required after evaluating the conditional expression is
jumping if the condition is FALSE. All other jumps needed for those
statements are unconditional jmp
instructions.
Suppose we are translating
while (x == y) stmtIf the label at the end of the loop is, say,
"L123"
,
then relExp("L123")
should do something like
this to compile the condition:
exp(); // compile left operand (x); result in eax gen(push eax); // save left operand on stack exp(); // compile right operand (y); result in eax gen(pop ecx); // pop left operand into ecx gen(cmp ecx,eax); // compare operands gen(jne L123); // jump if condition false
That handles simple conditions. The next step is handling
the logical not (!
) operator. Suppose the condition in the
loop is
while (!(x == y)) stmtand the target label at the bottom of the loop is again
"L123"
. We need to generate
exactly the same code as above, except that the sense of the test is
reversed. Instead of jne
to exit the loop if the expressions
(x
and y
) have different values, we need to generate je
to exit if the
values are equal.So relExp("L123")
needs to generate a "jump if
false" if the loop condition is simple, and a "jump if true" if
the loop condition contains a logical not (!
). We can get
this flexibility by adding an extra parameter to relExp
to control
the sense of the test. Here is one possibility:
// Compile rel-exp ::= exp==exp | exp>exp // If invert is false, generate a "jump false" to label // If invert is true, generate a "jump true" to label void relExp(char* label, int invert) { ... }
The compiler function that processes bool-exp can then
call relExp()
with the second parameter
value false
if the condition is simple, and with the
second parameter true
if the bool-exp contains a
logical not (!
).
The gcc
compiler generates assembly language code that
is then translated to machine language by the
assembler as
. We can also use as
to
translate hand-written or, in our case, compiler-written text files
containing assembly language code.
The format of an assembly language file, for the purposes of this project, is quite simple. The file should begin with the following two lines:
.text .globl _d$main
Then somewhere later on in the file (not necessarily as the first
function in your file) you will have the code for
the main
function from your D program which should begin
with the label _d$main:
. The underscore is required in
these two places if you will be running your code with cgywin.
The first line specifies that the following lines contain code,
not data, and the second line specifies that the function
label _d$main
should be made available globally as an
external function, so it can be called by the C bootstrap routine.
The rest of the code generated by your compiler should appear
after these two lines.
For historical reasons, as
uses a syntax for x86
assembly language that is inherited from AT&T Unix and is not
the same as the standard syntax in all of the Intel and Microsoft
documentation. The changes needed are slightly annoying, but not
too difficult. Here is a summary of the ones we need for this
project.
Intel/Microsoft |
AT&T/as |
|
order of operands |
op a,b means a = a op b (dst first) |
op a,b means b = a op b (dst last) |
memory addresses |
[baseregister+offset] |
offset(baseregister) |
instruction names |
mov , add , push , etc. |
include size as a suffix; movl , addl ,
pushl , etc. for 32-bit longwords |
register names |
eax , ebx , ebp , esp , etc. |
%eax , %ebx , %ebp , %esp , etc. |
constants |
17 , 42 |
$17 , $42 |
The assembly language file can contain comments, which for this
project should include (at least) the source program lines written
by the io.c
routines. Comments in as
source files can be formatted using either the regular C /*
... */
syntax, or by placing a # character at the start of
the comment. The io.c
routines as given print input
lines with a prefix
of "# " at the
beginning, which means they will be interpreted as comments in the
assembly file (what we want). Feel free to add other comments to
the individual assembly instructions you generate to make it
easier to follow.