The 32-bit subset of the x86 instruction set is a fairly clean assembly language.
It's difficult to tell this from the various Intel
and Microsoft documentation, because the x86 is backwards compatible
with 25-year old processors and operating systems, and all of that
is described in these sources. This note is an attempt to highlight what we
need for the compiler project. It describes the basic x86 instruction set and
assembly language.
The standard Intel/Microsoft assembler notation is used
here, since that is consistent with most of the available sources and published
documentation. The GNU assembler, which we will use in our project, has differences
in syntax, the principle one being the order of source and destination operands
in instructions are reversed.
Full details about the x86 instruction set are available in the Intel Architecture reference manuals. The two most useful are Volume I, which gives an overview of the architecture, and Volume II (now split into two volumes, IIA and IIB), which describes the individual instructions in excruciating detail.
An assembly language statement has the following format
optLabel: opcode operands ; comment
optLabel:
is an optional label. Opcode
and operands
form the assembly
language instructions. A semicolon indicates the beginning of a
comment, which continues to the end of the source line.
The layout of an assembly language program is fairly free-form. A
label may appear on a line by itself, as may a comment. An opcode
and
its operands should appear on the same line.
aLabel: mov eax,7 ; meaningful comment
There are no particular formatting requirements. For readability, it's best to start labels in the first column, indent opcodes and operands, and line these up in columns, but don't obsess over this in your compiler output.
Memory is composed of bytes, each of which has 8 bits. A 32-bit integer is stored in 4 consecutive bytes, and it's best if the address of the integer or pointer is a multiple of 4 (there is a huge performance penalty on modern processors for fetching misaligned data). This alignment restriction is observed by default and you shouldn't have to worry about it. Just be sure not to alter it by, for example, adding or subtracting a value from the stack pointer that is not a multiple of 4. The address of an integer is the address of the low-order byte, i.e., an integer at location 1000 consists of bytes 1000-1003; where 1000 contains the low-order bits and 1003 contains the high-order bits. Again, you probably don't have to worry about this.
For historical reasons, the Intel x86 descriptions refer to 4-byte quantities as "double words".
There are 8 registers that can be specified in assembly-language
instructions: eax
, ebx
, ecx
, edx
,
esi
, edi
, ebp
, and esp
. Register
esp
points to the "top" (double) word currently in use on the stack
(which grows down). It is unwise to use esp
for anything else.
Register ebp
is typically used as a pointer to a location in the stack frame of
the currently executing function.
In the CSE 413 compiler project, which will have a simple code generator, use register eax
as the primary arithmetic
register. Register ecx
can be used in binary arithmetic operations to hold
the second
operand. In a industrial-strength compiler, it's important to optimize the use of registers, but that
is beyond the scope of project for this course.
There are two registers that are used implicitly in x86 programs and cannot be referenced
by name in an assembly language program.
These are eip
, the "instruction pointer" or "program counter"; and
eflags
, which contains bits indicating the result of arithmetic and
compare instructions. (More details about eflags
appears below in the description
of conditional jump instructions.)
The basic operation of the processor is to repeatedly fetch and execute instructions.
while (running) { fetch instruction beginning at address in eip; eip <- eip + length of instruction; execute fetched instruction; }
Execution continues sequentially unless execution of an instruction
causes a jump, which is done by storing the target address in eip
(this is how conditional
and unconditional jumps, and function call and return are implemented).
The format of a typical data manipulation instruction is
opcode dst,src
Normally, one of the operands may be a register, memory location, or an integer constant; the other should be a register. In particular, it is normally not possible for both operands to refer to a memory location.
All of the variables in our compiled programs will be allocated in stack frames and referenced by addresses relative to ebp
. So the
following operand formats should be enough for our purposes (using mov
as an example):
mov eax,17 ; store 17 in eax mov eax,ecx ; copy contents of ecx into eax mov eax,[ebp-12] ; copy contents of memory location with address ; ebp-12 into eax mov [ebp+8],eax ; copy contents of eax into memory location with ; address ebp+8
Additional address formats are sometimes used, and sometimes additional information needs to be provided to the Microsoft assembler so it can pick the correct instruction. The most likely way to run into this is with instructions that use pointers to access data, when there might not be enough information for the assembler to decide whether the pointer is referring to an 8-bit byte, a 16-bit word, or a 32-bit "double" word. If the assembler complains about this, add the string "dword ptr" in front of a pointer operand to specify that you want to reference a 32-bit value. (But don't bother with this unless you need it.)
mov dword ptr [ebp+8],eax ; same as last example above, ; but data length specified explicitlyThis is not an issue with the GNU assembler, because different forms of instructions like
mov
have different names, like movl
to indicate
a 4-byte longword move.
This list describes the basic x86 data manipulation instructions, including some that you probably won't need for the compiler project.
mov dst,src ; dst <- src add dst,src ; dst <- dst+src sub dst,src ; dst <- dst-src imul dst,src ; dst <- dst*src (32-bit product; dst must be ; a register) idiv dst ; divide edx:eax by dst (edx:eax = 64-bit signed ; integer); eax <- quotient, edx <- remainder cdq ; edx:eax <- sign-extended copy of eax (useful prep for idiv) inc dst ; dst <- dst + 1 dec dst ; dst <- dst - 1 neg dst ; dst <- - dst (2's complement arithmetic negation) and dst,src ; dst <- dst & src (logical and) or dst,src ; dst <- dst | src (logical or) xor dst,src ; dst <- dst ^ src (logical exclusive or) test dst,src ; same as and, but only sets cond codes; dst not changed not dst ; dst <- ~ dst (logical complement) shl dst,count ; dst <- dst shifted left count bits shr dst,count ; dst <- dst shifted right count bits, 0 fill sar dst,count ; dst <- dst shifted right count bits, sign bit fill shld dst,fill,cnt ; dst <- dst shift left cnt bits; fill from fill (dbl shift) shrd dst,fill,cnt ; dst <- dst shift right cnt bits; fill from fill (dbl shift) rol dst,count ; dst <- dst rotated left count bits ror dst,count ; dst <- dst rotated right count bits push src ; esp <- esp-4; memory[esp] <- src ; (i.e., push src onto stack) pop dst ; dst <- memory[esp]; esp <- esp+4 ; (i.e., pop top of stack into dst ; and remove it from the stack) lea dst,src ; dst <- address of src (including any indexing or offset computation; dst must be a register)
The only control flow instructions at the assembly language level are unconditional and conditional jumps (goto). Higher-level control structures like loops and conditional statements are synthesized from these.
jmp label ; jump to instruction prefixed with label
A jmp
instruction is executed by storing the address of label
in register eip
, which is then used to fetch the next instruction to be executed.
Note: For both unconditional and conditional branches, the assembly language label need not be on the same source line as the target instruction. For example, if the code contains
label1: ; a comment label2: ; more commentary label3: push eax
a jump to either label1
, label2
, or label3
will cause execution to
continue at the push
instruction. This freedom can be helpful when writing
the code generation part of the compiler, because it allows us to place labels
in the assembly language program separately from the instructions that they
label.
Many arithmetic instructions set bits in register eflags
, as
does the cmp
(compare) instruction.
The following instructions
can be used after an arithmetic instruction (add
, sub
,
and
, or
, etc., but not imul
or idiv
)
to conditionally branch depending on the result of the arithmetic operation.
jz label ; jump to label if result == 0 jnz label ; jump to label if result != 0 jg label ; jump to label if result > 0 jng label ; jump to label if result <= 0 jge label ; jump to label if result >= 0 jnge label ; jump to label if result < 0 jl label ; jump to label if result < 0 jnl label ; jump to label if result >= 0 jle label ; jump to label if result <= 0 jnle label ; jump to label if result > 0
More useful for our project are conditional jumps following a cmp
(compare) instruction. The idea is to use cmp
to set bits in eflags
,
then execute an appropriate jump that branches or not depending on the bits set by the cmp
instruction.
A compare and jump works like this
cmp op1,op2 ; compare op1 to op2 and set bits in eflags jcc label ; jmp to label if the condition cc matches ; the current bit values in eflags
The various jcc
instructions are defined as follows,
when executed following a cmp
instruction.
je label ; jump to label if op1 == op2 jne label ; jump to label if op1 != op2 jg label ; jump to label if op1 > op2 jng label ; jump to label if op1 <= op2 jge label ; jump to label if op1 >= op2 jnge label ; jump to label if op1 < op2 jl label ; jump to label if op1 < op2 jnl label ; jump to label if op1 >= op2 jle label ; jump to label if op1 <= op2 jnle label ; jump to label if op1 > op2
In several cases, a single machine instruction has more than one name in assembly language. For instance, jz
and je
are different names for a single instruction; the different names are intended to be used in different contexts in the hopes of making the code more readable or suggestive of the operation being done. Pick a name that seems appropriate and don't worry about it; but if you see the other name in a debugger, for example, you'll know what's going on.
The call
and ret
instructions provide the basic mechanism for
transferring control. A call
pushes the
instruction pointer eip
(which contains the address of the instruction
following the call), then jumps. A ret
pops the value
on the top of the stack into eip
, which means execution continues at
the popped address, which should be the return address pushed by the call
instruction.
call label ; push current instruction pointer and jump to label ; i.e., esp <- esp-4; memory[esp] <- eip; ; eip <- memory address of label ret ; pop top of stack into eip (jumping to that location) ; i.e., eip <- memory[esp]; esp <- esp+4
The call
and ret
instructions allow us to jump to a
function and get back, but additional conventions are needed to
specify how registers are saved and restored, how arguments are
transmitted, and how a function result is made available to the
caller.
To call a function named, for example, f
, the function arguments
should be pushed onto the stack, then a call
instruction is executed to branch to
a label placed at the first instruction of the function. The instruction
following a call should adjust esp
to pop the arguments off the stack
after the function returns (the called function does not pop the arguments). The scheme is
push arguments call f add esp,"size of arguments in bytes"
In a program with only integer and pointer variables, the quantity added to
esp
after the function returns
should be 4 times the number of arguments. Each argument is a 32-bit quantity occupying 4 bytes.
The standard Win32 C calling sequence requires pushing the function arguments from right to left. For the CSE 413 project, it is probably easier to push the arguments from left to right. That will work fine provided that you do not attempt to call an external C routine that has more than one argument.
A function needs to begin with code to save the contents of some of the registers
it is going to use, set up ebp
to point to a location in the stack frame, and
allocate space for local variables by decrementing the stack pointer
appropriately. The code generated for a return
statement
should restore the stack and associated registers to the state that existed just
before the first instruction of the function was executed.
Here is a template for a function named f
f: push ebp ; save ebp on stack mov ebp,esp ; set frame pointer to current stack location sub esp,"size of locals in bytes" ; allocate local variables <code generated for the body of function f goes here> mov esp,ebp ; restore esp to value before local variables ; were allocated (i.e., pop locals) pop ebp ; restore ebp to value saved at beginning of f ret ; return to caller (instruction following call)
When the function returns, registers esp
, ebp
, ebx
,
esi
, and edi
must have their original values. The above code properly saves and
restores esp
and ebp
. In your project, you shouldn't need to use any of the other
registers in the must-be-preserved list, but if you do, you need to save and restore them
(use a push
at the beginning of the function and a pop
at the end).
For functions that return an int
or pointer result,
the value should be in eax
at the time the function returns.
Here is a short example of a Win32 C function call and function definition, with
source lines included as comments in the assembly language code.
The code
generated by your compiler won't necessarily look like this;
in particular, you
will probably push arguments from left to right (not right to left),
which affects the offsets of parameters in the function stack frame, and there are
likely to be lots of extra push
and pop
instructions, among other things, in the compiler-generated code.
; n = sumOf(17,42); push 42 ; push arguments (Win32 convention, R-to-L) push 17 call sumOf ; push return address and jump to label sumOf add esp,8 ; pop arguments after return mov [ebp+??],eax ; store function result in n (where ?? is the ; offset of n in the caller's stack frame)
; int sumOf(int x, int y) { ; int a, b; sumOf: ; function label push ebp ; function prologue (see above) mov ebp,esp sub esp,8 ; allocate local variables ; a = y; mov eax,[ebp+12] ; load parameter y mov [ebp-4],eax ; store local variable a ; b = x; mov eax,[ebp+8] ; load x mov [ebp-8],eax ; store b ; return a+b mov eax,[ebp-4] ; calculate a+b push eax mov eax,[ebp-8] pop ecx add eax,ecx ; result in eax mov esp,ebp ; function return epilogue pop ebp ret