CSE 351, section 2: Designing an Assembler

David Cohen

1. Writing Machine Code Is No Fun

I started off with a line of binary on the board:

001100001111000111111100000000110000000000000000

It didn't take the class long to figure out that we were looking at a z86 machine instruction we were familiar with after assignment 2A: the instruction to put the value 1020 into register 1. I asked the class to discuss assignment 2A and what everyone thought of writing machine code. What in particular was an awful headache about it? The collection of complaints we came up with was:

2. Identifying Ways to Automate and Improve the Process

[As we brainstormed these ideas, I was pretty loose about what to call the helper program we were developing, depending on what function we were discussing: assembler, translator, compiler, preprocessor, etc.]

Machine Code Validator

For the most basic helper, a verifier could read the machine code bytes and validate all the encodings (e.g., the correct number of bytes for each instruction) and print out what each instruction does, so that the programmer could verify that's what he or she intended. The class agreed it would be a useful debugging tool.
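
As a rough illustration of such a validator, here is a minimal Python sketch that checks and describes only the one instruction from the board. The byte layout (opcode 0x30, a register byte whose high half is 0xF for "no register", then a 4-byte little-endian value) is an assumption read off that bit string in the Y86 style; the function names and the "r1" spelling are made up for this example.

```python
# Assumed Y86-style layout for irmovl: opcode, register byte, 4-byte value.
IRMOVL_OPCODE = 0x30  # assumption: read off the bit string from the board
IRMOVL_LENGTH = 6     # total bytes in the instruction
NO_REGISTER = 0xF     # assumption: marks the unused source-register slot


def bits_to_bytes(bits):
    """Convert a string of '0'/'1' characters into bytes, 8 bits at a time."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string is not a whole number of bytes")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def describe(code):
    """Validate the instruction at the start of `code` and describe it."""
    if not code:
        raise ValueError("no bytes to decode")
    if code[0] != IRMOVL_OPCODE:
        raise ValueError(f"unknown opcode 0x{code[0]:02x}")
    if len(code) < IRMOVL_LENGTH:
        raise ValueError("irmovl needs 6 bytes")
    source, dest = code[1] >> 4, code[1] & 0xF
    if source != NO_REGISTER:
        raise ValueError("irmovl must not name a source register")
    value = int.from_bytes(code[2:6], "little")
    return f"irmovl {value}, r{dest}"


BOARD = "001100001111000111111100000000110000000000000000"
decoded = describe(bits_to_bytes(BOARD))
print(decoded)  # irmovl 1020, r1
```

A fuller validator would need one such check per opcode; the point here is only that the length and field checks are mechanical.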

Simple Assembly Instruction to Machine Code Translator

It was a small step to turn the verifier around into a translator program: the programmer would write assembly instructions (or pseudo-assembly) and the translator would output the machine code on a 1-to-1 basis. We call this an assembler. That, everyone agreed, would go a long way toward fixing the complaint that machine code is hard to understand. Almost everyone in the class approached homework 2A by writing assembly first and then translating it to the machine code for each instruction; in other words, most of the class was already doing this step manually. We now proposed a program to do it for us.
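
The 1-to-1 translation can be sketched in a few lines of Python. This handles a single mnemonic only, and it assumes the same Y86-style encoding as the bit string on the board, 8 registers numbered 0 to 7, and an "irmovl VALUE, rN" spelling; the function name is hypothetical.

```python
import re

IRMOVL_OPCODE = 0x30  # assumption: matches the bit string from the board
NO_REGISTER = 0xF     # assumption: fills the unused source-register slot


def assemble_irmovl(line):
    """Translate one 'irmovl VALUE, rN' line into its 6 machine-code bytes."""
    match = re.fullmatch(r"\s*irmovl\s+(\d+)\s*,\s*r(\d+)\s*", line)
    if match is None:
        raise ValueError(f"cannot assemble: {line!r}")
    value, register = int(match.group(1)), int(match.group(2))
    if register > 7:
        raise ValueError(f"no such register: r{register}")
    if value >= 2 ** 32:
        raise ValueError(f"value does not fit in 4 bytes: {value}")
    register_byte = (NO_REGISTER << 4) | register
    return bytes([IRMOVL_OPCODE, register_byte]) + value.to_bytes(4, "little")


machine_code = assemble_irmovl("irmovl 1020, r1")
print(machine_code.hex())  # 30f1fc030000
```

Each input line produces exactly one instruction, which is what makes this stage easy to check by hand against the original machine code.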

Labels for Jump Targets

Named targets—labels—for jump instructions were a top demand, as it drove everyone nuts to have the machine code break every time we inserted a statement or otherwise modified the code. We discussed how labels would be implemented:
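
One common way to implement labels, offered here as a sketch rather than a record of what the class settled on, is to make two passes: the first pass adds up instruction sizes to learn the address of each label, and the second replaces label names with those addresses. The instruction sizes, the "name:" label syntax, and the mnemonics below are assumptions for illustration.

```python
# Hypothetical instruction sizes in bytes, in the Y86 style.
SIZES = {"irmovl": 6, "addl": 2, "jmp": 5, "halt": 1}


def resolve_labels(lines):
    """Replace label names in jmp instructions with byte addresses."""
    labels = {}
    instructions = []
    address = 0
    # Pass 1: record the address of each label and collect real instructions.
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith(":"):
            labels[line[:-1]] = address
            continue
        instructions.append(line)
        address += SIZES[line.split()[0]]
    # Pass 2: substitute the recorded address for each jump target.
    resolved = []
    for line in instructions:
        parts = line.split()
        if parts[0] == "jmp":
            if parts[1] not in labels:
                raise ValueError(f"undefined label: {parts[1]}")
            line = f"jmp {labels[parts[1]]}"
        resolved.append(line)
    return resolved


program = ["irmovl 1, r2", "loop:", "addl r2, r1", "jmp loop", "halt"]
print(resolve_labels(program))
# ['irmovl 1, r2', 'addl r2, r1', 'jmp 6', 'halt']
```

Inserting another 6-byte instruction at the top of the program moves the target to 12 without the programmer touching the jump, which is exactly the breakage labels are meant to prevent.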

Substituting Variable Names for Register IDs

Next we decided to look for a way to make it easier to remember what each register was being used for. I introduced the idea of preprocessor directives, similar to the "@" directive in the simulator loader: commands for the helper program itself. Preprocessor directives are lines in the assembly file that aren't part of the assembly code; they don't get translated to machine code. A common directive in C is #define, used as "#define foo bar", which tells the C preprocessor to go through the file and replace every instance of "foo" with "bar". For example, the sim.h file in homework 2B contains a series of #define directives that let you refer to z86 opcodes in your C code by name instead of by number.

If we were to implement that directive in our assembly code translator, we could type "#define x r1" and then refer to "x" in our assembly code instead of the register number. But what would happen if we also wrote "#define y r1"? That would work only if x and y should always refer to the same value or object, or if x and y are never live variables at the same time; otherwise a write to y would silently overwrite x.
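
A minimal Python sketch of that substitution step follows. It is a simplification of what the C preprocessor does (single-token aliases only, no nested expansion), and the function name is hypothetical. The sample input deliberately aliases both x and y to r1.

```python
import re


def preprocess(lines):
    """Apply '#define NAME REPLACEMENT' lines to the lines that follow them."""
    aliases = {}
    output = []
    for line in lines:
        parts = line.split()
        if parts and parts[0] == "#define":
            if len(parts) != 3:
                raise ValueError(f"malformed directive: {line!r}")
            aliases[parts[1]] = parts[2]
            continue  # directives are not passed on to the assembler
        # Replace whole identifiers only, so 'x' never matches inside 'xor'.
        output.append(re.sub(r"[A-Za-z_]\w*",
                             lambda m: aliases.get(m.group(0), m.group(0)),
                             line))
    return output


source = ["#define x r1", "#define y r1", "irmovl 1020, x", "irmovl 7, y"]
expanded = preprocess(source)
print(expanded)  # ['irmovl 1020, r1', 'irmovl 7, r1']
```

Both output lines write r1, so the second overwrites the first. The preprocessor has no way to notice this; it only swaps text.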

With these simple aliases, the programmer still chooses where to put values, but now he or she gets to choose a better name. "irmovl 1020, x" is a lot better than the binary we started with, but it isn't "x = 1020" yet. (In the compilers class, CSE 401, you build a program that translates simple Java programs into executable x86 machine code.)

Storing Variables

The next challenge was to propose a solution if we ever wanted to refer to more than 7 or 8 variables or objects at a time. What do we do when we run out of the extremely limited number of available registers? (In contrast to x86/y86/z86, the MIPS architecture has 32 registers, which is a much better approximation of infinity; but we have to work with what we've got.) Each section realized we could store some or all of the variables in memory; after a couple of passes through the code, our helper program would know where there would be free space in memory to put them. Once it has found a place to store them, we could refer to each value with an offset from a base address sitting in one of the registers. Of course, doing this means we have to add code to move the values in and out between the registers and memory. But who's going to write that code? We saw that:

  1. We would have to write our helper program to insert those lines of code itself;
  2. As a consequence, we lose the 1-to-1 correspondence between the lines of our 'real' (assembly) program and the machine code we're generating from it;
  3. As we ask our compiler to do more, we have to give it more control (we just released control of which values to assign to which registers).
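
To make the three points above concrete, here is a small Python sketch of a helper that gives the first few variables a register and the rest a memory slot at an offset from a base register, then emits the extra store itself. Which registers are free, which one holds the base address, the scratch register, and the Y86-style "rmmovl" store mnemonic are all assumptions for illustration.

```python
REGISTERS = ["r1", "r2", "r3"]  # hypothetical: registers free for variables
SCRATCH = "r6"                  # hypothetical: temporary for memory traffic
BASE = "r7"                     # hypothetical: holds the base address
WORD = 4                        # bytes per variable


def place_variables(names):
    """Give early variables a register and the rest an offset from BASE."""
    homes = {}
    for index, name in enumerate(names):
        if index < len(REGISTERS):
            homes[name] = ("reg", REGISTERS[index])
        else:
            homes[name] = ("mem", (index - len(REGISTERS)) * WORD)
    return homes


def emit_set(homes, name, value):
    """Emit the assembly for 'name = value', wherever name happens to live."""
    kind, where = homes[name]
    if kind == "reg":
        return [f"irmovl {value}, {where}"]
    # Memory-resident variable: the helper inserts the store on its own.
    return [f"irmovl {value}, {SCRATCH}",
            f"rmmovl {SCRATCH}, {where}({BASE})"]


homes = place_variables(["a", "b", "c", "d", "e"])
print(emit_set(homes, "a", 1020))  # ['irmovl 1020, r1']
print(emit_set(homes, "e", 5))     # ['irmovl 5, r6', 'rmmovl r6, 4(r7)']
```

The same source statement becomes one instruction for "a" and two for "e", and the programmer no longer decides which variable gets which register.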

3. Stepping Back

Even though all these conveniences add some lines to the machine code and potentially slow it down, we usually don't care. For one thing, today's CPUs are fast enough that I/O (memory, disk, network, keyboard) is more likely than the instruction count to be the limiting factor in performance. I mentioned a few instances when we might still want to write or look at machine code (or at least assembly) by hand, though [other than programming the PDP-11 in lab 002]:

I briefly skimmed homework 2B, describing the 4 key files (memory.h [header file recap] and memory.c, and sim.h [#defines] and sim.c) and giving an overview of what students need to do to write a simulator: look at the memory array, read some bytes starting at address 0, look at the opcode to figure out what the instruction is, do whatever the instruction dictates, and move on the appropriate amount to the next instruction.
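
The shape of that loop can be sketched in Python (the homework itself is in C, and this is not its solution). Only two instructions are handled; the opcode numbers, the choice of 0x00 for halt, and the 8-register file are assumptions, not values taken from sim.h.

```python
HALT = 0x00    # hypothetical opcode number
IRMOVL = 0x30  # assumption: matches the bit string from section 1


def run(memory):
    """Fetch, decode, and execute instructions starting at address 0."""
    registers = [0] * 8
    pc = 0
    while pc < len(memory):
        opcode = memory[pc]  # fetch: the first byte is the opcode
        if opcode == HALT:
            break
        if opcode == IRMOVL:
            dest = memory[pc + 1] & 0xF
            if dest > 7:
                raise ValueError(f"bad register {dest} at address {pc}")
            registers[dest] = int.from_bytes(memory[pc + 2:pc + 6], "little")
            pc += 6  # move on by this instruction's length
        else:
            raise ValueError(f"unknown opcode 0x{opcode:02x} at address {pc}")
    return registers


memory = bytearray(16)
memory[0:6] = bytes.fromhex("30f1fc030000")  # irmovl 1020, r1
final_registers = run(memory)
print(final_registers[1])  # 1020
```

The amount added to pc depends on the instruction just decoded, which is why the opcode has to be examined before the simulator can know where the next instruction starts.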