Assembly Languages & Machine Code

On our path to building a computer, we need to take a quick detour from our hardware studies to learn more about assembly languages. An assembly language is a programming language that provides a human-readable interface to a computer's specification. We will learn about the characteristics of assembly languages, which will prepare us to learn the specification of our computer.

Machine Code and Assembly Languages

You may have heard the phrase “it's all 0s and 1s” when referring to what is happening at a low level in your computer. This is actually the case with the code you run as well. CPUs process code in steps known as instructions, which are binary sequences of 0s and 1s that tell the CPU what to do.

Assembly language is a human-readable form of those 0s and 1s. The important takeaway here is that every line of assembly code that you write translates, roughly, into one binary instruction that your CPU can execute. In other words, there is a nearly one-to-one mapping between assembly language instructions and binary machine code instructions.

Take the following example assembly instruction, which adds two values together:

add 1, 3

This assembly instruction might correspond to the following binary instruction:

0b1000000100010011

This may look like gibberish, but each part of the binary instruction corresponds to a part of the human-readable assembly instruction. One possible mapping could be (the interpretation will depend on the hardware):

family operation input1 input2
10 000001 0001 0011

For instance, here, 10 might indicate that this is an arithmetic operation, 000001 might correspond to the operation add, and 0001 and 0011 could be the binary values for 1 and 3, the operands!
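
To make the field layout concrete, here is a minimal Python sketch that splits the example instruction back into its parts. The 2/6/4/4-bit layout is the hypothetical mapping from the table above, and the `decode` function name is an assumption made for illustration; real hardware defines its own layout.

```python
# Hypothetical 16-bit layout from the example above:
# 2-bit family | 6-bit operation | 4-bit input1 | 4-bit input2
def decode(instruction: int) -> dict:
    """Split a 16-bit instruction into its four fields using shifts and masks."""
    return {
        "family": (instruction >> 14) & 0b11,
        "operation": (instruction >> 8) & 0b111111,
        "input1": (instruction >> 4) & 0b1111,
        "input2": instruction & 0b1111,
    }

fields = decode(0b1000000100010011)
print(fields)  # {'family': 2, 'operation': 1, 'input1': 1, 'input2': 3}
```

The two operand fields come back as 1 and 3, matching the operands of add 1, 3.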

It turns out that the earliest programmers coded in just 0s and 1s. Sounds like a headache! That's why assembly languages were created: they provide a more human-readable and friendly way of specifying binary machine code instructions.

Producing Machine Code

When writing in high-level languages like Java, you may have wondered what happens when you hit “compile” and how your computer understands the code you typed.

Many high-level languages are compiled, either directly or through intermediate steps, into an assembly language. The target assembly language will differ depending on what hardware the program is intended to run on. Compilation is a complicated process that we will learn more about later in this course!

So, how do we go from an assembly language to machine code? Assembly languages are translated to binary by an assembler. The assembler's job is much easier than a compiler's. Since each assembly language instruction has a corresponding machine code instruction, the process of assembling basically involves looking up the corresponding binary for the different parts of each assembly instruction.
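
As a rough sketch of this lookup process, the Python snippet below assembles the earlier add 1, 3 example. The `FAMILY` and `OPERATION` tables, the `assemble` function, and the 16-bit layout are all assumptions carried over from the hypothetical mapping shown earlier, not the encoding of any real instruction set.

```python
# Hypothetical lookup tables; a real instruction set defines its own codes.
FAMILY = {"add": 0b10}         # 10 = arithmetic family (assumed)
OPERATION = {"add": 0b000001}  # 000001 = add (assumed)

def assemble(line: str) -> int:
    """Translate one line such as 'add 1, 3' into a 16-bit instruction."""
    mnemonic, operands = line.strip().split(maxsplit=1)
    input1, input2 = (int(part) for part in operands.split(","))
    if not (0 <= input1 < 16 and 0 <= input2 < 16):
        raise ValueError("operands must fit in 4 bits")
    # Look up each part and pack the fields side by side.
    return (FAMILY[mnemonic] << 14) | (OPERATION[mnemonic] << 8) | (input1 << 4) | input2

word = assemble("add 1, 3")
print(format(word, "016b"))  # 1000000100010011
```

A real assembler also has to handle details such as labels and comments, but the core step is still this kind of table lookup.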

Instruction Sets

Every hardware system has an instruction set that details which instructions the CPU can execute. These instruction sets vary depending on the type of hardware you are working with (often referred to as the architecture). There are many different architectures out there. Examples include Intel's x86 architecture (prominent in many laptops, desktops, and servers), the ARM architecture (prominent in many mobile and “edge” devices), and the RISC-V architecture (a newer open-source architecture that is gaining steam).

In many ways, an architecture's instruction set can be viewed as a user interface for interacting with the hardware/CPU. That is, it specifies all of the different operations, operands (or inputs), and control logic that can be performed on a CPU. This is why we will learn about our computer's assembly language and instruction set before we build it—learning about the instruction set will inform us what logic we need to provide when implementing our CPU. We will dive deeper into the components of instruction sets below.

Components of Instruction Sets

Machine Operations

Instruction sets usually have a set of operations that can be specified by an instruction. Examples are arithmetic operations (+, -), logical operations (And, Or), and flow control operations (see below for more details). Different hardware architectures and their assembly languages offer different operations, as well as different data types that you can operate on. For example, our Hack architecture won't support operations like multiplication and division, but Intel's x86 architecture does.

Registers

Registers are temporary storage locations that are in the CPU. Because they are located in the CPU (as opposed to memory, which is located outside the CPU), registers are very fast to access, but we also don't have nearly as many registers as we have slots in memory. For example, our computer will have only two registers but around 16,000 slots in memory.

Assembly programs try to make use of registers as much as possible because they are so efficient. Since we have so few registers, we won't be able to store as much temporary data in them, but you'll see that our assembly language still heavily relies on their use.

Addressing Modes

Addressing modes are the ways in which you can specify operands or inputs in an assembly language program. Usually, assembly programs provide the following options:

Our Hack assembly language will provide all of this functionality, though it will look slightly different from the syntax you see above.
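
As a loose illustration of what an addressing mode is, the Python sketch below models four modes that assembly references commonly describe: immediate, register, direct, and indirect. The mode names, the `memory` and `registers` structures, and the `fetch_operand` helper are assumptions made for this example; they are not Hack syntax, and a given architecture may offer a different set.

```python
memory = [0] * 16
memory[5] = 9   # address 5 holds a pointer to address 9
memory[9] = 42  # address 9 holds the value 42
registers = {"reg1": 7}

def fetch_operand(mode: str, operand):
    """Return the value an operand refers to under the given addressing mode."""
    if mode == "immediate":  # the operand is the value itself
        return operand
    if mode == "register":   # the operand names a register
        return registers[operand]
    if mode == "direct":     # the operand is a memory address
        return memory[operand]
    if mode == "indirect":   # the operand is the address of an address
        return memory[memory[operand]]
    raise ValueError(f"unknown addressing mode: {mode}")

print(fetch_operand("immediate", 5))      # 5
print(fetch_operand("register", "reg1"))  # 7
print(fetch_operand("direct", 5))         # 9
print(fetch_operand("indirect", 5))       # 42
```

The same operand, 5, yields three different values depending on whether it is treated as a value, an address, or a pointer.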

Flow Control

Your program and its machine code instructions are stored in memory. After executing an instruction, the default behavior is for the CPU to move to the next instruction in memory (the one at the next higher address).

But often in our programs we don't want to execute the next instruction. Perhaps we want to go back to the beginning of a loop, or skip the else branch after executing the if branch.

In order to implement complicated control logic, machine instructions provide a way to “jump” to a specific instruction instead of executing the next one.

Unconditional jumps

Take the following pseudocode:

while (true) {
    reg1++;
}

Notice how every time we get to the end of the loop, we will want to jump back to the top of the loop. In order to provide the ability to always perform a jump, assembly languages provide unconditional jumps. This means the program can specify where to jump, and it will always jump there. For example, this assembly program adds 1 to reg1, then jumps back to TOP, where it will add 1 to reg1, and then jump again, and so on and so forth:

TOP:
    add 1, reg1
    jmp TOP
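
To see the jump in action, here is a small Python sketch that steps through the program above with a program counter. The tuple encoding of instructions, the `run` function, and the `max_steps` cutoff are assumptions added so the otherwise endless loop terminates; they are not part of any real assembly language.

```python
# Program from the example above, stored as (operation, argument) pairs.
# Address 0 is the instruction labeled TOP.
program = [
    ("add1", "reg1"),  # add 1, reg1
    ("jmp", 0),        # jmp TOP
]

def run(program, max_steps):
    """Execute up to max_steps instructions and return the final registers."""
    registers = {"reg1": 0}
    pc = 0  # program counter: address of the next instruction
    for _ in range(max_steps):
        operation, argument = program[pc]
        if operation == "add1":
            registers[argument] += 1
            pc += 1        # default: fall through to the next instruction
        elif operation == "jmp":
            pc = argument  # unconditional: always go to the target
    return registers

print(run(program, max_steps=10))  # {'reg1': 5}
```

Ten steps alternate between the add and the jump, so reg1 is incremented five times.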

Conditional jumps

Take the following pseudocode:

if (reg1 < reg2) {
    reg1++;
}
reg2++;

Notice how we want to execute the if branch only when reg1 is less than reg2. In order to provide the ability to perform a jump based on a condition, assembly languages provide conditional jumps. Usually this involves an assembly program comparing two values, and then jumping depending on the result of the comparison. For example, this assembly program compares reg1 to reg2, and skips the if branch if reg1 is greater than or equal to reg2:

    cmp reg1, reg2
    jge SKIP
    add 1, reg1
SKIP:
    add 1, reg2

Notice how the condition in the assembly code is flipped from the condition in the pseudocode. This is because in the pseudocode we are specifying when we want to enter the if branch, but in the assembly code we are specifying when we want to skip the if branch (which is the opposite action and requires the opposite condition). Since jumps provide us an easy way to skip portions of code, often we view logic in terms of when we want to skip.
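
The flipped condition can be checked with a short Python sketch that steps through the assembly program above. The tuple encoding, the `run_conditional` function, and the `difference` variable (standing in for whatever state the comparison leaves behind) are assumptions made for illustration.

```python
def run_conditional(reg1: int, reg2: int) -> tuple:
    """Mirror the assembly above: skip the increment of reg1 when reg1 >= reg2."""
    program = [
        ("cmp", None),    # 0: cmp reg1, reg2
        ("jge", 3),       # 1: jge SKIP
        ("inc", "reg1"),  # 2: add 1, reg1
        ("inc", "reg2"),  # 3: SKIP: add 1, reg2
    ]
    registers = {"reg1": reg1, "reg2": reg2}
    difference = 0  # result of the most recent comparison
    pc = 0
    while pc < len(program):
        operation, argument = program[pc]
        pc += 1  # default: advance to the next instruction
        if operation == "cmp":
            difference = registers["reg1"] - registers["reg2"]
        elif operation == "jge" and difference >= 0:
            pc = argument  # condition holds, so skip the if branch
        elif operation == "inc":
            registers[argument] += 1
    return registers["reg1"], registers["reg2"]

print(run_conditional(1, 5))  # (2, 6): reg1 < reg2, so both are incremented
print(run_conditional(5, 1))  # (5, 2): reg1 >= reg2, so only reg2 is incremented
```

Both calls produce the same results as the pseudocode, even though the jump tests the opposite condition.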

The Road Ahead

Now that we have looked at assembly languages in general, we will take a deep dive into the Hack assembly language we will be using. Learning the ins and outs of this language will set us up to be able to implement the hardware that it interacts with.