Compiler Overview: Scanning and Parsing

Now that we've built a computer, and spent some time learning about assembly languages, we are going to learn about compilers. A compiler is a program that translates from one language to another. This is especially useful in the context of high level languages like Java or C. Since our hardware doesn't understand these languages, we can use a compiler to translate the high level code into assembly or machine code that can be run on our hardware.

A Compiler's Goal

A compiler's goal is to translate from one programming language (we will call this the source language) to another language (we will call this the destination language). In reality, this could be between any two different programming languages, but often times compilers are used to generate executable machine code from a higher level programming language.

Specifically in this course, our compiler will translate from Jack, a high level language similar to Java, to our Hack assembly language. This will give us the capability to write programs with neat features like if-statements and loops that can be compiled, assembled, and run on the computer you built!

In order to translate between two languages, the compiler must first understand what the original program in the source language means, and then be able to translate that meaning into an equivalent program in the destination language. As we will see, this is no easy task!! By writing a compiler, you are essentially writing a computer program that understands other computer programs, something that may be simulataneously confusing and exciting.

The Parts of a Compiler

The goal for the next two readings is to give a high level overview of all of the different parts of the compiler. This will hopefully place all of the details we will talk about over the next few weeks in context of our larger goal.

The specific steps of the compiler that we will talk about in this course are:

Example: Arithmetic to Assembly

For the different steps of the compiler, we will use an example of attempting to compile an arithmetic language into equivalent Hack assembly. Our arithmetic language won't be formally defined for this example, but it generally consists of constants (e.g. 0 1 2 etc.), simple mathematical operations (e.g. + - * /), and variables (e.g. x or y, though we will ignore variable assignment for this example).

Specifically, we will attempt to convert the following arithmetic expression into Hack assembly:

x + 2 * (y + 4)

Scanning

Input: Raw text/code of source language

Output: Stream of program tokens

From our computer's perspective, the file containing the source language is just a stream of characters. The purpose of scanning is to identify important chunks or tokens within that stream of characters, and remove anything that doesn't contribute to the program's meaning.

Tokens are the building blocks of a language definition. They are each a standalone piece from which you can form different programs, and each programming language has its own definition of tokens. For our arithmetic language, a few examples (but not all of) our tokens might be:

Notice how each token conveys meaning within our program. The scanner will not produce tokens for things that don't convey semantic information about what our program does, such as whitespace and comments (both of which help humans read the code file but don't contain any behavioral information).

After scanning, the compiler will produce a list of tokens for the raw text file. For our example, this might look something like:

VAR(x) PLUS CONST(2) MULT LPAREN VAR(y) PLUS CONST(4) RPAREN

Notice how some tokens, like VAR and CONST, have the specific constant or variable used attached to them. This is because distinguishing between the variable x and the variable y is very important to the meaning of our program.

Parsing

Input: Stream of program tokens

Output: Abstract syntax tree (see below for definition)

Our stream of tokens is an important step into determining the meaning of our program, but it doesn't encode any ordering or grouping of tokens very well. For instance, how do we encode that multiplication should take precedence over addition?

In order to better encode the grouping and ordering of our tokens, most compilers make use of abstract syntax trees, or ASTs for short. An AST is a tree structure very similar to the binary trees you may have come across in earlier courses. In ASTs, there are nodes, which usually represent some feature of the language, and there are children of those nodes, which usually represent nested parts of the feature.

For example, in our AST, we would likely have a node for +, with two children, one for the left expression and one for the right expression. Note that the expressions are defined generally. They could be any other expression in our language, including another +! This allows us to recursively build an AST that can represent any program, which makes them very powerful tools.

Our example AST would likely look like the following:

PLUS Node:
left:   VAR(x)
right:  MULT NODE
        left:   CONST(2)
        right:  PLUS NODE
                left: VAR(y)
                right: CONST(2)

By executing the lowest nodes first and ending with the highest node, we can now see how our AST encodes our program's grouping/meaning better! For example, we can see that we execute 2 * (y + 4) before we add to x, and that we execute y + 4 before we multiply by 2.

You'll also notice that we lost some of the tokens from our token stream. Since our AST encodes meaning/groupings, some tokens become redundant. While the parentheses were super important for determining ordering in a token stream, they are not important in the AST (and are consequently omitted).

As stated earlier, ASTs are powerful structures. In lecture we will learn more about how to build them from a stream of tokens, as well as how to use them to do code analysis and generation.

The Road Ahead

In lecture we will go into more detail about the scanner and parser, and the following reading/lecture will talk about code analysis and code generation. This will build towards Project 7, where you will actually have a chance to implement the code generation portion of a compiler!