Compiler Overview: Code Analysis

In the previous reading and lecture we explored the first two parts of the compiler: scanning and parsing. Recall that at the end of the parsing stage, we were left with an abstract syntax tree (AST) which was a tree structure that encoded the meaning of our program. The next two phases of our compiler, code analysis and code generation, will make use of the powerful encoding the AST provides us. In this reading, we will focus on code analysis, and in lecture, we spend most of our time working on code generation.

Using the Recursive Structure of ASTs

Before we dig into the details, we want to introduce a pattern that will be useful for working with ASTs. ASTs are great because they provide a very general, recursive structure that allows us to describe any program written in our programming language. Because of the volume of possible AST structures, it can seem a bit daunting to approach problems that use them (such as code analysis and generation). The structure we will use is actually fairly simple and will look familiar to those who have worked with binary trees before.

Commonly we will have a task that we need to carry out across the entire tree. Examples include type checking each node, optimizing each node, or generating the code for each node. Due to our recursive and general definition of our AST, it becomes a headache to think about all of the possible combinations of child nodes that each node can have. Rather than try to handle each of these cases explicitly, we can make use of the recursive definition of the AST to allow the child nodes to do the work themselves. The parts of the general pattern are (in any order):

Do the task for the left child using recursion
Do the task for the right child using recursion
Do the task for the current node, usually making use of the result of the recursive calls to the children

Here's a pseudocode snippet that shows this pattern in action. In this example, we are trying to implement the function toString across our AST, which simply returns a string representation of each node and its children. For example, an Add node will return ADD(leftExpr, rightExpr), where leftExpr and rightExpr are the strings for the left and right sub-expressions of the Add node. Since leftExpr and rightExpr can be any expression in our language (constant, multiply, another add node, etc.), our Add's toString method will make use of recursion to handle the vast number of possibilities.

// Psuedocode for Add's toString

class Add {
    Exp left;
    Exp right;
    
    toString() {
        leftStr = left.toString()
        rightStr = right.toString()
        return "ADD(" + leftStr + ", " + rightStr + ")"
    }
}

The resulting code is relatively simple since it makes use of recursion to do the work for the left and right sub-expressions! This pattern is something we will heavily use in the following sections.

Code Analysis - Type Checking and Optimization

Input: Abstract syntax tree

Output: The same or equivalent abstract syntax tree

One of the most beneficial aspects of any intermediate representation, in our case an AST, is that it provides our compiler the ability to reason about our code. In an AST, we can really easily examine the expressions and statements that make up our program, and look at each of their individual parts. We will talk about two specific forms of analysis, type checking and optimization, but these aren't the only things that can be done at this stage of the compiler

Type Checking

Many languages have rules that go beyond just the syntax of the language. For example, in Java, I could write the code true + 1 and there is nothing wrong with the syntax of that expression - that is, the characters true + 1 are a valid sequence of characters for a Java program. What is wrong with the statement true + 1 is that it violates the type rules of the language. In Java, I cannot add a boolean and an int together based on the rules for types in Java. But if true + 1 is a valid sequence of characters, then we need to enforce type rules at a different step of the compilation process.

ASTs turn out to be powerful tools for type checking programs. We can really easily write type rules for each of the different expressions/statements in our program, and then do a pass over the entire tree to make sure these rules line up. In our above example, our addition AST node would likely have a set of combinations of types that are valid. In order to perform type checking, we can first type check the left sub-expression and get its resulting type, then type check the right sub-expression and get its resulting type, and finally double check that the left and right sub-expressions have compatible types for the node's type.

// Psuedocode for type checking Add

class Add {
    Exp left;
    Exp right;
    
    typeCheck() {
        // typeCheck will throw an error if types don't match
        // otherwise it returns the type of the node
        leftType = left.typeCheck()
        rightType = right.typeCheck()

        // Check for valid type combinations for add
        if (leftType == int && rightType == int) {
            return int
        } else if (leftType == int && rightType == double) {
            return double
        } else if... {
            // continue cases for valid types/returning resulting type
        }

        // if we reach this case then we have a type pair that isn't valid for add
        // for example, in Java int and boolean would fall through to this portion of code
        throw new TypeError("Incompatible types for Add")
    }
}

Optimization

One of the benefits of any language that uses a compiler is the ability to perform optimizations at compile time. Code is a communicative tool - that is, the reason we write code is to communicate to other human beings (and the computer) what our program does. There are a lot of things that we as humans write in code that is clear and concise but would result in inefficient code if it were to be directly translated and run on hardware. Sometimes we as humans can write our code differently so that it is more optimized, but this can also result in code that is hard to read and understand.

Compilers are a big help because they can perform optimizations during the compilation process. This means that humans can write the human-friendly code that is easier to read, and the compiler can help optimize it to be more efficient under the hood.

We won't go too far in-depth into all of the optimizations your compiler may do when you compile your code. The list is super long and very complicated, and people have dedicated their entire careers to discovering and implementing new optimizations. What we will do is give a brief example of a very simple optimization that will give you an idea of how our AST can help us perform these optimizations.

One example operation we might have would be to combine mathematical operations on constants into their own constant. Suppose we have the following constants in our code:

WIDTH = 8
HEIGHT = 5
LENGTH = 3

And we use those constants in the following calculation:

WIDTH * HEIGHT * LENGTH

If we were to generate assembly that directly matches the above expression, we would perform 2 multiply operations to get our result. This seems ok, but since WIDTH, HEIGHT, and LENGTH are all constants (their values won't change), why not just have our compiler compute the result ahead of time and replace the expression with a constant? A more general rule might be: if we have a mathematical operation node (such as Multiply, Add, Subtract, etc.) and the left sub-expression and right sub-expression are constants, the compiler will do the calculation and replace the mathematical operation node with a constant that is the result of the operation.

// Psuedocode for our Multiply operation

class Multiply {
    Exp left;
    Exp right;
    
    // Returns an updated, optimized node
    optimize() {
        if (left.isConstant() && right.isConstant()) {
            // If both the left and right are constants,
            // return a constant of the resulting value that replaces our Multiply node
            return Constant(left.getValue() * right.getValue())
        } else {
            // Otherwise, just return the same node
            return this
        }
    }
}

In our example, this would mean that the AST nodes corresponding to the calculation WIDTH * HEIGHT * LENGTH would be replaced with a single AST node, the constant 120. This means that our generated code now doesn't have to perform any multiplication operations - it would simply load the constant 120 instead.

This is a bit of a toy example in the sense that modern compilers might perform the optimization above but likely because of a set of much more complicated rules for optimizations. The key takeaway is that the AST allowed us to reason about our code in a way that opens up the possibility for optimizations. Our AST lets us check properties of different expressions and statements (in our example whether the left/right sub-expressions were constants) and take action by changing the AST (and thus our program) based on those properties.

The last thing I will say is that any changes that are made to our AST must result in an equivalent program from the user's viewpoint. This is a really important property of optimization, as it would be problematic if optimizations resulted in code that was visibly different to the user.

The Road Ahead

Most of our focus in lecture will be on code generation, the final step of our compiler and the focus of project 7! We will discuss specific implementation considerations related to our Hack compiler along with project 7 specific tips.