CSE 501: Implementation of Programming Languages
Course Project Information
We will be using the Soot framework, developed at McGill University.
Soot is a compiler analysis framework that takes Java bytecode as input and
produces optimized bytecode. An advantage of Soot is that it converts
bytecode to a nice, three-address intermediate form in a control flow graph,
which makes optimization much more convenient than using bytecode directly.
We modified the Soot distribution slightly to get it to work here at
UW. It has been tested under Linux and Solaris, but it seems to compile
faster under Linux. Since Soot requires JDK 1.2, under Solaris
you have to use the /uns javac: include /uns/lib/jdk1.2.2/bin/ in
your path before /usr/bin. To install Soot:
- Download the file soot-uw.tar.gz
- Run gunzip soot-uw.tar.gz
- Run tar -xvf soot-uw.tar. This will create a directory soot-1.2.0.
- Set your CLASSPATH variable to include the soot/classes and jasmin/classes
subdirectories of soot-1.2.0, as well as the current directory and the
Java runtime libraries at /cse/courses/cse501/01wi/rt. For
example, if $soot is the path to your soot-1.2.0 directory and you use
some kind of C shell, type
setenv CLASSPATH .:$soot/soot/classes:$soot/jasmin/classes:/cse/courses/cse501/01wi/rt
For Jonathan, $soot might be /homes/gws/jonal/501/soot-1.2.0.
- To test your installation, type java soot.Main. You should
see a list of options; otherwise, check to make sure your CLASSPATH includes
all the right directories. If you want to experiment more, check
out some of the online tutorials.
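For a quick end-to-end smoke test, you might also compile a tiny class and
run soot.Main on it; Hello below is just a hypothetical example, and any
small class will do:

    // Hypothetical smoke-test input; 6 * 7 is a constant expression an
    // optimizer could later fold.
    public class Hello {
        public static void main(String[] args) {
            int x = 6 * 7;
            System.out.println(x);
        }
    }

After compiling it with javac, running java soot.Main Hello should process
the class without errors (consult the printed option list for the
output-format flags in this Soot version).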
Important!
Soot already performs a number of common compiler analyses and transformations,
written on top of Jimple, a nice but simple intermediate representation.
For the project, you will design a more advanced intermediate representation
and analysis framework. Therefore, you shouldn't use any of the analyses
that already exist in Soot (or even base your design on them). Instead,
you should write the analyses appropriately for your own representation
and framework; ideally, they will work better or faster than the original
Soot versions because of the framework improvements you have made.
Some specific guidelines:
- Don't run any of the Soot optimizations in your turned-in
code, especially those you are implementing yourself. Soot
should transform bytecode to Jimple, and then invoke your code to transform
Jimple into your intermediate representation, do your optimizations, and
convert back to Jimple; Soot then converts back to bytecode (see the
sketch after this list).
- You may not use, or base your design on, any class
in soot.toolkits.scalar. Exception: you may use classes FlowSet,
BoundedFlowSet, ArrayPackedSet, ArraySparseSet, and FlowUniverse (these
are just nice set implementations).
- If you need to do dataflow analyses for the first
part of the project, you can use the dataflow analysis framework from Soot.
However, starting with the second part of the project, you will
be required to use your own analysis framework over your own IR.
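The intended division of labor might look like the skeleton below; every
name in it except JimpleBody is a hypothetical placeholder for code you
will write yourself:

    import soot.jimple.JimpleBody;

    // Hypothetical skeleton of the pipeline described above.
    public class MyIR {
        // Jimple -> your IR
        public static MyIR buildFromJimple(JimpleBody body) {
            return new MyIR();
        }
        // your analyses and transformations over the IR
        public void runOptimizations() {
        }
        // your IR -> Jimple; Soot then emits bytecode as usual
        public void writeBackToJimple(JimpleBody body) {
        }
    }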
General Project Information
See the handout from the first day of class for general information about
the project, including the list of optimizations your compiler should handle.
First part of the project (due Feb. 5th)
In the first part of the project, you will create your own intermediate
representation and provide transformations from Jimple to this intermediate
representation and back. You can reuse the control flow graph representation
and the three address code representation from Jimple. Thus, what you will
be doing is extending Jimple to create your own IR.
Your IR should include some sort of factored def-use graph, as
well as some representation for control dependences sufficient for reconstructing
the CFG. The CFG itself is acceptable as the control dependence graph representation,
although something like the CDG part of the PDG (program dependence graph)
or use of the VDG (value dependence graph) would be cooler. Here are some
examples of factored def-use graphs that you can choose from:
- CFG in SSA form + regular def-use chains. Simply putting the CFG
into SSA form, without any graph connecting defs and uses, is not
enough for this project. If you choose to use SSA, you really need to connect
defs and uses into a graph, so that you can then do dataflow analyses over
the SSA def-use graph. Reconstructing the CFG is trivial, since the CFG
is included in this representation; only phi nodes need to be converted
back into moves along incoming edges.
- CFG + factored def-use chains. These are def-use chains in which
you reduce the number of def-use links by inserting merge points (phi-nodes).
The difference between this and SSA is that in SSA you explicitly renumber
the variables whereas in factored def-use chains you don't have to. (Explicit
SSA numbering does make certain kinds of code motions easy, since different
definitions of the same original source variable can be reordered without
changing the data dependences of the resulting CFG.) Simple def-use chains,
without factoring, are not enough for this project, because the
number of links you have to store may be very large, and you won't be able
to exploit the single-assignment invariant to make analyses simpler. Your
intermediate representation should support the static-single-assignment
invariant in some way. Reconstructing the CFG is a no-op, since the CFG
is included in this representation, and there are no phi-nodes in the CFG
to undo, unless instructions have been reordered in such a way that the
original variable names can't be reused.
- The value dependence graph is a fancier version of factored def-use
chains, which also avoids the need for the CFG part altogether, by including
data dependences on a global memory value to represent side-effects, and
by including a gated form of phi node (a phi node with an extra input that
selects which of the other incoming values to pass through). The VDG does
have the problem that reconstructing the CFG afterwards is a non-trivial
step.
- Click's representation. You can look at the following paper
for details (C. Click. Global Code Motion/Global Value
Numbering. Proceedings of the SIGPLAN '95 Conference on Programming
Language Design and Implementation, June 1995). Click's
representation can be viewed as a less extreme form of the VDG that
makes reconstructing the CFG easier.
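Whichever variant you choose, the common core is a graph whose nodes are
definitions (including merge points) and whose edges link each use to the
definition that reaches it. One possible node structure, with all names
hypothetical and written against the JDK 1.2 collections (no generics):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical node in a factored def-use graph: ordinary definitions
    // and merge points (phi nodes) share one class, distinguished by isPhi.
    public class DefNode {
        String variable;       // source variable this node defines
        boolean isPhi;         // true for a merge (phi) node
        List operands = new ArrayList();   // definitions this node uses
        List uses = new ArrayList();       // nodes that use this definition

        public DefNode(String variable, boolean isPhi) {
            this.variable = variable;
            this.isPhi = isPhi;
        }

        // Record that this definition uses def, maintaining both
        // directions of the graph.
        public void addOperand(DefNode def) {
            operands.add(def);
            def.uses.add(this);
        }
    }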
You should turn in hardcopy source code listings for the classes you
add or change (with changes marked with a highlighter pen) and the first
part of your project write-up (2-3 pages). This part should explain your
IR design, why you designed it the way you did and any issues that came
up in your design, and how it fits into the existing Soot framework.
Second part of the project (due Feb. 21st)
In the second part of the project, you will create a dataflow analysis
framework that works over an arbitrary directed graph, and you will
use it for some simple optimizations. You should define an interface
for graphs, a lattice-theoretic interface for information at program
points (edges in the graph, or at least the sources of edges in the graph),
and an interface for the flow function of nodes of the graph. Using
this interface, you will implement an engine that computes the best
fixed-point solution to the analysis.
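One possible shape for these pieces, sketched in JDK 1.2-style Java; every
name is hypothetical, your design may differ, and for brevity the engine
below handles only forward analyses (a direction parameter, or running it
over the reversed graph, covers backward ones):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.Map;

    // Hypothetical analysis interfaces; facts are plain Objects so the
    // engine works for any lattice.
    interface Lattice {
        Object initial();                    // starting fact for every node
        Object merge(Object a, Object b);    // combine facts at a join
        boolean equal(Object a, Object b);   // convergence test
    }

    interface FlowGraph {
        Iterator nodes();
        Iterator predecessors(Object node);
        Iterator successors(Object node);
    }

    interface FlowFunction {
        Object flowThrough(Object node, Object in);   // transfer function
    }

    class WorklistEngine {
        // Iterate to a fixed point: out[n] = flow(n, merge of out[p]
        // over all predecessors p of n). Returns node -> out-fact.
        public Map solve(FlowGraph g, Lattice lat, FlowFunction f) {
            Map out = new HashMap();
            LinkedList work = new LinkedList();
            for (Iterator it = g.nodes(); it.hasNext(); ) {
                Object n = it.next();
                out.put(n, lat.initial());
                work.addLast(n);
            }
            while (!work.isEmpty()) {
                Object n = work.removeFirst();
                Object in = null;
                for (Iterator it = g.predecessors(n); it.hasNext(); ) {
                    Object p = out.get(it.next());
                    in = (in == null) ? p : lat.merge(in, p);
                }
                if (in == null) in = lat.initial();   // entry node
                Object res = f.flowThrough(n, in);
                if (!lat.equal(res, out.get(n))) {
                    out.put(n, res);
                    for (Iterator it = g.successors(n); it.hasNext(); )
                        work.addLast(it.next());
                }
            }
            return out;
        }
    }

Starting every node from an optimistic initial fact and merging toward the
conservative direction is what makes the result the best fixed point
rather than merely a safe one.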
For this part of the project, you will also implement two
optimizations using this engine: constant propagation and folding, and
dead assignment elimination. Both should be implemented using your
factored dataflow graph from part 1 of the project. (You will also
need to be able, in the 3rd part, to write analyses that operate over
the control flow graph, so the interface to your engine should be
general enough to accommodate both kinds of analyses.)
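As a concrete client, constant propagation might instantiate the engine
sketched above with facts that map variable names to known constant
values; the merge below is a hedged sketch with hypothetical names:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Hypothetical constant-propagation fact: a map from variable names
    // to constant values; a variable missing from the map is not a
    // compile-time constant.
    class ConstMap {
        Map constants;   // String -> Integer
        ConstMap(Map m) { constants = m; }

        // Merge at a join point: keep only bindings both sides agree on.
        ConstMap merge(ConstMap other) {
            Map m = new HashMap();
            for (Iterator it = constants.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry e = (Map.Entry) it.next();
                if (e.getValue().equals(other.constants.get(e.getKey())))
                    m.put(e.getKey(), e.getValue());
            }
            return new ConstMap(m);
        }
    }

The flow function would add a binding for x = constant, fold x = y op z
when both operands are bound, and otherwise drop the binding for any
variable the node assigns; dead assignment elimination is the analogous
backward liveness analysis run through the same engine.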
As before, you should turn in hardcopy source code listings for the
classes you add or change from the previous version (with changes
marked with a highlighter pen) and the second part of your project
write-up (2-3 additional pages). This part should explain your
analysis engine design, why you designed it the way you did and any
issues that came up in your design, and how your client optimizations
are implemented. You should also work up some test case examples, and
show how your optimizations work on your test cases.
Third part of the project (implementation due March 5th, write-up
due March 9)
Implement all other optimizations that are required for the project:
- peephole optimizations:
- constant folding (including branch conditions and eliminating
unreachable code) [should already be implemented in the second part of
the project]
- arithmetic simplifications: converting multiplies by constant powers
of two (>= 2) into shifts (see the sketch after this list)
- intraprocedural optimizations:
- constant propagation [should already be implemented in the second
part of the project]
- dead assignment elimination [should already be implemented in the
second part of the project]
- either:
- common subexpression elimination, or
- some kind of points-to analysis whose results are used in some
other part of the compiler, e.g. to support constant propagation
through the instance variables of objects, or
- loop-invariant code motion
If reasonable, the selected intraprocedural analysis should use the
dataflow analysis engine developed as part of the second part of the
project.
[Induction variable elimination was removed from the project
description, since it is not clear how one could use this optimization
in a bytecode-to-bytecode optimizer.]
- interprocedural optimizations:
- either:
- MOD analysis, identifying which variables (possibly including
instance variables, if your project includes pointer analysis) visible
in the caller might be modified by the callee(s). In Java, public
static class fields can be considered global variables visible in all
procedures.
- [NEW] class analysis or class hierarchy analysis, to identify
which instance method calls can only invoke a single target method.
The invokespecial bytecode can be used to represent a direct procedure
call in your output.
- inlining. Inlining applies to direct procedure calls. In Java,
static method calls are direct calls, as are calls to final methods or
methods of final classes, as are instance method calls replaced with
direct calls via class analysis or class hierarchy analysis.
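The multiply-to-shift peephole above reduces to a power-of-two test plus a
shift-amount computation; a minimal sketch (a hypothetical helper, to be
invoked from your peephole pass):

    // Hypothetical helper for the multiply-to-shift peephole.
    class Strength {
        // A constant c >= 2 is a power of two iff it has exactly one
        // set bit.
        static boolean isPowerOfTwo(int c) {
            return c >= 2 && (c & (c - 1)) == 0;
        }

        // The shift amount s with c == 2^s, so that x * c == x << s.
        static int log2(int c) {
            int s = 0;
            while (c > 1) { c >>= 1; s++; }
            return s;
        }
    }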
On Monday, March 5, you should turn in hardcopy source code listings
for the classes you added or changed from the previous versions (with
changes marked with a highlighter pen). On Friday, March 9, you
should turn in your final project write-up (~10 pages). Your write-up
should show how your compiler applies its analyses and transformations
to illustrative test cases. Reflect on your compiler development
experience, discussing the choices you made that were good ones, and
the ones that turned out to be bad ones.
You should arrange some time during the final week of the quarter to
demonstrate your compiler in action to either the instructor or one of
the TAs.
chambers@cs.washington.edu