CSE 501: Implementation of Programming Languages
Course Project Information
We will be using the Soot framework, developed at McGill University.
Soot is a compiler analysis framework that takes Java bytecode as input and
produces optimized bytecode. An advantage of Soot is that it converts
bytecode to a nice, three-address intermediate form in a control flow graph,
which makes optimization much more convenient than using bytecode directly.
We modified the Soot distribution slightly to get it to work here at
UW. It has been tested under Linux and Solaris, but it seems to compile
faster under Linux. Since Soot requires JDK 1.2, under Solaris
you have to use the /uns javac: include /uns/lib/jdk1.2.2/bin/ in
your path before /usr/bin. To install Soot:
- Download the file soot-uw.tar.gz
- Run gunzip soot-uw.tar.gz
- Run tar -xvf soot-uw.tar. This will create a directory soot-1.2.0.
- Set your CLASSPATH variable to include the soot/classes and jasmin/classes
subdirectories of soot-1.2.0, as well as the current directory and the
Java runtime libraries at /cse/courses/cse501/01wi/rt. For
example, if $soot is the path to your soot-1.2.0 directory and you use
some kind of C shell, type
setenv CLASSPATH .:$soot/soot/classes:$soot/jasmin/classes:/cse/courses/cse501/01wi/rt
For Jonathan, $soot might be /homes/gws/jonal/501/soot-1.2.0.
- To test your installation, type java soot.Main. You should
see a list of options; otherwise, check to make sure your CLASSPATH includes
all the right directories. If you want to experiment more, check
out some of the online tutorials.
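For a quick end-to-end smoke test, you might also compile a tiny class and
run soot.Main on it; Hello below is just a hypothetical example, and any
small class will do:

    // Hypothetical smoke-test input; 6 * 7 is a constant expression an
    // optimizer could later fold.
    public class Hello {
        public static void main(String[] args) {
            int x = 6 * 7;
            System.out.println(x);
        }
    }

After compiling it with javac, running java soot.Main Hello should process
the class without errors (consult the printed option list for the
output-format flags in this Soot version).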
Important!
Soot already performs a number of common compiler analyses and transformations,
written on top of Jimple, a nice but simple intermediate representation.
For the project, you will design a more advanced intermediate representation
and analysis framework. Therefore, you shouldn't use any of the analyses
that already exist in Soot (or even base your design on them). Instead,
you should write the analyses appropriately for your own representation
and framework; ideally, they will work better or faster than the original
Soot versions because of the framework improvements you have made.
Some specific guidelines:
- Don't run any of the Soot optimizations in your turned-in
code, especially those you are implementing yourself. Soot
should transform bytecode to Jimple, and then invoke your code to transform
Jimple into your intermediate representation, do your optimizations, and
convert back to Jimple; Soot then converts back to bytecode (see the
sketch after this list).
- You may not use, or base your design on, any class
in soot.toolkits.scalar. Exception: you may use classes FlowSet,
BoundedFlowSet, ArrayPackedSet, ArraySparseSet, and FlowUniverse (these
are just nice set implementations).
- If you need to do dataflow analyses for the first
part of the project, you can use the dataflow analysis framework from Soot.
However, starting with the second part of the project, you will
be required to use your own analysis framework over your own IR.
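The intended division of labor might look like the skeleton below; every
name in it except JimpleBody is a hypothetical placeholder for code you
will write yourself:

    import soot.jimple.JimpleBody;

    // Hypothetical skeleton of the pipeline described above.
    public class MyIR {
        // Jimple -> your IR
        public static MyIR buildFromJimple(JimpleBody body) {
            return new MyIR();
        }
        // your analyses and transformations over the IR
        public void runOptimizations() {
        }
        // your IR -> Jimple; Soot then emits bytecode as usual
        public void writeBackToJimple(JimpleBody body) {
        }
    }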
General Project Information
See the handout from the first day of class for general information about
the project, including the list of optimizations your compiler should handle.
First part of the project (due Feb. 5th)
In the first part of the project, you will create your own intermediate
representation and provide transformations from Jimple to this intermediate
representation and back. You can reuse the control flow graph representation
and the three address code representation from Jimple. Thus, what you will
be doing is extending Jimple to create your own IR.
Your IR should include some sort of factored def-use graph, as
well as some representation for control dependences sufficient for reconstructing
the CFG. The CFG itself is acceptable as the control dependence graph representation,
although something like the CDG part of the PDG (program dependence graph)
or use of the VDG (value dependence graph) would be cooler. Here are some
examples of factored def-use graphs that you can choose from:
- CFG in SSA form + regular def-use chains. Simply putting the CFG
into SSA form, without any graph connecting defs and uses, is not
enough for this project. If you choose to use SSA, you really need to connect
defs and uses into a graph, so that you can then do dataflow analyses over
the SSA def-use graph. Reconstructing the CFG is trivial, since the CFG
is included in this representation; only phi nodes need to be converted
back into moves along incoming edges.
- CFG + factored def-use chains. These are def-use chains in which
you reduce the number of def-use links by inserting merge points (phi-nodes).
The difference between this and SSA is that in SSA you explicitly renumber
the variables whereas in factored def-use chains you don't have to. (Explicit
SSA numbering does make certain kinds of code motions easy, since different
definitions of the same original source variable can be reordered without
changing the data dependences of the resulting CFG.) Simple def-use chains,
without factoring, are not enough for this project, because the
number of links you have to store may be very large, and you won't be able
to exploit the single-assignment invariant to make analyses simpler. Your
intermediate representation should support the static-single-assignment
invariant in some way. Reconstructing the CFG is a no-op, since the CFG
is included in this representation, and there are no phi-nodes in the CFG
to undo, unless instructions have been reordered in such a way that the
original variable names can't be reused.
- The value dependence graph is a fancier version of factored def-use
chains, which also avoids the need for the CFG part altogether, by including
data dependences on a global memory value to represent side-effects, and
by including a gated form of phi node (a phi node with an extra input that
selects which of the other incoming values to pass through). The VDG does
have the problem that reconstructing the CFG afterwards is a non-trivial
step.
- Click's representation. You can look at the following paper
for details (C. Click. Global Code Motion/Global Value
Numbering. Proceedings of the SIGPLAN '95 Conference on Programming
Language Design and Implementation, June 1995). Click's
representation can be viewed as a less extreme form of the VDG that
makes reconstructing the CFG easier.
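Whichever variant you choose, the common core is a graph whose nodes are
definitions (including merge points) and whose edges link each use to the
definition that reaches it. One possible node structure, with all names
hypothetical and written against the JDK 1.2 collections (no generics):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical node in a factored def-use graph: ordinary definitions
    // and merge points (phi nodes) share one class, distinguished by isPhi.
    public class DefNode {
        String variable;       // source variable this node defines
        boolean isPhi;         // true for a merge (phi) node
        List operands = new ArrayList();   // definitions this node uses
        List uses = new ArrayList();       // nodes that use this definition

        public DefNode(String variable, boolean isPhi) {
            this.variable = variable;
            this.isPhi = isPhi;
        }

        // Record that this definition uses def, maintaining both
        // directions of the graph.
        public void addOperand(DefNode def) {
            operands.add(def);
            def.uses.add(this);
        }
    }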
You should turn in hardcopy source code listings for the classes you
add or change (with changes marked with a highlighter pen) and the first
part of your project write-up (2-3 pages). This part should explain your
IR design, why you designed it the way you did and any issues that came
up in your design, and how it fits into the existing Soot framework.
Second part of the project (due Feb. 21st)
In the second part of the project, you will create a dataflow analysis
framework that works over an arbitrary directed graph, and you will
use it for some simple optimizations. You should define an interface
for graphs, a lattice-theoretic interface for information at program
points (edges in the graph, or at least the sources of edges in the graph),
and an interface for the flow function of nodes of the graph. Using
this interface, you will implement an engine that computes the best
fixed-point solution to the analysis.
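One possible shape for these pieces, sketched in JDK 1.2-style Java; every
name is hypothetical, your design may differ, and for brevity the engine
below handles only forward analyses (a direction parameter, or running it
over the reversed graph, covers backward ones):

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.Map;

    // Hypothetical analysis interfaces; facts are plain Objects so the
    // engine works for any lattice.
    interface Lattice {
        Object initial();                    // starting fact for every node
        Object merge(Object a, Object b);    // combine facts at a join
        boolean equal(Object a, Object b);   // convergence test
    }

    interface FlowGraph {
        Iterator nodes();
        Iterator predecessors(Object node);
        Iterator successors(Object node);
    }

    interface FlowFunction {
        Object flowThrough(Object node, Object in);   // transfer function
    }

    class WorklistEngine {
        // Iterate to a fixed point: out[n] = flow(n, merge of out[p]
        // over all predecessors p of n). Returns node -> out-fact.
        public Map solve(FlowGraph g, Lattice lat, FlowFunction f) {
            Map out = new HashMap();
            LinkedList work = new LinkedList();
            for (Iterator it = g.nodes(); it.hasNext(); ) {
                Object n = it.next();
                out.put(n, lat.initial());
                work.addLast(n);
            }
            while (!work.isEmpty()) {
                Object n = work.removeFirst();
                Object in = null;
                for (Iterator it = g.predecessors(n); it.hasNext(); ) {
                    Object p = out.get(it.next());
                    in = (in == null) ? p : lat.merge(in, p);
                }
                if (in == null) in = lat.initial();   // entry node
                Object res = f.flowThrough(n, in);
                if (!lat.equal(res, out.get(n))) {
                    out.put(n, res);
                    for (Iterator it = g.successors(n); it.hasNext(); )
                        work.addLast(it.next());
                }
            }
            return out;
        }
    }

Starting every node from an optimistic initial fact and merging toward the
conservative direction is what makes the result the best fixed point
rather than merely a safe one.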
For this part of the project, you will also implement two
optimizations using this engine: constant propagation and folding, and
dead assignment elimination. Both should be implemented using your
factored dataflow graph from part 1 of the project. (You will also
need to be able, in the 3rd part, to write analyses that operate over
the control flow graph, so the interface to your engine should be
general enough to accommodate both kinds of analyses.)
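As a concrete client, constant propagation might instantiate the engine
sketched above with facts that map variable names to known constant
values; the merge below is a hedged sketch with hypothetical names:

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    // Hypothetical constant-propagation fact: a map from variable names
    // to constant values; a variable missing from the map is not a
    // compile-time constant.
    class ConstMap {
        Map constants;   // String -> Integer
        ConstMap(Map m) { constants = m; }

        // Merge at a join point: keep only bindings both sides agree on.
        ConstMap merge(ConstMap other) {
            Map m = new HashMap();
            for (Iterator it = constants.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry e = (Map.Entry) it.next();
                if (e.getValue().equals(other.constants.get(e.getKey())))
                    m.put(e.getKey(), e.getValue());
            }
            return new ConstMap(m);
        }
    }

The flow function would add a binding for x = constant, fold x = y op z
when both operands are bound, and otherwise drop the binding for any
variable the node assigns; dead assignment elimination is the analogous
backward liveness analysis run through the same engine.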
As before, you should turn in hardcopy source code listings for the
classes you add or change from the previous version (with changes
marked with a highlighter pen) and the second part of your project
write-up (2-3 additional pages). This part should explain your
analysis engine design, why you designed it the way you did and any
issues that came up in your design, and how your client optimizations
are implemented. You should also work up some test case examples, and
show how your optimizations work on your test cases.
Third part of the project (implementation due March 5th, write-up
due March 9)
Implement all other optimizations that are required for the project:
- peephole optimizations:
- constant folding (including branch conditions and eliminating
unreachable code) [should already be implemented in the second part of
the project]
- arithmetic simplifications: converting multiplies by constant powers
of two (>= 2) into shifts (see the sketch after this list)
- intraprocedural optimizations:
- constant propagation [should already be implemented in the second
part of the project]
- dead assignment elimination [should already be implemented in the
second part of the project]
- either:
- common subexpression elimination, or
- some kind of points-to analysis whose results are used in some
other part of the compiler, e.g. to support constant propagation
through the instance variables of objects, or
- loop-invariant code motion
If reasonable, the selected intraprocedural analysis should use the
dataflow analysis engine developed as part of the second part of the
project.
[Induction variable elimination was removed from the project
description, since it is not clear how one could use this optimization
in a bytecode-to-bytecode optimizer.]
- interprocedural optimizations:
- either:
- MOD analysis, identifying which variables (possibly including
instance variables, if your project includes pointer analysis) visible
in the caller might be modified by the callee(s). In Java, public
static class fields can be considered global variables visible in all
procedures.
- [NEW] class analysis or class hierarchy analysis, to identify
which instance method calls can only invoke a single target method.
The invokespecial bytecode can be used to represent a direct procedure
call in your output.
- inlining. Inlining applies to direct procedure calls. In Java,
static method calls are direct calls, as are calls to final methods or
methods of final classes, as are instance method calls replaced with
direct calls via class analysis or class hierarchy analysis.
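The multiply-to-shift peephole above reduces to a power-of-two test plus a
shift-amount computation; a minimal sketch (a hypothetical helper, to be
invoked from your peephole pass):

    // Hypothetical helper for the multiply-to-shift peephole.
    class Strength {
        // A constant c >= 2 is a power of two iff it has exactly one
        // set bit.
        static boolean isPowerOfTwo(int c) {
            return c >= 2 && (c & (c - 1)) == 0;
        }

        // The shift amount s with c == 2^s, so that x * c == x << s.
        static int log2(int c) {
            int s = 0;
            while (c > 1) { c >>= 1; s++; }
            return s;
        }
    }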
On Monday, March 5, you should turn in hardcopy source code listings
for the classes you added or changed from the previous versions (with
changes marked with a highlighter pen). On Friday, March 9, you
should turn in your final project write-up (~10 pages). Your write-up
should show how your compiler applies its analyses and transformations
to illustrative test cases. Reflect on your compiler development
experience, discussing the choices you made that were good ones, and
the ones that turned out to be bad ones.
You should arrange some time during the final week of the quarter to
demonstrate your compiler in action to either the instructor or one of
the TAs.
chambers@cs.washington.edu