Abstract interpretation

Today:
 * components of an abstract interpretation
    * lattice
    * transfer functions
 * examples
Next time: more examples, and theoretical requirements

What is the purpose of abstract interpretation?  To approximate the program's computation; to estimate what the program might compute at run time; to learn something about all possible executions of the program, without executing the program infinitely many times.  This can be used to check or verify a property, for example.

We have already seen some examples of program analysis.  Now let's formalize how to define a program analysis.


Lattices

What is the purpose of lattices?  Why are we even talking about them in this class?  A lattice represents information -- more specifically, each point in the lattice represents a different estimate about a variable's run-time value.

A lattice is directly analogous to a type hierarchy and is written the same way.  For example:

            Animal
              |
          Vertebrate
           /      \
       Mammal    Reptile
       /    \
  Giraffe  Elephant

Here are three equivalent ways to view the lattice relationships, "<".
("<" is often written using the square subset operator, \sqsubset.)
 * Each lower point is-a instance of a higher point.  Every mammal is-a vertebrate.
 * Higher in the lattice, the set of possible values is larger.  The set of all vertebrates includes the set of all mammals.
 * Higher in the lattice, the properties of the elements (or constraints on which elements are in the set) are weaker; the properties are stronger lower in the lattice.  All mammals have 7 neck bones, but that is not true of all vertebrates.

Here is another example of a lattice:

    Top = { even, odd } = unknown
           /        \
        even        odd
           \        /
         bottom = {}

Top is used when the estimate includes all possible values.  Bottom is used when the estimate includes no possible values.  Bottom represents dead code, uninitialized variables, and infinite loops (such as the value of "f(7)" where "def f(x) = f(x)").  (I will be a bit sloppy with "even" versus "{even}", but ask if anything is not clear.)

A lattice consists of:
 * a domain, or a set of elements
 * the meaning of each, such as what run-time possibilities it represents
 * an ordering or hierarchy among the elements.  It can be expressed either as:
    * a less-than operation <
    * a lub operation
   Given one, the other is uniquely determined.

The less-than relation is not total/complete; for some elements e1 and e2, neither "e1 < e2" nor "e2 < e1" is true.  The lub/ordering is used at join points, when two threads of control meet.  We'll see examples shortly.

What would a lattice that contains positive, zero, negative, and bottom look like?

There are some theoretical properties that are required.  We will discuss these next time.  Here are some teasers for that discussion.
 * Any two points in the lattice should have a unique least upper bound.
   Exercise: Why?  What could go wrong if this was violated?
 * The lattice has no infinite ascending chains.
   Exercise: Why?  What could go wrong if this was violated?  Why are infinite descending chains permitted?
 * The least upper bound operator must be monotonic.  That is, if a <= b then f(a) <= f(b).  This simple definition can be extended for functions that take multiple arguments.
   Exercise: Why?  What could go wrong if this was violated?
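To make the lattice machinery concrete before moving on, here is a minimal sketch (in Python; the encoding and names are mine, not part of these notes) of the even/odd lattice: each abstract value is the set of parities a variable might have, the ordering is the subset relation, and lub is set union.

    # Hypothetical encoding of the even/odd lattice as Python sets.
    # Each abstract value is the set of parities a variable might have at run time.
    BOTTOM = frozenset()                    # {}         : no possible values
    EVEN   = frozenset({"even"})            # {even}
    ODD    = frozenset({"odd"})             # {odd}
    TOP    = frozenset({"even", "odd"})     # {even,odd} : unknown

    def leq(a, b):
        """The partial order a <= b: a is at most as general as b (subset)."""
        return a <= b                       # subset test on frozensets

    def lub(a, b):
        """Least upper bound: the smallest value at least as general as both."""
        return a | b                        # set union

    assert lub(EVEN, ODD) == TOP
    assert leq(BOTTOM, EVEN) and not leq(EVEN, ODD)

Defining the order as "subset" and lub as "union" makes the two views of the lattice (ordering vs. lub) interchangeable, as described above.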
Transfer functions

Suppose that you know that x is even and y is even.  What can you say about the expression x + y?  How do you know?

A transfer function represents an operation in the program, such as "+" or "*".  Here is the transfer function for the addition operation, in the even/odd lattice given above.

    +      | even  odd
    -------+-------------
    even   | even  odd
    odd    | odd   even

This covers only 2 of the 4 possible argument values, and 4 of the 16 possible argument combinations, though!  The full table needs to account for the top and bottom values:

    +      | even    odd     top     bottom
    -------+-------------------------------
    even   | even    odd     top     bottom
    odd    | odd     even    top     bottom
    top    | top     top     top     bottom
    bottom | bottom  bottom  bottom  bottom

Thinking of each input as a set of values can help you see why these are the right values for the transfer function.

    +          | {even}      {odd}       {even,odd}  {}
    -----------+----------------------------------------------
    {even}     | {even}      {odd}       {even,odd}  {}
    {odd}      | {odd}       {even}      {even,odd}  {}
    {even,odd} | {even,odd}  {even,odd}  {even,odd}  {}
    {}         | {}          {}          {}          {}

We have to define transfer functions for all the statements and operations in the language.  We will use a simple language with these constructs:

    x = 5
    x = y
    x = y + z
    if (e) then e else e
    goto
    join


Summary: the parts of an abstract interpretation

An abstract interpretation consists of:
 * lattice
 * transfer functions (which would be better named "abstract operations")

The lattice consists of:
 * points, together with the meaning of each (the set of run-time values that it represents or is an estimate for)
 * relationships between them, which can be expressed as
    * <= relationship
    * lub
[Give an example of a concrete lattice, defined in each way.]

If we define it as a lub, then the lub:
 * must have a unique value for every set of arguments
 * must be monotonic
What are the corresponding requirements on the partial order <=?  Can we express it without redefining lub?  Is this the reason that formalisms usually define the lattice in terms of the lub rather than the partial order <=?


More transfer functions

Recall this code:

    x = 0;
    y = read_even_value();
    x = y + 1;
    y = 2 * x;
    x = y - 2;
    y = x / 2;

When we try to evaluate it, we immediately get stuck on the first line.  We know how to apply the "+" operation to two abstract values, but what abstract value corresponds to the integer value 0?

An "abstraction function" maps from concrete values to abstract values.  For instance, it says that 0 is even, 1 is odd, and so forth.  It is used most frequently for manifest constants (that is, literals) in source code.  It can be viewed as a transfer function from constants to abstract values.

Here are some varieties of transfer functions:
 * programming language operations, such as "+" and "*"
 * constants (a transfer function of arity 0, since the value doesn't depend on any previous abstract value); that is, a mapping from the concrete domain to the abstract domain
 * summaries for library routines, such as read_even_value()


If statements, joins, and the least upper bound operation

So far, we have analyzed only straight-line code.  We need to handle if statements and loops.  Consider this code:

    w = 5;
    x = read();
    if (x is even)
      y = 5
      w = w + y
    else
      y = 10
      w = y
    z = y + 1
    x = 2 * w

When you run an if statement, either the then-clause or the else-clause is executed.  By contrast, static analysis executes all possible paths in the program.  More specifically, if the estimate for the boolean value of the if-condition is { true, false }, then the abstract interpretation uses the current value of the store to evaluate both the consequent and the alternative, producing a final store for each.  These two threads of control re-join at the end of the if-statement.

After an if statement, a variable might have a value that was assigned in the then-branch or in the else-branch.  The estimate after the if statement needs to be at least as general as in each of the branches.  We use the least upper bound operation (lub) for this.  Given two abstract values, lub returns a new value that encompasses both of them.  The lub is at least as general as each one.
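To make the re-join concrete, here is a minimal sketch (Python; hypothetical names, reusing the set encoding of even/odd from the earlier sketch) that joins the two branch stores pointwise with lub.  The values shown correspond to the example above.

    # Hypothetical sketch: joining the abstract stores from the two branches.
    # A store maps each variable name to an abstract value (a set of parities).
    EVEN, ODD = frozenset({"even"}), frozenset({"odd"})

    def lub(a, b):
        return a | b                        # lub on the even/odd lattice is set union

    def join_stores(store1, store2):
        """Pointwise lub of two abstract stores at a control-flow join.
        Assumes both stores mention the same variables."""
        return {v: lub(store1[v], store2[v]) for v in store1}

    then_store = {"y": ODD,  "w": EVEN}     # after the then-branch: y = 5; w = w + y
    else_store = {"y": EVEN, "w": EVEN}     # after the else-branch: y = 10; w = y
    after_if   = join_stores(then_store, else_store)
    # after_if == {"y": {even, odd} (top), "w": {even}}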
Loops and termination

A loop can be viewed as an if-statement plus a back-edge in the CFG.  This sounds problematic, because the analysis might not terminate: it tries to execute infinitely many different paths.

In practice:
 * Whenever a new abstract value flows along a control flow edge, re-execute the basic block that it leads to.  (This produces an abstract value for its output edge.)
 * If the output of a basic block is the same as it was the last time, then terminate the analysis along that path; that is, don't re-execute the basic block that it leads to.
(See the sidebar about why a fixed-point analysis is called that.)

Consider the following code:

    x = read_positive()
    y = read_positive()
    result = 0
    loop:
      result = result + y
      x = x - 1
      if x != 0 goto loop

What are the postconditions of the loop?  That is, what properties are true after the loop?  Let's analyze this code using several different abstract interpretations.

1. The "even, odd, unknown/any" abstraction.
   The postcondition for variables { result, x, y } is { any, even, any }.

2. Try the same abstraction, but analyze the loop starting from each of the 4 possibilities:
     r = even, x = even, y = even
     r = even, x = odd,  y = even
     r = even, x = even, y = odd
     r = even, x = odd,  y = odd
   The analysis output for each of the 4 cases is still { any, even, initial_value }.
   At run time, variable result is even for the first 3 cases and odd for the last case.  What analysis can reveal this fact?

3. Try a variant of the evenness analysis, where the abstraction is a set of triples (one abstract value for each of the variables).
   Exercise: Formalize the abstract interpretation (the lattice, lub, and transfer functions).

4. Try an analysis whose domain is symbolic expressions for the original code.
   The analysis would not terminate.

5. Try this domain: 0, (x1-x)*y, (x1-x+1)*y, bottom, ?
   where x1 means the initial value of x assigned by read_positive().
   This analysis gives a very precise result.  How could we have made up this domain, though?


Termination

There is a guarantee of termination!  The maximum number of times the analysis re-executes a loop is

    number_of_variables * height_of_lattice

Exercise: why?


Loops and non-termination; widening

Consider the lattice that contains expressions of the form x <= 0, x <= 1, x <= 2, etc.  Symbolically execute this loop:

    x = 0
    while !(x == answer) {
      x++
    }

For termination, it is required that there are no infinite ascending chains.  If a lattice has infinite ascending chains, or merely very long ones (like height 2^64 if 64-bit numbers are involved), then an abstract interpretation can handle this via a sound heuristic called "widening".  If a particular operation in the program is encountered too many times during an abstract interpretation, then the abstract interpretation guesses that it might be stuck in an infinite or very long loop.

(Note: infinite/long loops in the program being analyzed and infinite/long loops in the analysis are orthogonal.  Neither implies the other.  Exercise: Why?  Give some examples.)
[TODO: show all 4 examples.  An infinite/long analysis for a program that runs quickly can occur if the analysis is buggy (say, a transfer function or lub is not monotonic); is there any other possibility?]

When widening, the transfer function or lub operator intentionally returns a value larger than the most precise result.  This is intended to get to, or closer to, the fixed point.  For example, if a loop has been analyzed 10 times, then on the 11th iteration the analysis might change the estimate to top for any variable whose estimate changed between the 10th and 11th iterations.  This heuristic is sound.  (Exercise: Why?)
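The iterate-until-nothing-changes rule and the widening heuristic can be sketched as follows (Python; the function and parameter names are mine, and this is only a sketch of a single loop head, not a full analysis).

    def analyze_loop(entry, transfer, lub, widen=None, widen_after=10):
        """Abstractly execute a loop: repeatedly apply the loop body's transfer
        function and join the back-edge result into the loop-head estimate,
        until the estimate stops changing (a fixed point)."""
        at_head = entry
        passes = 0
        while True:
            out = transfer(at_head)             # effect of one trip through the body
            joined = lub(at_head, out)          # merge the back edge at the loop head
            if joined == at_head:
                return at_head                  # no change: stop re-executing this block
            passes += 1
            if widen is not None and passes >= widen_after:
                joined = widen(at_head, joined) # deliberately over-approximate
            at_head = joined

    # Example: even/odd analysis of "while (...) x = x + 1", starting with x even.
    EVEN, ODD, TOP = frozenset({"even"}), frozenset({"odd"}), frozenset({"even", "odd"})
    flip = {"even": "odd", "odd": "even"}
    result = analyze_loop(entry=EVEN,
                          transfer=lambda s: frozenset(flip[p] for p in s),  # x = x + 1
                          lub=lambda a, b: a | b)
    assert result == TOP    # at the loop head, x may be even or odd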
Monotonicity

Here are two functions with the same type signature, both of which are used during abstract interpretation:

    lub:         T x T -> T
    transfer[+]: T x T -> T

These are completely unrelated functions, used for different purposes.  Don't confuse them.

The lub function must be monotonic.  The transfer function is not required to be monotonic.  (Exercise: Give one that is not.  For the even/odd domain, an example is the +1 function.)

The sequence of estimates that is the input or output to a statement's or basic block's transfer function is monotonic.  (Recall that when we reasoned about termination, we thought of the transfer function as applying to the entire state at once.)  Exercise: Why?  How is this important in the claim of termination?

[Exercise for the class:
 Components of an Abstract Interpretation
  * Lattice (as defined above)
  * Transfer function (as defined above)
]


What is Top?

English is imprecise.  People sometimes speak of Top as representing "all information", and they sometimes speak of Top as representing "no information".  Both are right in their own way, but you should avoid ambiguous statements that can be misinterpreted.

Top represents *no* constraint on the possible values.  The set of values represented by Top includes *every* possible value.  Both are valid ways of thinking about Top, but each time you start to explain an analysis, choose one of them and stick with it.

Likewise, Bottom represents every possible constraint on the values -- so many constraints that they are unsatisfiable.  The set of values represented by Bottom is the empty set.


Constant propagation

Goal: for each variable, determine whether its run-time value can be computed at compile time.

              top
    ... -1  0  1  2 ...
             bottom

domain = map from var to constant

    constants_at_beginning(b) = and_{p in pred(b)} f_p(constants_at_beginning(p))

where f_p is the effect of block p on pairs in the set, and where "and" is pairwise meet.  (A small sketch of the pairwise meet appears below, after the "Available expressions" example.)

Here is an equivalent way to express it that some people might find simpler:

    constants_at_beginning(b) = and_{p in pred(b)} constants_at_end(p)
    constants_at_end(b)       = f_b(constants_at_beginning(b))

    w = 5;
    x = read();
    if (x is even)
      y = 5
      w = w + y
    else
      y = 10
      w = y
    z = y + 1
    x = 2 * w

What follows are more examples:

----------------

Aliasing

Must model the heap.

----------------

Copy propagation

Given a variable x, what other variables are equal to x?  If there are any, then x can be optimized away.  Analogous to constant propagation; can eliminate variables.

Domain: sets of equal variables.  Abstract values and the transfer function are no longer defined over single variables.

----------------

Available expressions

Eliminate redundant operations.
What is the abstract domain?  If arbitrary symbolic expressions, does it terminate?
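As promised in the constant-propagation section above, here is a minimal sketch (Python; the names and the top/bottom encoding are mine) of the pairwise meet on var -> constant maps, using the branches of the if-statement example as input.

    # Hypothetical sketch of the pairwise meet for constant propagation.
    # An abstract value is "top" (no information yet), a specific constant, or
    # "bottom" (not a constant: conflicting values reach this point).
    TOP, BOTTOM = "top", "bottom"

    def meet(a, b):
        """Meet of two abstract values in the constant-propagation lattice."""
        if a == TOP: return b
        if b == TOP: return a
        if a == b:   return a          # the same constant (or bottom) on both paths
        return BOTTOM                  # two different constants, or one is bottom

    def meet_stores(s1, s2):
        """Pairwise meet of two var -> constant maps, as used at block entries."""
        return {v: meet(s1[v], s2[v]) for v in s1}

    then_end = {"w": 10, "y": 5,  "x": BOTTOM}   # after "y = 5;  w = w + y"
    else_end = {"w": 10, "y": 10, "x": BOTTOM}   # after "y = 10; w = y"
    assert meet_stores(then_end, else_end) == {"w": 10, "y": BOTTOM, "x": BOTTOM}

Note that w is the constant 10 after the if, even though the two branches computed it differently.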
----------------

Live variables (backwards analysis) -- see below

Reuse of registers.

----------------

Reaching definitions

(The domain is a bit hairy, because we are tracking *which* definition rather than just what is live.  So, do available variables first, then convert it into reaching definitions.)

gen, kill

    in(v)  = union_{p in pred(v)} out(p)
    out(v) = gen(v) union (in(v) - kill(v))

----------------

Strictness (backwards analysis)

    x = read()
    y = x * 2;
    print x

----------------

If you want even more examples, see Texas 380C and Rice COMP512 lectures.

===========================================================================

Live variables

Exercise/example: Compute live variables.

A "dead variable" is one whose value will not be used again.  The dead variable's register can be re-used, or the value can be deallocated, etc.

A good way to devise an analysis is:
 * determine the answer manually
 * review your reasoning process: what information did you use, and what reasoning steps did you perform?  Formalize and automate your manual work.

Here is an example program to use as your test case:

    w =      // no uses
    x =
    y =
    z =
      = z
      = z
    y =
    z =
      = x
    ...
    print x
    print y
    print z

Choose an abstract domain and define transfer rules.
[The key idea is that it's a backwards analysis.]

----------------

Another way to think of the same analysis of live variables:

Key idea: the transfer function for "x=y" is (almost) s' - {x} + {y}, where s' is the set of live variables after "x=y"; before "x=y", y is live and x is not live.

More precisely:

    use(p) = may be used *before being defined*
    def(p) = must be defined (aka "kill"?)

    liveout(v) = union_{s in succ(v)} livein(s)
    livein(v)  = use(v) union (liveout(v) - def(v))

The transfer function for "x=y" is not quite as simple as s' - {x} + {y}:
 * because you might reuse a variable, as in x := x+1.  This is more likely when you are thinking of a transfer function for an entire basic block.
 * Are there other reasons?

    x = read();
    y = 10;
    z = x + 1;
    w = y * x;
    y = 5 * z;
    x = 12;

===========================================================================

May vs. must

 * may = true on some path
 * must = true on all paths

Just "turn the lattice upside down".  (Constant propagation is weird in that its lattice is symmetric.)

We can revisit every analysis.  TODO: Which of them are worth redoing?

Optimistic vs. pessimistic analysis; relation to may/must.

----------------

Procedure calls

How do you analyze a program that contains procedures?
 * inlining
 * summarizing
 * other (e.g., parameterization a la types)

----------------

Correctness vs. safety

Safety: the abstract values do approximate the possible run-time values.
[See page 42]

    Beta: I -> I' is safe if forall c, s :  Beta(C_I[[c]] s)  \sqsubseteq  C_{I'}[[c]] (Beta s)

where:
 * Beta is the abstraction function
 * each I is an interpretation such as "even/odd", "concrete domain", etc.
 * c is a command
 * s is a store
 * C_I[[c]] : store -> store is the interpretation of command c
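Here is a minimal sketch (Python; the encoding is mine) of what the safety condition says for "+" in the even/odd interpretation: abstracting a concrete result always lands at or below the abstract operation applied to the abstracted arguments.

    # Hypothetical check of the safety condition for "+" in the even/odd domain:
    #   Beta(c1 + c2)  \sqsubseteq  transfer_plus(Beta(c1), Beta(c2))
    EVEN, ODD = frozenset({"even"}), frozenset({"odd"})

    def beta(n):
        """Abstraction function: map a concrete integer to an abstract value."""
        return EVEN if n % 2 == 0 else ODD

    def transfer_plus(a, b):
        """Abstract "+" on the set encoding of even/odd (matches the table above)."""
        parity = {("even", "even"): "even", ("odd", "odd"): "even",
                  ("even", "odd"): "odd",  ("odd", "even"): "odd"}
        return frozenset(parity[(p, q)] for p in a for q in b)

    # Spot-check the condition on a few concrete values; <= on frozensets is subset,
    # which is the lattice order in this encoding.
    for c1 in range(-3, 4):
        for c2 in range(-3, 4):
            assert beta(c1 + c2) <= transfer_plus(beta(c1), beta(c2))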
----------------

[Skip this discussion of Scott domains.]

Scott domains (from denotational semantics) vs. abstract interpretation domains

Scott domains (example: factorial) (often used in compilers):
 * The lattice stands for approximations to a single function value.
 * Top = inconsistent values
 * Bottom = not yet calculated; nontermination; approximation to any value
 * Different lattice elements represent that single value; higher is more completely calculated.

Abstract interpretation domains:
 * The lattice stands for a set of concrete values.
 * Top = all possible values
 * Bottom = no possible value (dead code)
 * Higher is a larger set of concrete values (more possibilities).

Least fixed point is the best approximation.
 * Proof of *least* fixed point depends on monotonicity.
 * Guarantee of achieving the least fixed point:
    * monotonic: preserves order
       * join (set union) of accumulating semantics guarantees this
       * counterexample: +/- with a loop that decrements a counter
    * continuous: lub f(x) = f(lub x)
       * counterexample: x <= c
 * below may be infinite-height (give example)
 * for termination, need finite height, or widening
 * safety
 * GLB vs. LUB

===========================================================================

Transfer functions

We first introduced transfer functions and joins:
 * over a single variable, and a single statement that is implicitly in SSA form -- never use the same variable on the lhs and rhs.
 * then over a set of variables, but still with simple statements.
 * finally over potentially more complex variables, since the definitions become more complex, such as "use(p)" is "may be used *before being defined*".
[Example: copy propagation.]


Transfer functions and refinement

So far, we have viewed a transfer function, such as that for "+", as taking two arguments (each an abstract value) and producing one result (an abstract value).  Our language lets us write "+" only in the context of an assignment, so we have a transfer rule for "z = x + y".

Consider the problem of guaranteeing that a program suffers no null pointer exceptions.  How would you analyze this program?

    if (z != null) {
      z.f
    }

There is no way to do this with our current way of expressing transfer functions.  We need to make two extensions:
 * A transfer function can not only determine a new abstract value for the variable on the left-hand side of an assignment; it can also determine a new abstract value for any other variable, including both those that are passed as arguments and those that are not even mentioned.
 * A transfer function for a boolean-valued expression can have two different outputs (each a new abstract value).  If the expression is used in an if statement, then each output is used on a different branch; otherwise, they are lubbed and that single result is used.

Note that this can handle

    if (z != null) { z.f }

but not

    p = z != null;
    if (p) { z.f }

and not

    if (p)
      z = some-non-null-value
    if (p)
      z.f

How could you handle those?  Is it worth doing?

===========================================================================

Transfer functions

We have seen several ways to define a transfer function.  That is, our conception of a transfer function has evolved over time.
 * transfer function per operation, in terms of inputs and output
 * transfer function for assignment, that changes the abstract state of the lhs
 * transfer function for assignment, that changes the abstract store for the lhs
 * the abstract store is an estimate for each variable (and heap location)

So far, we have been able to think about each variable in isolation, and each transfer function is concerned only with the variables that appear in it -- maybe only those that appear on the left-hand side.  (Use of SSA has helped us to do this -- this is an advantage of SSA.)

We can think of the transfer function for the entire state as a single entity.  A transfer function takes as input an entire abstract state, and it produces an entire abstract state, which it may have modified arbitrarily.

Finally, a transfer function could have multiple possibilities.  As another example, the abstract state after "y = x.f" contains an updated value for y, but it can also reflect that x is known to be non-null.  (This is essentially a hidden test and branch in the code.)
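Here is a minimal sketch (Python; the representation and names are mine) of those two extensions for the test "z != null": the transfer function updates a variable that is not on any left-hand side, and it produces two output stores, one per branch.

    # Hypothetical sketch: a two-output transfer function for the test "z != null".
    # Abstract values are sets of possibilities; the store maps variables to them.
    NULL, NONNULL = frozenset({"null"}), frozenset({"nonnull"})
    TOP, BOTTOM   = NULL | NONNULL, frozenset()

    def transfer_z_not_null(store, z):
        """Return (store for the true branch, store for the false branch),
        refining the estimate for z on each side of the test."""
        true_store  = dict(store); true_store[z]  = store[z] & NONNULL
        false_store = dict(store); false_store[z] = store[z] & NULL
        return true_store, false_store

    before = {"z": TOP, "w": NONNULL}
    on_true, on_false = transfer_z_not_null(before, "z")
    # on_true["z"]  == NONNULL  -> dereferencing z.f in the then-branch is safe
    # on_false["z"] == NULL
    # If z were already NONNULL, on_false["z"] would be BOTTOM: that branch is dead.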
===========================================================================

Fixed points (and why is the analysis called a "fixed-point analysis"?)

Recall that a fixed point of a function f : T -> T is a value x : T such that f(x) = x.  For instance, 0 and 1 are fixed points of the sqrt function.

An abstract interpretation computes an abstract state at every program point.  In the simplest case, the abstract state of a program is a collection of values, one per variable.  (Sometimes, the abstract state is more complex, as in the aliasing analysis.)

The symbolic execution can be viewed as taking as input a collection of abstract states (one per program point) and producing as output a collection of abstract states (one per program point), by locally applying the transfer functions.  The result of the program analysis is the fixed point of that big composed function.  When the program analysis terminates, the current abstract state is the fixed point of the big composed function.

It is also possible to view this more locally, at a single program point.  Consider all paths from the program point back to the program point.  Imagine a transfer function that represents the effect of all those paths.  This transfer function has a fixed point, and the abstract interpretation computes that fixed point.

General idea: start out with bottom everywhere, and iterate.

"You can only go up, never go down, in the lattice":
 * A join changes an estimate to be more accurate.  The new value is always higher in the lattice than the old value.  It is an approximation of all values that can reach here, and you only learn about more values that can reach this location.
 * A transfer function must be monotonic: if a >= b, then f(a) >= f(b).  However, it is possible that f(a) < a.

===========================================================================

How sound/tight is the analysis?

(This is a bit of a digression.)

Your analysis gives an estimate of what can occur at run time.  This estimate may be precise, or it might be a wild overestimate.  The same estimate may be precise for one program and loose for another program.  There is no way to know a priori how accurate it will be.

One way to measure how accurate it is, is to write two analyses:
 * a sound, conservative analysis that upper-bounds what can happen at run time
 * an unsound, optimistic analysis that lower-bounds what can happen at run time (running a test suite is one example of a lower-bound analysis!)

You know that the answer is between the two bounds.  If the bounds are near one another, you know that the answer is near both of them: both estimates are close to the truth.  If the bounds are far apart, then at least one estimate is far from the truth, but you don't know which one or where the truth lies.  (As a rule of thumb, a dynamic analysis with a reasonable test suite gives a closer estimate of the truth than a sound analysis.)

===========================================================================

Transfer function results for Top and Bottom

When Top represents "arbitrary run-time value", then often but not always, an operation one of whose arguments is Top will yield Top.  When Bottom represents "no possible run-time value", then usually an operation one of whose arguments is Bottom will yield Bottom.  It would be correct, but less precise, to return a different value.
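A small sketch (Python; the encoding is mine) of the "often but not always" point: in the even/odd domain, doubling gives an even result even when its argument is Top, while Bottom still propagates.

    # Hypothetical sketch: Top in does not always force Top out.
    EVEN, ODD   = frozenset({"even"}), frozenset({"odd"})
    TOP, BOTTOM = EVEN | ODD, frozenset()

    def transfer_double(a):
        """Abstract transfer for "y = 2 * x": any integer doubled is even."""
        return BOTTOM if a == BOTTOM else EVEN   # bottom in, bottom out; otherwise even

    assert transfer_double(TOP) == EVEN       # an unknown argument still gives an even result
    assert transfer_double(BOTTOM) == BOTTOM  # no possible input value, no possible output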
===========================================================================

Tip: Make your abstract domain as simple as possible

The lattice for a nullness analysis could be

    Top = unknown = null or non-null
          /           \
        null        non-null
          \           /
    Bottom = no possible values = {}

But there is no need for "null" or "Bottom":
 * programmers don't write expressions with those types
 * the type system will issue a warning anytime a value is possibly null, so non-null vs. unknown is the only interesting distinction

===========================================================================
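A minimal sketch (Python; the names are mine) of how little machinery the trimmed-down domain needs: only non-null and unknown, with a lub and a couple of illustrative transfer functions.

    # Hypothetical two-point nullness domain, per the tip above: only "nonnull"
    # and "unknown" are worth distinguishing.
    NONNULL, UNKNOWN = "nonnull", "unknown"

    def lub(a, b):
        """Join of two nullness estimates: non-null only if both sides are."""
        return NONNULL if a == NONNULL and b == NONNULL else UNKNOWN

    def transfer_assign_new(store, x):
        """x = new ...   -- the result of an allocation is never null."""
        store = dict(store)
        store[x] = NONNULL
        return store

    def warn_if_maybe_null(store, x):
        """Issue a warning wherever a possibly-null value is dereferenced."""
        if store[x] != NONNULL:
            print(f"warning: {x} may be null here")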