Abstract interpretation

Today:
 * components of an abstract interpretation
    * lattice
    * transfer functions
 * examples
Next time: more examples, and theoretical requirements

What is the purpose of abstract interpretation?  To approximate the program's computation; to estimate what the program might compute at run time; to learn something about all possible executions of the program, without executing the program infinitely many times.  This can be used to check or verify a property, for example.

We have already seen some examples of program analysis.  Now let's formalize how to define a program analysis.


Lattices

What is the purpose of lattices?  Why are we even talking about them in this class?  A lattice represents information -- more specifically, each point in the lattice represents a different estimate about a variable's run-time value.

A lattice is directly analogous to a type hierarchy and is written the same way.  For example:

            Animal
              |
          Vertebrate
           /      \
       Mammal    Reptile
       /    \
  Giraffe  Elephant

Here are three equivalent ways to view the lattice relationships, "<".
("<" is often written using the square subset operator, \sqsubset.)
 * Each lower point is-a instance of a higher point.  Every mammal is-a vertebrate.
 * Higher in the lattice, the set of possible values is larger.  The set of all vertebrates includes the set of all mammals.
 * Higher in the lattice, the properties of the elements (or constraints on which elements are in the set) are weaker; the properties are stronger lower in the lattice.  All mammals have 7 neck bones, but that is not true of all vertebrates.

Here is another example of a lattice:

    Top = { even, odd } = unknown
           /        \
        even        odd
           \        /
         bottom = {}

Top is used when the estimate includes all possible values.  Bottom is used when the estimate includes no possible values.  Bottom represents dead code, uninitialized variables, and infinite loops (such as the value of "f(7)" where "def f(x) = f(x)").  (I will be a bit sloppy with "even" versus "{even}", but ask if anything is not clear.)

A lattice consists of:
 * a domain, or a set of elements
 * the meaning of each, such as what run-time possibilities it represents
 * an ordering or hierarchy among the elements.  It can be expressed either as:
    * a less-than operation <
    * a lub operation
   Given one, the other is uniquely determined.

The less-than relation is not total/complete; for some elements e1 and e2, neither "e1 < e2" nor "e2 < e1" is true.  The lub/ordering is used at join points, when two threads of control meet.  We'll see examples shortly.

What would a lattice that contains positive, zero, negative, and bottom look like?

There are some theoretical properties that are required.  We will discuss these next time.  Here are some teasers for that discussion.
 * Any two points in the lattice should have a unique least upper bound.
   Exercise: Why?  What could go wrong if this was violated?
 * The lattice has no infinite ascending chains.
   Exercise: Why?  What could go wrong if this was violated?  Why are infinite descending chains permitted?
 * The least upper bound operator must be monotonic.  That is, if a <= b then f(a) <= f(b).  This simple definition can be extended for functions that take multiple arguments.
   Exercise: Why?  What could go wrong if this was violated?
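To make the lattice machinery concrete before moving on, here is a minimal sketch (in Python; the encoding and names are mine, not part of these notes) of the even/odd lattice: each abstract value is the set of parities a variable might have, the ordering is the subset relation, and lub is set union.

    # Hypothetical encoding of the even/odd lattice as Python sets.
    # Each abstract value is the set of parities a variable might have at run time.
    BOTTOM = frozenset()                    # {}         : no possible values
    EVEN   = frozenset({"even"})            # {even}
    ODD    = frozenset({"odd"})             # {odd}
    TOP    = frozenset({"even", "odd"})     # {even,odd} : unknown

    def leq(a, b):
        """The partial order a <= b: a is at most as general as b (subset)."""
        return a <= b                       # subset test on frozensets

    def lub(a, b):
        """Least upper bound: the smallest value at least as general as both."""
        return a | b                        # set union

    assert lub(EVEN, ODD) == TOP
    assert leq(BOTTOM, EVEN) and not leq(EVEN, ODD)

Defining the order as "subset" and lub as "union" makes the two views of the lattice (ordering vs. lub) interchangeable, as described above.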
Transfer functions

Suppose that you know that x is even and y is even.  What can you say about the expression x + y?  How do you know?

A transfer function represents an operation in the program, such as "+" or "*".  Here is the transfer function for the addition operation, in the even/odd lattice given above.

    +      | even  odd
    -------+-------------
    even   | even  odd
    odd    | odd   even

This covers only 2 of the 4 possible argument values, and 4 of the 16 possible argument combinations, though!  The full table needs to account for the top and bottom values:

    +      | even    odd     top     bottom
    -------+-------------------------------
    even   | even    odd     top     bottom
    odd    | odd     even    top     bottom
    top    | top     top     top     bottom
    bottom | bottom  bottom  bottom  bottom

Thinking of each input as a set of values can help you see why these are the right values for the transfer function.

    +          | {even}      {odd}       {even,odd}  {}
    -----------+----------------------------------------------
    {even}     | {even}      {odd}       {even,odd}  {}
    {odd}      | {odd}       {even}      {even,odd}  {}
    {even,odd} | {even,odd}  {even,odd}  {even,odd}  {}
    {}         | {}          {}          {}          {}

We have to define transfer functions for all the statements and operations in the language.  We will use a simple language with these constructs:

    x = 5
    x = y
    x = y + z
    if (e) then e else e
    goto
    join


Summary: the parts of an abstract interpretation

An abstract interpretation consists of:
 * lattice
 * transfer functions (which would be better named "abstract operations")

The lattice consists of:
 * points, together with the meaning of each (the set of run-time values that it represents or is an estimate for)
 * relationships between them, which can be expressed as
    * <= relationship
    * lub
[Give an example of a concrete lattice, defined in each way.]

If we define it as a lub, then the lub:
 * must have a unique value for every set of arguments
 * must be monotonic
What are the corresponding requirements on the partial order <=?  Can we express it without redefining lub?  Is this the reason that formalisms usually define the lattice in terms of the lub rather than the partial order <=?


More transfer functions

Recall this code:

    x = 0;
    y = read_even_value();
    x = y + 1;
    y = 2 * x;
    x = y - 2;
    y = x / 2;

When we try to evaluate it, we immediately get stuck on the first line.  We know how to apply the "+" operation to two abstract values, but what abstract value corresponds to the integer value 0?

An "abstraction function" maps from concrete values to abstract values.  For instance, it says that 0 is even, 1 is odd, and so forth.  It is used most frequently for manifest constants (that is, literals) in source code.  It can be viewed as a transfer function from constants to abstract values.

Here are some varieties of transfer functions:
 * programming language operations, such as "+" and "*"
 * constants (a transfer function of arity 0, since the value doesn't depend on any previous abstract value); that is, a mapping from the concrete domain to the abstract domain
 * summaries for library routines, such as read_even_value()


If statements, joins, and the least upper bound operation

So far, we have analyzed only straight-line code.  We need to handle if statements and loops.  Consider this code:

    w = 5;
    x = read();
    if (x is even)
      y = 5
      w = w + y
    else
      y = 10
      w = y
    z = y + 1
    x = 2 * w

When you run an if statement, either the then-clause or the else-clause is executed.  By contrast, static analysis executes all possible paths in the program.  More specifically, if the estimate for the boolean value of the if-condition is { true, false }, then the abstract interpretation uses the current value of the store to evaluate both the consequent and the alternative, producing a final store for each.  These two threads of control re-join at the end of the if-statement.

After an if statement, a variable might have a value that was assigned in the then-branch or in the else-branch.  The estimate after the if statement needs to be at least as general as in each of the branches.  We use the least upper bound operation (lub) for this.  Given two abstract values, lub returns a new value that encompasses both of them.  The lub is at least as general as each one.
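To make the re-join concrete, here is a minimal sketch (Python; hypothetical names, reusing the set encoding of even/odd from the earlier sketch) that joins the two branch stores pointwise with lub.  The values shown correspond to the example above.

    # Hypothetical sketch: joining the abstract stores from the two branches.
    # A store maps each variable name to an abstract value (a set of parities).
    EVEN, ODD = frozenset({"even"}), frozenset({"odd"})

    def lub(a, b):
        return a | b                        # lub on the even/odd lattice is set union

    def join_stores(store1, store2):
        """Pointwise lub of two abstract stores at a control-flow join.
        Assumes both stores mention the same variables."""
        return {v: lub(store1[v], store2[v]) for v in store1}

    then_store = {"y": ODD,  "w": EVEN}     # after the then-branch: y = 5; w = w + y
    else_store = {"y": EVEN, "w": EVEN}     # after the else-branch: y = 10; w = y
    after_if   = join_stores(then_store, else_store)
    # after_if == {"y": {even, odd} (top), "w": {even}}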
Loops and termination

A loop can be viewed as an if-statement plus a back-edge in the CFG.  This sounds problematic, because the analysis might not terminate: it tries to execute infinitely many different paths.

In practice:
 * Whenever a new abstract value flows along a control flow edge, re-execute the basic block that it leads to.  (This produces an abstract value for its output edge.)
 * If the output of a basic block is the same as it was the last time, then terminate the analysis along that path; that is, don't re-execute the basic block that it leads to.
(See the sidebar about why a fixed-point analysis is called that.)

Consider the following code:

    x = read_positive()
    y = read_positive()
    result = 0
    loop:
      result = result + y
      x = x - 1
      if x != 0 goto loop

What are the postconditions of the loop?  That is, what properties are true after the loop?  Let's analyze this code using several different abstract interpretations.

1. The "even, odd, unknown/any" abstraction.
   The postcondition for variables { result, x, y } is { any, even, any }.

2. Try the same abstraction, but analyze the loop starting from each of the 4 possibilities:
     r = even, x = even, y = even
     r = even, x = odd,  y = even
     r = even, x = even, y = odd
     r = even, x = odd,  y = odd
   The analysis output for each of the 4 cases is still { any, even, initial_value }.
   At run time, variable result is even for the first 3 cases and odd for the last case.  What analysis can reveal this fact?

3. Try a variant of the evenness analysis, where the abstraction is a set of triples (one abstract value for each of the variables).
   Exercise: Formalize the abstract interpretation (the lattice, lub, and transfer functions).

4. Try an analysis whose domain is symbolic expressions for the original code.
   The analysis would not terminate.

5. Try this domain: 0, (x1-x)*y, (x1-x+1)*y, bottom, ?
   where x1 means the initial value of x assigned by read_positive().
   This analysis gives a very precise result.  How could we have made up this domain, though?


Termination

There is a guarantee of termination!  The maximum number of times the analysis re-executes a loop is

    number_of_variables * height_of_lattice

Exercise: why?


Loops and non-termination; widening

Consider the lattice that contains expressions of the form x <= 0, x <= 1, x <= 2, etc.  Symbolically execute this loop:

    x = 0
    while !(x == answer) {
      x++
    }

For termination, it is required that there are no infinite ascending chains.  If a lattice has infinite ascending chains, or merely very long ones (like height 2^64 if 64-bit numbers are involved), then an abstract interpretation can handle this via a sound heuristic called "widening".  If a particular operation in the program is encountered too many times during an abstract interpretation, then the abstract interpretation guesses that it might be stuck in an infinite or very long loop.

(Note: infinite/long loops in the program being analyzed and infinite/long loops in the analysis are orthogonal.  Neither implies the other.  Exercise: Why?  Give some examples.)
[TODO: show all 4 examples.  An infinite/long analysis for a program that runs quickly can occur if the analysis is buggy (say, a transfer function or lub is not monotonic); is there any other possibility?]

When widening, the transfer function or lub operator intentionally returns a value larger than the most precise result.  This is intended to get to, or closer to, the fixed point.  For example, if a loop has been analyzed 10 times, then on the 11th iteration the analysis might change the estimate to top for any variable whose estimate changed between the 10th and 11th iterations.  This heuristic is sound.  (Exercise: Why?)
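The iterate-until-nothing-changes rule and the widening heuristic can be sketched as follows (Python; the function and parameter names are mine, and this is only a sketch of a single loop head, not a full analysis).

    def analyze_loop(entry, transfer, lub, widen=None, widen_after=10):
        """Abstractly execute a loop: repeatedly apply the loop body's transfer
        function and join the back-edge result into the loop-head estimate,
        until the estimate stops changing (a fixed point)."""
        at_head = entry
        passes = 0
        while True:
            out = transfer(at_head)             # effect of one trip through the body
            joined = lub(at_head, out)          # merge the back edge at the loop head
            if joined == at_head:
                return at_head                  # no change: stop re-executing this block
            passes += 1
            if widen is not None and passes >= widen_after:
                joined = widen(at_head, joined) # deliberately over-approximate
            at_head = joined

    # Example: even/odd analysis of "while (...) x = x + 1", starting with x even.
    EVEN, ODD, TOP = frozenset({"even"}), frozenset({"odd"}), frozenset({"even", "odd"})
    flip = {"even": "odd", "odd": "even"}
    result = analyze_loop(entry=EVEN,
                          transfer=lambda s: frozenset(flip[p] for p in s),  # x = x + 1
                          lub=lambda a, b: a | b)
    assert result == TOP    # at the loop head, x may be even or odd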
Monotonicity

Here are two functions with the same type signature, both of which are used during abstract interpretation:

    lub:         T x T -> T
    transfer[+]: T x T -> T

These are completely unrelated functions, used for different purposes.  Don't confuse them.

The lub function must be monotonic.  The transfer function is not required to be monotonic.  (Exercise: Give one that is not.  For the even/odd domain, an example is the +1 function.)

The sequence of estimates that is the input or output to a statement's or basic block's transfer function is monotonic.  (Recall that when we reasoned about termination, we thought of the transfer function as applying to the entire state at once.)  Exercise: Why?  How is this important in the claim of termination?

[Exercise for the class:
 Components of an Abstract Interpretation
  * Lattice (as defined above)
  * Transfer function (as defined above)
]


What is Top?

English is imprecise.  People sometimes speak of Top as representing "all information", and they sometimes speak of Top as representing "no information".  Both are right in their own way, but you should avoid ambiguous statements that can be misinterpreted.

Top represents *no* constraint on the possible values.  The set of values represented by Top includes *every* possible value.  Both are valid ways of thinking about Top, but each time you start to explain an analysis, choose one of them and stick with it.

Likewise, Bottom represents every possible constraint on the values -- so many constraints that they are unsatisfiable.  The set of values represented by Bottom is the empty set.


Constant propagation

Goal: for each variable, determine whether its run-time value can be computed at compile time.

              top
    ... -1  0  1  2 ...
             bottom

domain = map from var to constant

    constants_at_beginning(b) = and_{p in pred(b)} f_p(constants_at_beginning(p))

where f_p is the effect of block p on pairs in the set, and where "and" is pairwise meet.  (A small sketch of the pairwise meet appears below, after the "Available expressions" example.)

Here is an equivalent way to express it that some people might find simpler:

    constants_at_beginning(b) = and_{p in pred(b)} constants_at_end(p)
    constants_at_end(b)       = f_b(constants_at_beginning(b))

    w = 5;
    x = read();
    if (x is even)
      y = 5
      w = w + y
    else
      y = 10
      w = y
    z = y + 1
    x = 2 * w

What follows are more examples:

----------------

Aliasing

Must model the heap.

----------------

Copy propagation

Given a variable x, what other variables are equal to x?  If there are any, then x can be optimized away.  Analogous to constant propagation; can eliminate variables.

Domain: sets of equal variables.  Abstract values and the transfer function are no longer defined over single variables.

----------------

Available expressions

Eliminate redundant operations.
What is the abstract domain?  If arbitrary symbolic expressions, does it terminate?
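As promised in the constant-propagation section above, here is a minimal sketch (Python; the names and the top/bottom encoding are mine) of the pairwise meet on var -> constant maps, using the branches of the if-statement example as input.

    # Hypothetical sketch of the pairwise meet for constant propagation.
    # An abstract value is "top" (no information yet), a specific constant, or
    # "bottom" (not a constant: conflicting values reach this point).
    TOP, BOTTOM = "top", "bottom"

    def meet(a, b):
        """Meet of two abstract values in the constant-propagation lattice."""
        if a == TOP: return b
        if b == TOP: return a
        if a == b:   return a          # the same constant (or bottom) on both paths
        return BOTTOM                  # two different constants, or one is bottom

    def meet_stores(s1, s2):
        """Pairwise meet of two var -> constant maps, as used at block entries."""
        return {v: meet(s1[v], s2[v]) for v in s1}

    then_end = {"w": 10, "y": 5,  "x": BOTTOM}   # after "y = 5;  w = w + y"
    else_end = {"w": 10, "y": 10, "x": BOTTOM}   # after "y = 10; w = y"
    assert meet_stores(then_end, else_end) == {"w": 10, "y": BOTTOM, "x": BOTTOM}

Note that w is the constant 10 after the if, even though the two branches computed it differently.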
----------------

Live variables (backwards analysis) -- see below

Reuse of registers.

----------------

Reaching definitions

(The domain is a bit hairy, because we are tracking *which* definition rather than just what is live.  So, do available variables first, then convert it into reaching definitions.)

gen, kill

    in(v)  = union_{p in pred(v)} out(p)
    out(v) = gen(v) union (in(v) - kill(v))

----------------

Strictness (backwards analysis)

    x = read()
    y = x * 2;
    print x

----------------

If you want even more examples, see Texas 380C and Rice COMP512 lectures.

===========================================================================

Live variables

Exercise/example: Compute live variables.

A "dead variable" is one whose value will not be used again.  The dead variable's register can be re-used, or the value can be deallocated, etc.

A good way to devise an analysis is:
 * determine the answer manually
 * review your reasoning process: what information did you use, and what reasoning steps did you perform?  Formalize and automate your manual work.

Here is an example program to use as your test case:

    w =      // no uses
    x =
    y =
    z =
      = z
      = z
    y =
    z =
      = x
    ...
    print x
    print y
    print z

Choose an abstract domain and define transfer rules.
[The key idea is that it's a backwards analysis.]

----------------

Another way to think of the same analysis of live variables:

Key idea: the transfer function for "x=y" is (almost) s' - {x} + {y}, where s' is the set of live variables after "x=y"; before "x=y", y is live and x is not live.

More precisely:

    use(p) = may be used *before being defined*
    def(p) = must be defined (aka "kill"?)

    liveout(v) = union_{s in succ(v)} livein(s)
    livein(v)  = use(v) union (liveout(v) - def(v))

The transfer function for "x=y" is not quite as simple as s' - {x} + {y}:
 * because you might reuse a variable, as in x := x+1.  This is more likely when you are thinking of a transfer function for an entire basic block.
 * Are there other reasons?

    x = read();
    y = 10;
    z = x + 1;
    w = y * x;
    y = 5 * z;
    x = 12;

===========================================================================

May vs. must

 * may = true on some path
 * must = true on all paths

Just "turn the lattice upside down".  (Constant propagation is weird in that its lattice is symmetric.)

We can revisit every analysis.  TODO: Which of them are worth redoing?

Optimistic vs. pessimistic analysis; relation to may/must.

----------------

Procedure calls

How do you analyze a program that contains procedures?
 * inlining
 * summarizing
 * other (e.g., parameterization a la types)

----------------

Correctness vs. safety

Safety: the abstract values do approximate the possible run-time values.
[See page 42]

    Beta: I -> I' is safe if forall c, s :  Beta(C_I[[c]] s)  \sqsubseteq  C_{I'}[[c]] (Beta s)

where:
 * Beta is the abstraction function
 * each I is an interpretation such as "even/odd", "concrete domain", etc.
 * c is a command
 * s is a store
 * C_I[[c]] : store -> store is the interpretation of command c
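Here is a minimal sketch (Python; the encoding is mine) of what the safety condition says for "+" in the even/odd interpretation: abstracting a concrete result always lands at or below the abstract operation applied to the abstracted arguments.

    # Hypothetical check of the safety condition for "+" in the even/odd domain:
    #   Beta(c1 + c2)  \sqsubseteq  transfer_plus(Beta(c1), Beta(c2))
    EVEN, ODD = frozenset({"even"}), frozenset({"odd"})

    def beta(n):
        """Abstraction function: map a concrete integer to an abstract value."""
        return EVEN if n % 2 == 0 else ODD

    def transfer_plus(a, b):
        """Abstract "+" on the set encoding of even/odd (matches the table above)."""
        parity = {("even", "even"): "even", ("odd", "odd"): "even",
                  ("even", "odd"): "odd",  ("odd", "even"): "odd"}
        return frozenset(parity[(p, q)] for p in a for q in b)

    # Spot-check the condition on a few concrete values; <= on frozensets is subset,
    # which is the lattice order in this encoding.
    for c1 in range(-3, 4):
        for c2 in range(-3, 4):
            assert beta(c1 + c2) <= transfer_plus(beta(c1), beta(c2))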
----------------

[Skip this discussion of Scott domains.]

Scott domains (from denotational semantics) vs. abstract interpretation domains

Scott domains (example: factorial) (often used in compilers):
 * The lattice stands for approximations to a single function value.
 * Top = inconsistent values
 * Bottom = not yet calculated; nontermination; approximation to any value
 * Different lattice elements represent that single value; higher is more completely calculated.

Abstract interpretation domains:
 * The lattice stands for a set of concrete values.
 * Top = all possible values
 * Bottom = no possible value (dead code)
 * Higher is a larger set of concrete values (more possibilities).

Least fixed point is the best approximation.
 * Proof of *least* fixed point depends on monotonicity.
 * Guarantee of achieving the least fixed point:
    * monotonic: preserves order
       * join (set union) of accumulating semantics guarantees this
       * counterexample: +/- with a loop that decrements a counter
    * continuous: lub f(x) = f(lub x)
       * counterexample: x <= c
 * below may be infinite-height (give example)
 * for termination, need finite height, or widening
 * safety
 * GLB vs. LUB

===========================================================================

Transfer functions

We first introduced transfer functions and joins:
 * over a single variable, and a single statement that is implicitly in SSA form -- never use the same variable on the lhs and rhs.
 * then over a set of variables, but still with simple statements.
 * finally over potentially more complex variables, since the definitions become more complex, such as "use(p)" is "may be used *before being defined*".
[Example: copy propagation.]


Transfer functions and refinement

So far, we have viewed a transfer function, such as that for "+", as taking two arguments (each an abstract value) and producing one result (an abstract value).  Our language lets us write "+" only in the context of an assignment, so we have a transfer rule for "z = x + y".

Consider the problem of guaranteeing that a program suffers no null pointer exceptions.  How would you analyze this program?

    if (z != null) {
      z.f
    }

There is no way to do this with our current way of expressing transfer functions.  We need to make two extensions:
 * A transfer function can not only determine a new abstract value for the variable on the left-hand side of an assignment; it can also determine a new abstract value for any other variable, including both those that are passed as arguments and those that are not even mentioned.
 * A transfer function for a boolean-valued expression can have two different outputs (each a new abstract value).  If the expression is used in an if statement, then each output is used on a different branch; otherwise, they are lubbed and that single result is used.

Note that this can handle

    if (z != null) { z.f }

but not

    p = z != null;
    if (p) { z.f }

and not

    if (p)
      z = some-non-null-value
    if (p)
      z.f

How could you handle those?  Is it worth doing?

===========================================================================

Transfer functions

We have seen several ways to define a transfer function.  That is, our conception of a transfer function has evolved over time.
 * transfer function per operation, in terms of inputs and output
 * transfer function for assignment, that changes the abstract state of the lhs
 * transfer function for assignment, that changes the abstract store for the lhs
 * the abstract store is an estimate for each variable (and heap location)

So far, we have been able to think about each variable in isolation, and each transfer function is concerned only with the variables that appear in it -- maybe only those that appear on the left-hand side.  (Use of SSA has helped us to do this -- this is an advantage of SSA.)

We can think of the transfer function for the entire state as a single entity.  A transfer function takes as input an entire abstract state, and it produces an entire abstract state, which it may have modified arbitrarily.

Finally, a transfer function could have multiple possibilities.  As another example, the abstract state after "y = x.f" contains an updated value for y, but it can also reflect that x is known to be non-null.  (This is essentially a hidden test and branch in the code.)
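Here is a minimal sketch (Python; the representation and names are mine) of those two extensions for the test "z != null": the transfer function updates a variable that is not on any left-hand side, and it produces two output stores, one per branch.

    # Hypothetical sketch: a two-output transfer function for the test "z != null".
    # Abstract values are sets of possibilities; the store maps variables to them.
    NULL, NONNULL = frozenset({"null"}), frozenset({"nonnull"})
    TOP, BOTTOM   = NULL | NONNULL, frozenset()

    def transfer_z_not_null(store, z):
        """Return (store for the true branch, store for the false branch),
        refining the estimate for z on each side of the test."""
        true_store  = dict(store); true_store[z]  = store[z] & NONNULL
        false_store = dict(store); false_store[z] = store[z] & NULL
        return true_store, false_store

    before = {"z": TOP, "w": NONNULL}
    on_true, on_false = transfer_z_not_null(before, "z")
    # on_true["z"]  == NONNULL  -> dereferencing z.f in the then-branch is safe
    # on_false["z"] == NULL
    # If z were already NONNULL, on_false["z"] would be BOTTOM: that branch is dead.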
===========================================================================

Fixed points (and why is the analysis called a "fixed-point analysis"?)

Recall that a fixed point of a function f : T -> T is a value x : T such that f(x) = x.  For instance, 0 and 1 are fixed points of the sqrt function.

An abstract interpretation computes an abstract state at every program point.  In the simplest case, the abstract state of a program is a collection of values, one per variable.  (Sometimes, the abstract state is more complex, as in the aliasing analysis.)

The symbolic execution can be viewed as taking as input a collection of abstract states (one per program point) and producing as output a collection of abstract states (one per program point), by locally applying the transfer functions.  The result of the program analysis is the fixed point of that big composed function.  When the program analysis terminates, the current abstract state is the fixed point of the big composed function.

It is also possible to view this more locally, at a single program point.  Consider all paths from the program point back to the program point.  Imagine a transfer function that represents the effect of all those paths.  This transfer function has a fixed point, and the abstract interpretation computes that fixed point.

General idea: start out with bottom everywhere, and iterate.

"You can only go up, never go down, in the lattice":
 * A join changes an estimate to be more accurate.  The new value is always higher in the lattice than the old value.  It is an approximation of all values that can reach here, and you only learn about more values that can reach this location.
 * A transfer function must be monotonic: if a >= b, then f(a) >= f(b).  However, it is possible that f(a) < a.

===========================================================================

How sound/tight is the analysis?

(This is a bit of a digression.)

Your analysis gives an estimate of what can occur at run time.  This estimate may be precise, or it might be a wild overestimate.  The same estimate may be precise for one program and loose for another program.  There is no way to know a priori how accurate it will be.

One way to measure how accurate it is, is to write two analyses:
 * a sound, conservative analysis that upper-bounds what can happen at run time
 * an unsound, optimistic analysis that lower-bounds what can happen at run time (running a test suite is one example of a lower-bound analysis!)

You know that the answer is between the two bounds.  If the bounds are near one another, you know that the answer is near both of them: both estimates are close to the truth.  If the bounds are far apart, then at least one estimate is far from the truth, but you don't know which one or where the truth lies.  (As a rule of thumb, a dynamic analysis with a reasonable test suite gives a closer estimate of the truth than a sound analysis.)

===========================================================================

Transfer function results for Top and Bottom

When Top represents "arbitrary run-time value", then often but not always, an operation one of whose arguments is Top will yield Top.  When Bottom represents "no possible run-time value", then usually an operation one of whose arguments is Bottom will yield Bottom.  It would be correct, but less precise, to return a different value.
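A small sketch (Python; the encoding is mine) of the "often but not always" point: in the even/odd domain, doubling gives an even result even when its argument is Top, while Bottom still propagates.

    # Hypothetical sketch: Top in does not always force Top out.
    EVEN, ODD   = frozenset({"even"}), frozenset({"odd"})
    TOP, BOTTOM = EVEN | ODD, frozenset()

    def transfer_double(a):
        """Abstract transfer for "y = 2 * x": any integer doubled is even."""
        return BOTTOM if a == BOTTOM else EVEN   # bottom in, bottom out; otherwise even

    assert transfer_double(TOP) == EVEN       # an unknown argument still gives an even result
    assert transfer_double(BOTTOM) == BOTTOM  # no possible input value, no possible output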
===========================================================================

Tip: Make your abstract domain as simple as possible

The lattice for a nullness analysis could be

    Top = unknown = null or non-null
          /           \
        null        non-null
          \           /
    Bottom = no possible values = {}

But there is no need for "null" or "Bottom":
 * programmers don't write expressions with those types
 * the type system will issue a warning anytime a value is possibly null, so non-null vs. unknown is the only interesting distinction

===========================================================================
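A minimal sketch (Python; the names are mine) of how little machinery the trimmed-down domain needs: only non-null and unknown, with a lub and a couple of illustrative transfer functions.

    # Hypothetical two-point nullness domain, per the tip above: only "nonnull"
    # and "unknown" are worth distinguishing.
    NONNULL, UNKNOWN = "nonnull", "unknown"

    def lub(a, b):
        """Join of two nullness estimates: non-null only if both sides are."""
        return NONNULL if a == NONNULL and b == NONNULL else UNKNOWN

    def transfer_assign_new(store, x):
        """x = new ...   -- the result of an allocation is never null."""
        store = dict(store)
        store[x] = NONNULL
        return store

    def warn_if_maybe_null(store, x):
        """Issue a warning wherever a possibly-null value is dereferenced."""
        if store[x] != NONNULL:
            print(f"warning: {x} may be null here")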