CSE584: Software Engineering

CSE503: Software Engineering
Lecture 21 (February 26, 1999)

David Notkin

Lackwit (and LCLint intro, if time)

Examples of approaches that check partial specifications and that exploit type information

Lackwit (O’Callahan & Jackson)

Code-oriented tool that exploits type inference
Answers queries about C programs

e.g., "locate all potential assignments to this field"

Accounts for aliasing, calls through function pointers, type casts
Efficient

e.g., answers queries about a Linux kernel (157KLOC) in under 10 minutes on a PC

Placement

Lexical tools are very general, but are often imprecise because they have no knowledge of the underlying programming language
Syntactic tools have some knowledge of the language, are harder to implement, but can give more precise answers
Semantic tools have deeper knowledge of the language, but generally don’t scale, don’t work on real languages and are hard to implement
The goal of Lackwit, thus, is to use some semantic basis, in a scalable way, on a real language (C)

It is a static tool

Can work on incomplete programs
Make assumptions about missing code, or supply stubs

Sample queries

Which integer variables contain file handles?

Can pointer foo in function bar be passed to free()? If so, what paths in the call graph are involved?

Field f of variable v has an incorrect value; where in the source might it have changed?

Which functions modify the cur_veh field of map_manager_global?

Lackwit analysis

Approximate (may return false positives)
Conservative (will not return false negatives) under some conditions

C’s type system has holes
Lackwit makes assumptions similar to those made by programmers (e.g., "no out-of-bounds memory accesses")
Lackwit is unsound only for programs that don’t satisfy these assumptions

Query commonalities

There are a huge number of names for storage locations

local and global variables; procedure parameters; for records, etc., the sub-components

Values flow from location to location, which can be associated with many different names

Archetypal query: Which other names identify locations to which a value could flow to or from a location with this given name?
Answers can be given textually or graphically

An example query

Query about the cur_veh field of map_manager_global
Shaded ovals are functions extracting fields from the global
Unshaded ovals pass pointers to the structure but don’t manipulate it
Edges between ovals are calls
Rectangles are globals
Edges to rectangles are variable accesses
Claim

This graph shows which functions would have to be checked when changing the invariants of the current vehicle object
Requires semantics, since many of the relationships are induced by aliasing over pointers

Underlying technique

Use type inference, allowing type information to be exploited to reduce information about values flowing to locations (and thus names)

But what to do in programming languages without rich type systems?

Trivial example

DollarAmt getSalary(EmployeeNum e)

Relatively standard declaration

Allows us to determine that there is no way for the value of e to flow to the result of the function

That is, e can (and surely would) affect what DollarAmt gets returned, but e itself (nor any computation that is based on e) will be able to flow to the return value

Because they have different types

Consider an alternative

int getSalary(int e)

Another, perhaps more common, way to declare the same function

This doesn’t allow the direct inference that e’s value doesn’t flow to the function return

Because they have the same type

But maybe one could analyze the program to determine this anyway; Lackwit does so by using a type inference mechanism for precision

Lackwit’s type system ignores the C type declarations

Computes new types in a richer type system

Incomplete type information

Cover example from the paper

void* return1st(void* x, void* y) {
return x; }

(a refb , b) ® f a refb

The type variable a indicates that the type of the contents of the pointer x is unconstrained

But it must be the same as the type of the contents of the return value

Increases the set of queries that Lackwit can answer with precision

Polymorphism

char* ptr1;
struct timeval* ptr2;
char** ptr3;
…
return1st(ptr1,ptr2);
return1st(ptr2,ptr3);

Both calls match the previous function declaration
This is solved (basically) by giving return1st a richer type and instantiating it at every call site

(c refb , d) ® d c refb

(e refa , f) ® c e refa

Type stuff

Modified form of Hindley-Milner algorithm "W"
Efforts made to handle
Mutable types
Recursive types
Null pointers
Uninitialized data
Type casts
Declaration order

Morphin case study

Robot control program of about 17KLOC
Vehicle object contains two queue objects
Client was investigating combining these two queues into one
Queried each queue object to discover operations performed and their contexts
The two graphs each contained 171 nodes
But each graph had only five nodes highlighted as "accessor" nodes
Example

These five matches helped identify code to be changed
grep would have returned false matches and missed matches when parameters were passed to functions
Context-sensitivity needed to distinguish the two queue objects because both are passed as arguments to the same queue functions

Recap

Helps find relationships among variables in a C program
Exploits type inference to understand values flowing to locations and thus names
Approximate, although safe under many (most?) conditions
Reasonably efficient
Although I didn’t show the numbers, they are now better than reported in the ICSE paper