Practice

Hands-on with ForAll: a queue library, a wrapper to verify, and a cliff where Z3 stops cooperating.

A queue library and its spec

The running example is a queue library you call but don't have source for. The header exposes five operations:

empty                    # the empty queue
push(q, x)               # add x at the back; returns the new queue
pop(q)                   # remove the front element; returns the new queue
peek(q)                  # read the front element
size(q)                  # current length

The spec from the docs translates into six axioms, organized as three pairs.

Counting.

$size (empty) = 0$
$\forall q, x . size (push (q, x)) = size (q) + 1$

FIFO peek. peek returns the oldest element. Base case followed by recursive case:

$\forall x . peek (push (empty, x)) = x$
$\forall q, x, y . peek (push (push (q, x), y)) = peek (push (q, x))$

FIFO pop. pop removes the oldest element. Same shape:

$\forall x . pop (push (empty, x)) = empty$
$\forall q, x, y . pop (push (push (q, x), y)) = push (pop (push (q, x)), y)$

The signature is uninterpreted. Q is a sort with no fixed meaning. push, pop, peek, and size are uninterpreted functions. The axioms above are the only facts Z3 has about any of them.

queue-spec.py

from z3 import (DeclareSort, Const, Function, IntSort, Int, Ints,
                ForAll, Solver, Not)

Q     = DeclareSort('Q')
empty = Const('empty', Q)
push  = Function('push', Q, IntSort(), Q)
pop   = Function('pop',  Q, Q)
peek  = Function('peek', Q, IntSort())
size  = Function('size', Q, IntSort())

q     = Const('q', Q)
x, y  = Ints('x_ y_')

queue_spec = [
    size(empty) == 0,
    ForAll([q, x],    size(push(q, x)) == size(q) + 1),
    ForAll([x],       peek(push(empty, x)) == x),
    ForAll([q, x, y], peek(push(push(q, x), y)) == peek(push(q, x))),
    ForAll([x],       pop(push(empty, x)) == empty),
    ForAll([q, x, y], pop(push(push(q, x), y)) == push(pop(push(q, x)), y)),
]
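Before trusting the axioms, it helps to test them against something concrete. The sketch below checks all six on a plain-Python model where a queue is a tuple with the front at index 0. The tuple model, the `m_` names, and the sample values are assumptions made for illustration, not the library's implementation. Passing on samples is evidence that the axioms describe a FIFO queue, not a proof.

```python
# A concrete FIFO model: a queue is a tuple, front at index 0.
# This model is an illustrative assumption, not the library's code.
EMPTY = ()

def m_push(q, x):
    return q + (x,)          # add x at the back

def m_pop(q):
    return q[1:]             # drop the front element

def m_peek(q):
    return q[0]              # read the front element

def m_size(q):
    return len(q)

def check_queue_axioms(queues, values):
    """Return True if all six axioms hold on every sampled instance."""
    ok = m_size(EMPTY) == 0
    for x in values:
        ok = ok and m_peek(m_push(EMPTY, x)) == x        # peek, base case
        ok = ok and m_pop(m_push(EMPTY, x)) == EMPTY     # pop, base case
    for q in queues:
        for x in values:
            qx = m_push(q, x)
            ok = ok and m_size(qx) == m_size(q) + 1      # counting
            for y in values:
                qxy = m_push(qx, y)
                ok = ok and m_peek(qxy) == m_peek(qx)            # peek, recursive
                ok = ok and m_pop(qxy) == m_push(m_pop(qx), y)   # pop, recursive
    return ok

sample_queues = [EMPTY, (1,), (1, 2), (3, 1, 2)]
sample_values = [0, 1, 7]
print(check_queue_axioms(sample_queues, sample_values))
```

If a check like this fails, the bug is in the axioms or in your reading of the docs, and no amount of solver time will fix it.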

The wrapper under review

You wrote a one-line wrapper that rotates the head element to the back of the queue:

def cycle(q):
    return push(pop(q), peek(q))

Take the front element with peek, drop it from the front with pop, push it onto the back. The size is unchanged. The front advances by one:

[a, b, c]  --cycle-->  [b, c, a]

Both observations follow by inspection. The next section confirms them with Z3, mechanically and for every choice of a, b, c.

The basic invariants

The two invariants from inspection: cycle preserves the queue's size, and the new front is the element that was second from the front. Each query below asserts the negation of one and asks Z3 for a counterexample:

cycle-basic.py

def cycle(q):
    return push(pop(q), peek(q))

a, b, c = Ints('a b c')
q3 = push(push(push(empty, a), b), c)

# Property 1: cycle preserves size.
s = Solver()
for ax in queue_spec:
    s.add(ax)
s.add(Not(size(cycle(q3)) == 3))
print(s.check())   # unsat

# Property 2: the new front is the second-pushed element.
s = Solver()
for ax in queue_spec:
    s.add(ax)
s.add(Not(peek(cycle(q3)) == b))
print(s.check())   # unsat

Both queries return unsat in 1 ms. Each proof combines two axiom pairs: counting with FIFO pop for the first, FIFO peek with FIFO pop for the second.

The size proof:

The size axiom fires on the outer push: size(cycle(q3)) = size(pop(q3)) + 1.
The recursive pop axiom fires twice on pop(q3), unwinding to the base case: pop(q3) = push(push(empty, b), c).
Two more size-axiom firings, together with size(empty) = 0, reduce that to size(pop(q3)) = 2.
Chain: size(cycle(q3)) = 2 + 1 = 3.

The peek proof:

The FIFO peek axiom reduces peek(q3) to a, so cycle(q3) = push(pop(q3), a).
With pop(q3) = push(push(empty, b), c) from above, this is push(push(push(empty, b), c), a).
The recursive peek axiom fires twice, stripping the two outer pushes, and the base case finishes the job: peek(cycle(q3)) = peek(push(empty, b)) = b.

Both invariants hold for every choice of a, b, c.

Cycle as a 3-rotation

Cycling a 3-element queue three times should bring it back to the original.

cycle-cubed.py

a, b, c = Ints('a b c')
q3 = push(push(push(empty, a), b), c)

s = Solver()
for ax in queue_spec:
    s.add(ax)
s.add(Not(cycle(cycle(cycle(q3))) == q3))
print(s.check())   # unsat

unsat in 18 ms. The recursive pop axiom fires many times (once per cycle, several times per nesting level), the FIFO peek axiom fires several times, and the result canonicalises to the original term.

The work hides inside s.check(). Z3 walked the term, matched each subterm against an axiom's pattern, instantiated, and chained the resulting equalities. This pattern-matching strategy has a name: E-matching.
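The walk-match-instantiate loop can be imitated in a few lines of plain Python. The sketch below is a simplification, not Z3's algorithm: it treats the pop and peek axioms as left-to-right rewrite rules over terms stored as nested tuples (a representation assumed here for illustration), whereas E-matching adds equalities to a term graph and never orients them. It is a standalone file, so its push, pop, peek, and cycle are plain Python functions that build tuples, not the Z3 declarations from queue-spec.py.

```python
# Terms are nested tuples: 'empty', ('push', q, x), ('pop', q), ('peek', q).
# Elements are plain strings such as 'a'. This representation is a
# hypothetical stand-in for Z3's term graph.

def push(q, x): return ('push', q, x)
def pop(q):     return ('pop', q)
def peek(q):    return ('peek', q)

def is_push(t):
    return isinstance(t, tuple) and t[0] == 'push'

def normalize(t):
    """Rewrite t bottom-up with the pop and peek axioms, left to right."""
    if not isinstance(t, tuple):
        return t                              # 'empty' or an element
    head, args = t[0], [normalize(a) for a in t[1:]]
    if head in ('pop', 'peek') and is_push(args[0]):
        _, inner, last = args[0]
        if inner == 'empty':                  # base cases
            return 'empty' if head == 'pop' else last
        if is_push(inner):                    # recursive cases
            if head == 'pop':
                return push(normalize(pop(inner)), last)
            return normalize(peek(inner))
    return (head, *args)                      # no axiom applies

def cycle(q):
    return push(pop(q), peek(q))

q3 = push(push(push('empty', 'a'), 'b'), 'c')
print(normalize(cycle(q3)))                       # the rotated queue as a term
print(normalize(cycle(cycle(cycle(q3)))) == q3)   # three rotations vs. the original
```

Every call to normalize that hits a base or recursive case corresponds to one axiom instance Z3 would assert. The difference is that Z3 keeps both sides of each equality around and lets congruence closure decide which terms end up equal.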

Without the spec

A natural question, asked without thinking: is cycle the identity? Drop the axioms entirely and ask Z3 directly.

cycle-no-spec.py

a, b, c = Ints('a b c')
q3 = push(push(push(empty, a), b), c)

s = Solver()
s.add(cycle(q3) == q3)
print(s.check())   # sat

sat. Z3 picked an interpretation of push, pop, peek where cycling does nothing. Maybe pop is the identity. Maybe push ignores its second argument. With no axioms, Q, push, pop, peek mean nothing, and uninterpreted is literal: Z3 is free to pick any functions that satisfy the constraint.

Add the queue spec back and re-run the same query, with the additional fact that the elements are distinct so the rotation is observable:

s = Solver()
for ax in queue_spec:
    s.add(ax)
s.add(a != b, b != c, a != c)
s.add(cycle(q3) == q3)
print(s.check())   # unsat

unsat in 6 ms. The spec rules out the trivial models. Now cycle must move elements around in a specific way, and on a queue with three distinct elements that movement is visible.

A correct answer to the wrong question. Same code, same query, two different results. The spec was the difference. Without it, Z3 answered a problem we did not mean to ask: "is there any interpretation of these symbols that makes my claim true?" The honest answer was yes, and it was useless. The solver answers the question its inputs encode.

When the spec drifts

You don't drop the spec. You import the wrong one.

The team next door also ships a stack library, with the same five names: empty, push, pop, peek, size. Their docs say:

$size (empty) = 0$
$\forall q, x . size (push (q, x)) = size (q) + 1$
$\forall q, x . peek (push (q, x)) = x$ (LIFO peek)
$\forall q, x . pop (push (q, x)) = q$ (LIFO pop)

Counting is the same. peek and pop look at the most recently pushed element instead of the oldest. Same operation names, opposite discipline.

Imagine your build system pulled in the stack header by mistake. Your cycle wrapper compiles. Z3, run against the stack spec, decides:

cycle-wrong-spec.py

stack_spec = [
    size(empty) == 0,
    ForAll([q, x], size(push(q, x)) == size(q) + 1),
    ForAll([q, x], peek(push(q, x)) == x),
    ForAll([q, x], pop(push(q, x)) == q),
]

a, b, c = Ints('a b c')
q3 = push(push(push(empty, a), b), c)

s = Solver()
for ax in stack_spec:
    s.add(ax)
s.add(Not(cycle(q3) == q3))
print(s.check())   # unsat -- "cycle is the identity"

unsat in 6 ms. Z3 has proved that cycle is the identity.

Walk through it. Under stack semantics, pop(push(push(push(empty,a),b),c)) = push(push(empty,a),b) (LIFO pop strips the top). And peek(push(push(push(empty,a),b),c)) = c (LIFO peek reads the top). So $cycle (q_{3}) = push (pop (q_{3}), peek (q_{3})) = push (push (push (empty, a), b), c) = q_{3} .$

The proof is honest. Under the stack spec, cycle really is the identity. The problem is that your library is a queue, and against a queue your cycle rotates.

Two specs, same code, opposite answers, both proofs valid. The verifier did not lie. The spec did. This is the failure mode the formal-methods community calls spec drift: a verified guarantee that no longer corresponds to what the system actually does. It is one of the few ways a verifier can ship a wrong answer with full confidence.
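The drift is easy to reproduce outside the solver. The sketch below puts two concrete models behind the same five names and runs the same wrapper against each. FifoModel and LifoModel are hypothetical classes written for this illustration, not either team's library, and cycle takes the library as a parameter only so both can be exercised in one file.

```python
# Two concrete models behind the same five names. Both classes are
# hypothetical illustrations, not either team's actual library.

class FifoModel:
    empty = ()
    @staticmethod
    def push(q, x): return q + (x,)     # add at the back
    @staticmethod
    def pop(q):     return q[1:]        # removes the oldest element
    @staticmethod
    def peek(q):    return q[0]         # reads the oldest element
    @staticmethod
    def size(q):    return len(q)

class LifoModel(FifoModel):
    @staticmethod
    def pop(q):     return q[:-1]       # removes the newest element
    @staticmethod
    def peek(q):    return q[-1]        # reads the newest element

def cycle(lib, q):
    """The wrapper under review, parameterized by the library it links to."""
    return lib.push(lib.pop(q), lib.peek(q))

q3 = ('a', 'b', 'c')
print(cycle(FifoModel, q3))   # rotation under queue semantics
print(cycle(LifoModel, q3))   # identity under stack semantics
```

The wrapper's source never changes. Only the meaning of the names it calls does, which is exactly what the two specs encode.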

Every proof on this page came back in milliseconds. Practice continues with more ForAll examples and closes at the cliff, where small changes to an axiom flip Z3 from instant to unknown.

Stating what you know

Z3's built-in theories don't know that your abs returns non-negative integers, or that your decoder undoes your encoder. Facts about your uninterpreted functions are what ForAll is for.

A non-negative function

01-positive-axiom.py

from z3 import Function, IntSort, Int, ForAll, Solver, Not

f = Function('f', IntSort(), IntSort())
x, y = Int('x'), Int('y')

s = Solver()
s.add(ForAll([x], f(x) >= 0))
s.add(Not(f(y) >= 0))
print(s.check())   # unsat

unsat. Z3 instantiates the axiom at x := y, derives f(y) >= 0, contradicts the negation.

Drop the axiom and Z3 picks freely:

s = Solver()
s.add(Not(f(y) >= 0))
print(s.check())   # sat
print(s.model())   # f(y) = -1 (or some other negative)

No axiom, no constraint. Uninterpreted is literal.

Injectivity from a round-trip spec

Suppose your encoder and decoder satisfy

$\forall x . decode (encode (x)) = x$

Does it follow that encode is injective, i.e., distinct inputs give distinct outputs? You can prove it from the spec alone.

02-encoder-injective.py

from z3 import Function, IntSort, Ints, ForAll, Solver

encode = Function('encode', IntSort(), IntSort())
decode = Function('decode', IntSort(), IntSort())
x, a, b = Ints('x a b')

s = Solver()
s.add(ForAll([x], decode(encode(x)) == x))
s.add(a != b)
s.add(encode(a) == encode(b))
print(s.check())   # unsat

unsat. Two distinct inputs cannot land at the same encoded output if decoding gets you back to where you started.

The proof Z3 found uses the axiom twice and EUF once. The axiom fires at x := a, giving decode(encode(a)) = a; and at x := b, giving decode(encode(b)) = b. From encode(a) == encode(b), EUF concludes decode(encode(a)) == decode(encode(b)). Substitute: a == b, contradicting a != b.

Two layers of reasoning:

A universal axiom, instantiated at the ground terms in the formula.
EUF closing the loop with congruence.

Specs that compose

A ForAll axiom is a re-usable rule. Multiple axioms can chain through a single query, and the solver combines them with the built-in arithmetic and equality reasoning you already have.

Clamp respects its upper bound

clamp is the bound-a-value-to-an-interval idiom found in graphics, audio, signal processing. You don't have source for it. The library gives you a behavioral spec:

$\forall v, lo, hi . clamp (v, lo, hi) = min (max (v, lo), hi)$

Plus the upper-bound spec for min:

$\forall x, y . min (x, y) \leq x$
$\forall x, y . min (x, y) \leq y$

max stays uninterpreted; we don't need its spec for this property. From the three axioms, prove that clamp never exceeds hi.

03-clamp-bound.py

from z3 import Function, IntSort, Ints, ForAll, Solver, Not

mmin  = Function('min',   IntSort(), IntSort(), IntSort())
mmax  = Function('max',   IntSort(), IntSort(), IntSort())
clamp = Function('clamp', IntSort(), IntSort(), IntSort(), IntSort())
x, y, v, lo, hi = Ints('x y v lo hi')

s = Solver()
s.add(ForAll([x, y], mmin(x, y) <= x))
s.add(ForAll([x, y], mmin(x, y) <= y))
s.add(ForAll([v, lo, hi],
             clamp(v, lo, hi) == mmin(mmax(v, lo), hi)))

s.add(Not(clamp(v, lo, hi) <= hi))
print(s.check())   # unsat

unsat. Two axioms cooperating. The clamp spec fires at (v, lo, hi), unfolding to min(max(v, lo), hi). That introduces a fresh min(...) ground term, which fires the second min axiom at (max(v, lo), hi). Substitute: clamp(v, lo, hi) <= hi.

The min axioms here are partial. They say min returns some value at most each argument; they don't pin which one. That's enough to prove clamp <= hi (the upper bound holds either way) but not enough for clamp >= lo. That property would need a stronger spec forcing min(x, y) to actually return one of its arguments, a lower-bound axiom for max (max(x, y) >= y), and the precondition lo <= hi.
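A counter-model makes the gap concrete. The sketch below defines plain-Python functions that satisfy both upper-bound axioms and the clamp definition, yet return a value below lo. The names weak_min and any_max, and the sample range, are hypothetical choices for this illustration.

```python
# A counter-model in plain Python: functions that satisfy the two
# upper-bound axioms and the clamp definition, yet break clamp >= lo.
# weak_min and any_max are hypothetical names chosen for this sketch.

def weak_min(x, y):
    return min(x, y) - 1      # still <= x and <= y, but returns neither

def any_max(x, y):
    return max(x, y)          # the spec leaves max unconstrained; this is one choice

def clamp(v, lo, hi):
    return weak_min(any_max(v, lo), hi)

samples = range(-3, 4)
axioms_hold = all(weak_min(x, y) <= x and weak_min(x, y) <= y
                  for x in samples for y in samples)
print(axioms_hold)            # the partial spec is satisfied on the samples
print(clamp(5, 5, 10))        # 4, which is below lo = 5
print(clamp(5, 5, 10) <= 10)  # the upper bound still holds
```

Any interpretation the axioms allow is one Z3 may pick, so a property that fails in weak_min's world cannot be proved from the partial spec.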

Min over a chain

A min over a chain of nested calls is at most any individual element. With the same upper-bound axioms:

mmin = Function('min', IntSort(), IntSort(), IntSort())
x, y, a, b, c = Ints('x y a b c')

s = Solver()
s.add(ForAll([x, y], mmin(x, y) <= x))
s.add(ForAll([x, y], mmin(x, y) <= y))

s.add(Not(mmin(mmin(a, b), c) <= a))
print(s.check())   # unsat

unsat. The first axiom fires twice. At (a, b), it gives min(a, b) <= a. At (min(a, b), c), it gives min(min(a, b), c) <= min(a, b). Z3 chains them via integer arithmetic: min(min(a, b), c) <= a.

Same axiom, two ground terms, two instantiations. Deeper nesting fires more times. This is what "ForAll" is doing under the hood even when the proof feels obvious: walking the formula's terms, matching each one against the axiom's pattern, asserting the instance.

The cliff

ForAll works until it doesn't. Two demos show where.

Three near-identical axioms

Predict before running. For each axiom below, will Z3 return a model?

03-forall-cliff.py

from z3 import Function, IntSort, Int, ForAll, Solver

x = Int('x')
f = Function('f', IntSort(), IntSort())

axioms = [
    ('f(x) > x',       ForAll([x], f(x) > x)),
    ('f(x) / 2 == x',  ForAll([x], f(x) / 2 == x)),
    ('f(x) == 2 * x',  ForAll([x], f(x) == 2 * x)),
]

for label, axiom in axioms:
    s = Solver()
    s.set('timeout', 5000)   # 5-second cap
    s.add(axiom)
    print(f'{label:20s}  {s.check()}')

All three describe f as relating each integer to something that depends on it. Run them.

f(x) > x              unknown
f(x) / 2 == x         unknown
f(x) == 2 * x         sat

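f(x) == 2 * x is a defining equation; Z3 builds the obvious model where f doubles its argument. The other two constrain f without defining it: f(x) > x is an inequality, and f(x) / 2 == x, under integer division, leaves f(x) a choice between 2 * x and 2 * x + 1. Both have models, but Z3 has to synthesize a function that works at every integer, and its heuristics for doing that in bounded time give up.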

f(x) > x and f(x) == 2 * x are syntactic neighbors, and Z3 treats them as if they live in different countries.
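unknown is not unsat. All three axioms have models, and you can check a candidate for each without a solver. The sketch below tests hand-picked interpretations of f on a sample range. The candidates and the range are assumptions of this sketch, and a finite sample is a sanity check, not a proof. Python's // is floor division, which agrees with Z3's integer division when the divisor is the positive constant 2.

```python
# Hand-picked candidate interpretations of f for each axiom.
# Each entry pairs a candidate f with the axiom body as a predicate
# over (f(x), x). Candidates and sample range are illustrative choices.
candidates = {
    'f(x) > x':      (lambda x: x + 1,     lambda fx, x: fx > x),
    'f(x) / 2 == x': (lambda x: 2 * x + 1, lambda fx, x: fx // 2 == x),
    'f(x) == 2 * x': (lambda x: 2 * x,     lambda fx, x: fx == 2 * x),
}

def holds_on_samples(f, axiom, samples):
    """True if the axiom body holds for f at every sampled x."""
    return all(axiom(f(x), x) for x in samples)

results = {label: holds_on_samples(f, axiom, range(-1000, 1001))
           for label, (f, axiom) in candidates.items()}
for label, ok in results.items():
    print(f'{label:20s}  {ok}')
```

The successor function and 2 * x + 1 are as simple as doubling. What differs is whether the axiom hands Z3 the function directly or makes the solver search for it.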

Same formula, different trigger

Now an axiom that is the same in both runs except for one annotation, with opposite results.

from z3 import Function, IntSort, Int, Ints, ForAll, Solver

f = Function('f', IntSort(), IntSort())
g = Function('g', IntSort(), IntSort())
a, b, c = Ints('a b c')
x = Int('x')

# Pattern f(g(x)).
s = Solver()
s.set(auto_config=False, mbqi=False)
s.add(ForAll(x, f(g(x)) == x, patterns=[f(g(x))]))
s.add(g(a) == c, g(b) == c, a != b)
print(s.check())   # unknown

# Pattern g(x).
s = Solver()
s.set(auto_config=False, mbqi=False)
s.add(ForAll(x, f(g(x)) == x, patterns=[g(x)]))
s.add(g(a) == c, g(b) == c, a != b)
print(s.check())   # unsat

The axiom says "f inverts g." Three ground facts say g(a) = c, g(b) = c, a ≠ b.

Apply f to both sides of g(a) = c. By the axiom, f(g(a)) = a, so a = f(c). By the same argument with b: b = f(c). So a = b, contradicting a ≠ b. The formula is unsatisfiable.

But with the trigger f(g(x)), Z3 returns unknown. Switch the trigger to g(x) and Z3 returns unsat in microseconds.

The trigger tells Z3 when to instantiate the axiom. Pattern f(g(x)) looks for ground terms shaped like f(g(_)) in the formula; there are none. Only g(a) and g(b) appear. The axiom never fires, the contradiction is never derived, Z3 gives up.

Pattern g(x) looks for ground terms shaped like g(_). Two of those: g(a) and g(b). The axiom fires at each, equality reasoning closes the loop.

Triggers are a little programming language for controlling when each axiom fires. Pick one poorly and a provable formula becomes "unknown."
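The difference between the two patterns comes down to which ground terms they match. The sketch below is a toy matcher over terms written as nested tuples, with '?' standing for the bound variable. The representation and function names are hypothetical, and the matcher is purely syntactic: real E-matching works modulo the equalities the solver already knows, which this sketch ignores.

```python
# Terms are nested tuples such as ('g', 'a'); '?' marks the pattern hole.
# This matcher is a hypothetical toy: it ignores equalities between
# terms, which real E-matching works modulo.

def subterms(t):
    """Yield t and every term nested inside it."""
    yield t
    if isinstance(t, tuple):
        for arg in t[1:]:
            yield from subterms(arg)

def match(pattern, term):
    """Return the term bound to '?' if term has the pattern's shape, else None."""
    if pattern == '?':
        return term
    if (isinstance(pattern, tuple) and isinstance(term, tuple)
            and pattern[0] == term[0] and len(pattern) == len(term)):
        return match(pattern[1], term[1])     # unary functions only
    return None

def instantiations(pattern, facts):
    """Collect the distinct bindings for x that the pattern finds."""
    found = []
    for fact in facts:
        for t in subterms(fact):
            binding = match(pattern, t)
            if binding is not None and binding not in found:
                found.append(binding)
    return found

# Ground facts: g(a) == c, g(b) == c, a != b.
facts = [('==', ('g', 'a'), 'c'), ('==', ('g', 'b'), 'c'), ('!=', 'a', 'b')]

print(instantiations(('f', ('g', '?')), facts))   # pattern f(g(x))
print(instantiations(('g', '?'), facts))          # pattern g(x)
```

The first pattern finds nothing to bind x to, so the axiom contributes no instances. The second binds x to a and to b, which yields the two instances the contradiction needs.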

Source

The queue and stack axioms are standard equational specifications of algebraic data types, the style introduced by Goguen, Thatcher, and Wagner ("Initial algebra semantics," 1977) and developed in functional-programming pedagogy by Bird and Wadler. The cycle wrapper and the spec-drift framing are original.

The trigger demos are based on examples from the Z3 quantifier instantiation literature, in particular the discussion of pattern selection in Michał Moskal, "Programming with triggers" (SMT 2009), and the worked examples in the Z3 programming guide on advanced quantifier handling.