Reading Reflection 3
Due: Friday, May 1 at 5:00 PM | 50 points
The previous reflection presented evidence from practitioners who used formal methods. This week we ask: even where formal methods work, what can they not do?
Theme
The theme for this reflection is "Honest Limits." The readings come from people who understand formal methods well enough to say where they break. One surveys the economic and cultural barriers that have kept adoption niche for decades. One shows that property-based testing, built on the same specification-writing skill, catches bugs at a fraction of the cost. One measures the reliability of the solver you have been using all quarter. Your job is to decide which limits matter for the work you do.
Some readings may be behind publisher paywalls. As a UW student, you have free access through the UW Libraries proxy. See the Resources page for setup instructions.
Options (pick one)
Option A: Wayne 2019, Why Don't People Use Formal Methods?
Wayne runs TLA+ and Alloy workshops for a living, yet argues that for most software the optimum lies well short of 100% correct. He splits the problem into four quadrants: code specification, code verification, design specification, and design verification. Code verification works, but the economics are brutal. Design verification is technically easier but culturally blocked: programmers do not believe that writing specifications helps. If you want the most systematic map of why formal methods have remained niche after fifty years, this is the option.
Option B: Hughes 2016, Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane
Hughes built QuickCheck, a property-based testing tool that uses the same core skill CARS teaches: writing a formal specification. But instead of checking it exhaustively with a solver, QuickCheck generates random test cases and shrinks failures to minimal counterexamples. Across three industrial case studies (Volvo automotive software, Klarna's database, a C circular buffer), the bugs were as likely to be in the specification as in the code. If writing the specification is the hard part and random testing gets you most of the way there, what is the marginal value of exhaustive checking?
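If you want a feel for the workflow before reading, here is a minimal sketch in Python using the Hypothesis library, a QuickCheck descendant (Hughes's own tool targets Erlang and Haskell). The dedupe function and its property are invented for illustration, not taken from the paper.

```python
# Property-based testing sketch with Hypothesis (pip install hypothesis).
# The function under test is deliberately buggy for illustration.
from hypothesis import given, strategies as st

def dedupe(xs):
    return list(set(xs))  # bug: discards the original order

@given(st.lists(st.integers()))
def test_dedupe_keeps_first_occurrence_order(xs):
    # The property is the specification: dedupe should keep the first
    # occurrence of each element in its original position.
    expected = []
    for x in xs:
        if x not in expected:
            expected.append(x)
    assert dedupe(xs) == expected
```

Run under pytest, Hypothesis generates random input lists, finds a failure, and shrinks it to a minimal counterexample (a short out-of-order list such as [1, 0]). The specification is the same artifact you would hand to a solver; only the checking strategy changes.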
Option C: Zhou 2024, The Butterfly Effect in SMT-Based Program Verification
Zhou measured something you may have already experienced: Z3 performance can be unreliable. Semantically irrelevant changes to a query, like renaming a variable or reordering assertions, cause 2.6% of verification queries to flip between success and failure. Two small commits to Z3, about 30 lines total, caused 67% of a major stability regression. The data comes from six real verified systems using the same solver you use in CARS. The solver you have been trusting all quarter is a heuristic search engine, not an oracle.
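To make "semantically irrelevant" concrete, here is a hypothetical sketch using Z3's Python bindings: two copies of the same query that differ only in variable names and assertion order. This toy query is far too easy to destabilize; Zhou's flips show up in large queries running near the solver's time and memory limits.

```python
# Sketch of the perturbations Zhou measures: renaming variables and
# reordering assertions, which preserve meaning but can change Z3's
# internal search order on hard queries.
from z3 import Int, Solver

def check(name_a, name_b, reorder=False):
    a, b = Int(name_a), Int(name_b)
    constraints = [a + b == 10, a - b == 4, a > 0]
    if reorder:
        constraints.reverse()
    s = Solver()
    s.add(*constraints)
    return s.check()

print(check("x", "y"))                # sat
print(check("p", "q", reorder=True))  # sat: the same query up to renaming
```

On a query this small both calls agree; the unsettling finding is that on 2.6% of real verification queries they would not.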
Enrichment
These are optional readings that go deeper into the limits identified by the three options above. You do not need to read any of them, but if one of the options left you wanting more, start here.
Schneier 1999, Security in the Real World: How to Evaluate Security
Schneier catalogs twenty categories of failure in deployed security products, from bad randomization to user error to poor key management. The core cryptographic algorithms are almost never the problem. The failures live in the gap between verified components and the unverified systems surrounding them. If the required options left you thinking about what verification cannot reach, Schneier provides a concrete inventory from the security domain.
Feist et al. 2024, Why Fuzzing over Formal Verification?
Trail of Bits sells fuzzing services and is upfront about it. Two high-profile bugs found by formal verification were reproduced by its fuzzer in under a minute. Its claim: writing good invariants is 80% of the work, and fuzzing is the easiest way to check them. If you read Hughes and wanted to see the same argument from a security auditor's perspective, this is the piece.
Cebeci et al. 2025, A Conjecture Regarding SMT Instability
Cebeci and Z3 co-creator Nikolaj Bjørner looked at eleven unstable queries and traced every one to a concrete root cause: six were actual Z3 bugs (three fixes already merged), two were misconfiguration, and three were misaligned expectations about how Z3 handles quantifiers. Their conjecture: most instability is fixable engineering, not a fundamental limitation. If you read Zhou and came away worried, Cebeci offers reasons for measured optimism.
Fonseca et al. 2017, An Empirical Study on the Correctness of Formally Verified Distributed Systems
Fonseca tested three verified distributed systems and found zero protocol bugs. That is the good news. The bad news: sixteen bugs in total, every one in the trusted computing base, namely the shim layer connecting verified code to unverified infrastructure, the specification itself, and the verification toolchain. One bug caused Z3 crashes to be silently swallowed, producing binaries that were reported as verified but were not. If you want the clearest empirical evidence of both what verification delivers and where it stops, this is the piece.
Mickens 2014, This World of Ours
Mickens divides threat models into two buckets: either your adversary is a state-level actor, in which case nothing will save you, or they are not, in which case basic hygiene suffices. The elaborate middle ground in academic security papers rarely corresponds to reality. Four pages of absurdist comedy that makes a structural argument about what formal guarantees can and cannot promise. If Schneier's catalog left you wanting the punchline, Mickens delivers it.
See Reading Reflections for submission format and grading.