Reading Reflection 2
Due: Friday, April 17 at 5:00 PM | 50 points
The previous reflection raised doubts. This week we look at the evidence. Three practitioners who used formal methods report what they found, and the picture is more complicated than either side wants to admit.
Theme
The theme for this reflection is "The Evidence Is In." The readings come from people who have done the work. One tried to write complete specifications and found it impossible. Another wrote TLA+ specs as a contractor and watched the understanding leave with him. A third built a tool that found real bugs in LLVM. The evidence is real, but it does not point in one direction. Your job is to decide who has the better evidence.
Some readings may be behind publisher paywalls. As a UW student, you have free access through the UW Libraries proxy. See the Resources page for setup instructions.
Options (pick one)
Option A: Dodds 2025, Specifications Don't Exist
Dodds works at Galois, a company that has spent decades delivering formal verification projects. He argues that the hard problem is not proof but specification: for most real-world systems, no precise and coherent formal specification can be written. The systems we have verified successfully (compilers, cryptographic libraries, microkernels) are a tiny niche with unusually clean mathematical properties. Dodds illustrates the point with the PDF format, where Galois spent years on a DARPA-funded effort and produced a specification that matched neither the behavior of real PDF readers nor any reasonable notion of correctness. If you want to think about where formal methods hit a wall that is not about cost or tooling but about whether the right question can even be asked, this is the option.
Option B: Helwer 2025, A Supposedly Worthwhile Contract I'll Never Do Again
Helwer is a TLA+ consultant who has done exactly the kind of contract the AWS paper makes sound like a good idea: hire an expert, model the system, find bugs. His report is more complicated. The spec process surfaced real design problems. But the dev team's eyes glazed over at the TLA+ notation, and the understanding that mattered ended up in Helwer's head, not theirs. His post-mortem: the real value of TLA+ is the thinking it forces, and outsourcing that thinking to a contractor puts the valuable thing in the wrong head. Hillel Wayne's alternative (pair-programming the spec with a dev team member) may be a harder sell but keeps the understanding where it belongs.
Option C: Lopes et al. 2015, Provably Correct Peephole Optimizations with Alive
This is the hardest reading on the R2 page and the most concrete success story. Lopes and colleagues built Alive, a tool that uses the Z3 SMT solver to verify LLVM compiler optimizations. They translated 334 optimizations and found 8 bugs, all confirmed by the LLVM developers. The buggiest file, MulDivRem, had 6 of its 44 transformations wrong. Alive handles LLVM's three distinct kinds of undefined behavior precisely, a domain where human intuition fails regularly. If you want to see what Z3-backed verification looks like when it works, and are prepared for a denser read, this is the option.
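To make "Z3-backed verification" concrete before you commit: here is a minimal sketch of what it means to check a peephole rewrite with an SMT solver. This is Python with the Z3 bindings, not Alive or its specification language, and it ignores LLVM's undef, poison, and undefined-behavior semantics, which is where much of the paper's difficulty lives. The two rewrites below are illustrative examples, not ones from the paper.

```python
# Minimal sketch: asking Z3 whether a peephole rewrite is valid for every
# 32-bit input. Not Alive itself; no undef/poison semantics.
# Requires: pip install z3-solver
from z3 import BitVec, Solver, sat

x = BitVec("x", 32)

def check_rewrite(name, before, after):
    # The rewrite is valid iff no input makes the two sides differ.
    s = Solver()
    s.add(before != after)
    if s.check() == sat:
        # Z3 prints bitvector counterexamples as unsigned values.
        print(f"{name}: WRONG, counterexample x = {s.model()[x]}")
    else:
        print(f"{name}: valid for all 32-bit x")

# Valid: multiplying by 2 is the same as shifting left by 1 (both wrap).
check_rewrite("x * 2  ==>  x << 1", x * 2, x << 1)

# Plausible but wrong: signed division by 2 rounds toward zero, while
# arithmetic shift right rounds toward negative infinity (try x = -1).
check_rewrite("x sdiv 2  ==>  x ashr 1", x / 2, x >> 1)
```

The second rewrite is exactly the kind of thing that looks obviously fine in code review; the solver hands back a counterexample instead of an argument, which is the core of what Alive does at much larger scale.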
Enrichment
These are optional readings that go deeper into the evidence presented by the three options above. You do not need to read any of them, but if one of the options left you wanting more, start here.
Newcombe et al. 2015, How Amazon Web Services Uses Formal Methods
If Helwer's post left you wondering whether TLA+ works when the engineers writing the spec are on the team that ships the code, this is the other side of the experiment. AWS engineers learned TLA+ in two to three weeks and wrote their own specs for S3, DynamoDB, and EBS. They found bugs that had survived design review, code review, and testing, including one in DynamoDB whose shortest error trace was 35 steps long. Helwer's diagnosis was that the value leaves with the contractor; Newcombe's teams kept it. The paper is also honest about what TLA+ cannot do: emergent performance problems, like cascading failures from retry storms, remain outside its reach.
Demirbas 2026, TLA+ as a Design Accelerator: Lessons from the Industry
Where Newcombe shows TLA+ adoption in depth at one company, Demirbas shows breadth across eight projects at four organizations over a decade. Framing TLA+ as a "design accelerator" rather than a verifier changes how the evidence should be read: the projects range from CosmosDB (model minimally, capture only client-facing semantics) to MongoDB distributed transactions (generate thousands of retroactive unit tests from TLA+ model traces). Each project contributes a distinct lesson about how to use formal specification without waiting for the full verification payoff.
King et al. 2000, Is Proof More Cost-Effective Than Testing?
Hard cost-benefit numbers from a real project. King's SHOLIS system used Z for specification and SPARK Ada for code, with extensive proof at both levels. Z proof found 16% of all faults with 2.5% of effort, and the provers caught specification errors that reviewers and test-case writers missed. The caveat: SHOLIS was a single-processor safety-critical system with a clean architecture, exactly the kind of naturally formalizable niche Dodds describes. The question the paper raises is not whether proof can beat testing on efficiency but how rare the conditions for that result really are.
Moy et al. 2013, Testing or Formal Verification: DO-178C Alternatives and Industrial Experience
Read alongside Lopes. Both show solver-aided verification doing work that testing cannot. Lopes used an SMT solver to catch eight real bugs in LLVM's peephole optimizer; Moy describes Airbus and Dassault-Aviation using Frama-C abstract interpretation and deductive verification to replace unit and robustness testing on the A400M, A380, A350, and Falcon flight control software, under a regulatory standard that explicitly permits the substitution. The cost win came from maintenance, not from the first pass.
Bennion and Habli 2014, A Candid Industrial Evaluation of Formal Software Verification Using Model Checking
Sits between Helwer and Lopes. Rolls-Royce applied model checking to a 7000-block aero-engine health monitoring model and found that 51% of real aerospace bugs were in principle detectable, but only a subset were reachable in the available model. Writing model-checkable properties took comparable effort to manual review, though the full model-checking process took roughly twice as long. Model checking found errors earlier in the development cycle, which saved rework, but the economics were a wash. The tool worked. The question is whether "works but costs the same" is enough to justify adoption.
See Reading Reflections for submission format and grading.