Formalise

Stage three of The Method. State it precisely enough to be wrong.

There is a vast distance between “I think this works” and “here is exactly what I claim, and here is precisely what would refute it.” Most software lives in the first sentence. The second sentence is the entire work of this stage.

Formalise takes the conjecture you proposed — bold, simple, written in prose — and sharpens it until it has edges. Until a machine can check it, a colleague can attack it, or reality can refute it. The tagline is not poetry; it is a method borrowed straight from the philosophy of science. A statement that no possible observation could contradict tells you nothing about the world. The same is true of code. The specification you cannot violate, the type that admits every value, the test that passes no matter what the system does — these are not safety. They are the appearance of safety, which is worse than none, because it stops you looking.

Precision is the point, not a side effect

Karl Popper’s demarcation between science and pseudo-science was falsifiability: a theory is scientific to the extent that it forbids something, that it sticks its neck out and risks being wrong.¹ Formalise is that principle turned into engineering practice. To formalise a design is to state it precisely enough that it forbids specific behaviors — and therefore precisely enough that you can find out, before your users do, whether the system honors those prohibitions.

Leslie Lamport, who has spent a career on this problem, puts the human mechanism plainly. Quoting the cartoonist Guindon, he writes: “Writing is nature’s way of letting you know how sloppy your thinking is.”² His own corollaries are sharper still. “We need to understand our programming task at a higher level before we start writing code.” And the line that should hang over every codebase: “If we do not start with a specification, every line of code we write is a patch.”² A system built without an explicit specification may still run, but its intended behavior lives in scattered local repairs rather than in a shared global claim about what the thing is supposed to do.

Lamport closes with the sentence that justifies the whole stage: “Thinking does not guarantee that you will not make mistakes. But not thinking guarantees that you will.”² Formalising is thinking made checkable.

Formal methods are not academic anymore

The objection arrives on schedule: that’s fine for ivory towers, but I ship product. The objection is increasingly hard to defend for the parts of product engineering where concurrency, distribution, and state-space explosion dominate. The rebuttal comes from the least academic place imaginable — the team running some of the largest distributed systems on Earth.

In “How Amazon Web Services Uses Formal Methods” (2015), Newcombe and colleagues report what happened when AWS engineers began writing specifications for core systems — S3, DynamoDB, and others — in TLA+, Lamport’s specification language. Their summary is unambiguous: “In each, TLA+ has added significant value, either finding subtle bugs we are sure we would not have found by other means, or giving us enough understanding and confidence to make aggressive performance optimizations without sacrificing correctness.”³ On one DynamoDB design, the specification surfaced a bug whose shortest error trace required thirty-five high-level steps — exactly the kind of defect that conventional testing, exploring a tiny slice of a distributed system’s state space, is poorly equipped to reach.

Read that finding twice, because it contains both halves of the value of formalism. The first half is the obvious one: it found bugs AWS engineers believed they would not have found by other means. The second half is the one engineers underrate: it gave them the confidence to go faster in specific high-risk places. Having a checkable specification did not merely slow the work down; in the cases Newcombe and colleagues describe, it let engineers attempt aggressive optimizations they would otherwise not have dared to try, because they could model-check that the optimized design still satisfied the spec. Precision is not inherently the enemy of speed. Imprecision often disguises its cost as caution.

The spectrum of formalism, from types to proofs

“Formalise” sounds total, as if the only options were a full TLA+ specification or nothing. In reality it is a spectrum, and much of its day-to-day value is available near the lightweight end.

The most accessible rung is the type system. Types are a specification you write inline and a machine checks on every build — a continuous, automated claim about what values may flow where. And they pay measurably, though not universally. Gao, Bird, and Barr, in “To Type or Not to Type” (2017), took a corpus of public JavaScript bugs and asked a precise question: how many would a static type checker have caught? Their answer: adding TypeScript or Flow annotations would have detected 15% of the bugs (with a 95% confidence interval of 11.5% to 18.5%).⁴ That is a meaningful result, and the authors argue it is conservative because these were public bugs — defects that had already survived testing, review, and release. But the paper does not prove that typed projects have 15% fewer total bugs, or that every project will see the same benefit. The defensible claim is narrower: static types can catch a material class of defects early, and they can be adopted incrementally, file by file.

A rung up is Design by Contract, Bertrand Meyer’s formalism of preconditions, postconditions, and invariants.⁵ A contract states, precisely and locally, what a routine requires of its caller and what it guarantees in return. Each clause is a falsifiable claim attached to the exact place it matters; an assert is the smallest possible specification, and it earns its place by being checkable at runtime. Further along sits property-based testing — Claessen and Hughes’s QuickCheck (2000) — where instead of asserting that one input yields one output, you state a property that should hold over a generated input space (“reversing a list twice yields the original,” “the decoded encoding equals the input”) and let the machine search for counterexamples.⁶ Recent empirical work on property-based testing in practice finds that experienced users value it especially for complex code and for confidence beyond conventional example-based tests, while also naming real weaknesses: properties and generators can be hard to write, and effectiveness can be hard to assess.⁷ The property is the formalised claim; the generator is the refutation engine. This is the natural bridge to the next stage, where demonstration takes over.

At the far end lies derivation itself. Edsger Dijkstra’s A Discipline of Programming (1976) develops the weakest-precondition calculus, a method for deriving a program from its specification so that it is correct by construction rather than correct by subsequent testing.⁸ Few of us write code this way for a CRUD endpoint, and we shouldn’t. But the idea casts a long shadow over the whole stage: the more precisely you state what must be true, the less of your correctness you are leaving to luck and the more you are establishing by reason.

Make this concrete with the smallest possible instance. A function takes a discount percentage and applies it to a price. The informal version trusts that callers pass something sensible. The formalised version states the claim in code: a precondition that the percentage lies between 0 and 100, a return type that cannot be negative, and a property test asserting that the discounted price is always between zero and the original across a broad generated sample, not three hand-picked examples. Now the unstated assumptions — that the percentage isn’t 150, that it isn’t −10, that floating-point rounding can’t push the result below zero — become claims the machine can check repeatedly. The bug where a stacked promo code reaches 120% off and the system starts paying customers to check out is not caught by hoping. It is caught by the precondition, the first time a test or a caller violates it. None of this is heavyweight. It can be three lines, and it converts a silent assumption into a loud, early failure.

The reason to reach for this lightweight end first is leverage per unit of effort. A type annotation can be cheap to write and, once present, can be checked on every build by a machine that does not get tired or rushed before a deadline. An invariant written as an assertion documents the claim and enforces it in the same stroke, making drift more visible than a comment alone. A property test, written once and maintained, can run hundreds of randomized refutation attempts on every CI run. These are not the formalisms that win prizes, and they do not automatically pay for themselves. They are the ones most likely to be worth trying early because the marginal cost is low and the feedback is fast.

The blueprint that no one was forced to draw

Lamport’s framing essay carries a title that is a question — “Who Builds a House Without Drawing Blueprints?” — and the answer, in our industry, is most of us, most of the time.² We would not pour a foundation from a verbal description of a house. We routinely deploy distributed systems from a Slack thread and a vibe.

The cost of skipping the blueprint is not abstract, and the canonical example is not a software company’s embarrassment but a quarter-billion-dollar crater on another planet. In 1999 NASA lost the Mars Climate Orbiter. The mishap investigation board found the cause precisely: ground software supplied by one contractor produced a thruster impulse value in US customary units — pound-force seconds — while the spacecraft’s navigation software expected SI units, newton-seconds, as the interface specification required.⁹ A factor of 4.45, unstated and unchecked at the boundary between two systems, silently corrupted the trajectory until the orbiter entered the atmosphere at the wrong altitude and was destroyed.

Here is the subtle, important part for this stage. A specification existed; it called for metric units. The failure was not the absence of a spec but an implicit assumption on one side of an interface that no formal check enforced. This is where systems actually die — not in the logic you reasoned about carefully, but in the assumption you never made explicit: the units, the null, the range, the time zone, the encoding, the “this can never be negative.” Formalising is the discipline of dragging those assumptions into the light as claims, where a checker or a reviewer can confront them, rather than leaving them as quiet expectations that hold right up until the day they catastrophically don’t.

The honest limit: formalism has a price, so spend it deliberately

A method that demanded full verification of everything would be not rigorous but innumerate, and the engineers who dismiss formalism are usually reacting, correctly, to that overreach. So let us state the limit plainly.

Formalism costs. Writing a TLA+ specification for an ordinary CRUD path is usually waste; the spec may be longer than the risky part of the code and may catch little that simpler checks would not. Full machine-checked proof, as we will see in the next stage, can cost person-years. The discipline of Formalise is therefore not “verify everything.” It is to recognize that precision exists on a dial, and to turn the dial up in proportion to the cost of being wrong. For a throwaway script, the dial sits near zero and that is correct. For the throttling logic in a payments system, the concurrency protocol in a database, the state machine in a medical device — the places where a subtle bug is measured in money, trust, or lives — you turn it up, because those are the kinds of settings where AWS reported strong returns from formal specification.

The calibration error runs in both directions, and both are common. Teams cheerfully formalise the easy, comfortable part — exhaustively typing the data model of a CRUD app — while leaving the genuinely hard part, the concurrent state machine that actually carries the risk, specified nowhere but in three engineers’ overlapping and contradictory memories. The dial belongs on the hard part. Effort spent making the safe code safer is not rigor; it is rigor’s costume, and it can leave you more confident and no less exposed.

The skill is calibration. Apply cheap formalisms — types, contracts, named assumptions, properties — where they clarify real uncertainty and create fast feedback. Reserve the expensive ones — model checking, full specification, proof — for the load-bearing core where the conjecture, if wrong, is catastrophic. Formalism you can’t afford is not virtue; formalism you skipped on the one component that could sink the system is not pragmatism. Spend the budget where being wrong is expensive.

Why this stage sits in the middle

Notice where Formalise falls in The Method: after Conjecture, before Demonstrate. That ordering is not arbitrary. You cannot demonstrate a claim you have not stated precisely, because you would not know what counts as success. The vague conjecture — “the cache should make things faster” — cannot be tested, only argued about. The formalised version — “for any key read within the TTL after a write, the cache returns the written value; cache hit ratio under the production access pattern exceeds 0.9” — can be checked, and either it holds or it doesn’t.

Formalise is the hinge of the whole method. It is where the soft material of Observe and Conjecture — what is, what might be — becomes hard enough to test. It converts belief into a claim with consequences. And by doing so it sets up the stage that follows, whose job is to take your now-precise claim and try, in earnest and with evidence, to break it.

What Formalise looks like in practice

Write the interface before the implementation when it will clarify the design. Types and signatures first. They are among the lightest specifications you have, the machine checks them automatically, and empirical work shows they can catch a measurable share of JavaScript defects.
Turn invariants into assertions and contracts. Every “this should always be true” is a checkable claim. Write it as an assert, a precondition, a postcondition — the smallest formal spec there is, placed exactly where it matters.
Name and check your assumptions, especially the boring ones. Units, ranges, nullability, ordering, time zones, encodings. The Mars Climate Orbiter shows the stricter lesson: even a specified unit can fail if the interface does not enforce it. Make the boring assumption explicit and checkable, or the catastrophic version can still hide.
Specify the genuinely hard part precisely. For tricky concurrent or distributed logic, write it down in TLA+ or, at minimum, in prose precise enough to expose the state transitions and invariants. If you cannot specify the risky part, treat that as evidence that you may not yet understand it well enough to build it safely.
Convert “should” statements into properties. “It should round-trip” becomes a property-based test that hunts counterexamples across hundreds of generated inputs. Let the machine try to refute you.
Turn the dial to match the stakes. Cheap formalisms where they create fast feedback; expensive ones where being wrong is catastrophic. Calibration is the skill.

The handoff

Formalise ends with something Observe and Conjecture could not produce: a claim with edges. Not a hope, not a sketch, but a statement precise enough that the universe can disagree with it — encoded in types the compiler enforces, contracts the runtime checks, properties a generator attacks, or a specification a model checker explores. You have stated what must be true precisely enough to be wrong.

Now you have to find out whether it is. A precise claim is an invitation to refutation, and the next stage accepts the invitation in full. You have said what holds. Demonstrate will make you show it.

Further reading: Leslie Lamport, Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers (Addison-Wesley, 2002), available free from the author at lamport.azurewebsites.net/tla/book.html.

Karl R. Popper, Conjectures and Refutations: The Growth of Scientific Knowledge (Routledge & Kegan Paul, 1963). See especially Popper’s account of falsifiability and theories that forbid observable states of affairs. ↩
Leslie Lamport, “Who Builds a House Without Drawing Blueprints?,” Communications of the ACM 58, no. 4 (April 2015): 38–41. DOI: 10.1145/2736348. Full text: cacm.acm.org. ↩ ↩² ↩³ ↩⁴
Chris Newcombe, Tim Rath, Fan Zhang, Bogdan Munteanu, Marc Brooker, and Michael Deardeuff, “How Amazon Web Services Uses Formal Methods,” Communications of the ACM 58, no. 4 (April 2015): 66–73. DOI: 10.1145/2699417. (Published in the same issue as Lamport’s essay, which it cross-references.) ↩
Zheng Gao, Christian Bird, and Earl T. Barr, “To Type or Not to Type: Quantifying Detectable Bugs in JavaScript,” Proc. 39th International Conference on Software Engineering (ICSE 2017), 758–769. DOI: 10.1109/ICSE.2017.75. Headline result: 15% of sampled public JavaScript bugs detectable by TypeScript/Flow, 95% CI [11.5%, 18.5%]; the authors describe evaluation against public bugs as conservative, but the result should not be read as a universal bug-reduction rate. ↩
Bertrand Meyer, “Applying ‘Design by Contract’,” Computer 25, no. 10 (October 1992): 40–51. DOI: 10.1109/2.161279. Preconditions, postconditions, and class invariants. ↩
Koen Claessen and John Hughes, “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs,” Proc. 5th ACM SIGPLAN International Conference on Functional Programming (ICFP 2000), 268–279. DOI: 10.1145/351240.351266. The origin of property-based testing. ↩
Harrison Goldstein, Joseph W. Cutler, David Dickstein, Benjamin C. Pierce, and Andrew Head, “Property-Based Testing in Practice,” Proceedings of the ACM on Programming Languages / ICSE 2024. DOI: 10.1145/3597503.3639581. Interview study of experienced PBT users, including strengths, weaknesses, and adoption patterns. ↩
Edsger W. Dijkstra, A Discipline of Programming (Prentice-Hall, 1976). The weakest-precondition calculus and the derivation of programs from specifications. ↩
Mars Climate Orbiter Mishap Investigation Board, Phase I Report (NASA, November 10, 1999), A. G. Stephenson et al. The board found ground software produced impulse in pound-force seconds where the interface specification required newton-seconds (a factor of 4.45). NASA-hosted PDF. ↩