Test, Don't (Just) Verify

AI is making formal verification go mainstream.

Companies building AI-assisted mechanical provers are raising funds at billion-dollar valuations, and new people are trying proof assistants, overwhelmingly Lean, at unprecedented rates. Models are achieving fascinating results in competitions previously considered to contain some of the hardest problems in the world, such as the IMO, the ICPC, and the Putnam, as well as on open problems in mathematics such as the Erdős problems. And it's not just the hobbyists who are excited about AI-assisted proofs: from Terry Tao to Martin Kleppmann to Ilya Sergey, prominent researchers around the world are excited and hopeful about the effects.

Let me quickly give you a rundown of the argument:

There are multiple complex challenges in formal verification. The first, and the one that is very hard to solve technically, is that most software in the world does not have a formal specification. A formal specification is a simpler mathematical description of the system we build. Algorithms have formal specifications. Data structures, protocols, data formats, and safety-critical systems typically have formal specifications. The majority of the programs in the world don't have a formal specification; hell, most of them don't even have informal specifications. At the limit, which is where we actually are, the specification of a program is the program itself: the implementation is the specification. The lack of a formal specification makes it very hard to formally verify a piece of software, because what would you even verify?
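
To make "formal specification" concrete, here is a minimal sketch in Lean 4 of what one could look like for a hypothetical sorting function; the names `IsSorted`, `SortSpec`, and `sort` are illustrative, not taken from any particular codebase.

```lean
-- A sketch of a formal specification for a (hypothetical) sorting function.
-- `IsSorted l` says the list is in non-decreasing order.
def IsSorted : List Nat → Prop
  | [] => True
  | [_] => True
  | x :: y :: rest => x ≤ y ∧ IsSorted (y :: rest)

-- The specification itself: the output is sorted and contains exactly the
-- same elements as the input (the same count for every value).
def SortSpec (sort : List Nat → List Nat) : Prop :=
  ∀ l : List Nat,
    IsSorted (sort l) ∧ ∀ x : Nat, (sort l).count x = l.count x
```

A verified sorting function is then any implementation together with a machine-checked proof of `SortSpec`; most real-world software has no statement like this to prove in the first place.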

The second issue is that proof engineering, the practice of writing proofs of theorems about your systems, is very hard. Proofs have many domain-specific elements to them: a proof of a mathematical theorem will be very different from a proof about a programming language, and a proof about a programming language will depend heavily on the underlying constructs of its theoretical framework. The most widely taught proof engineering book is Software Foundations, and every chapter has a different style of proofs. Someone who went through Volume 2: Programming Language Foundations will not find the problems in Volume 6: Separation Logic Foundations intuitive or obvious. There are other problems, such as tooling for proof automation, brittleness of proofs, and reusability of proofs, but I don't find them particularly fundamental to proof engineering itself so much as problems of the current generation of tools, so we can leave those aside for now.
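
To give a flavor of the mechanics, here is a minimal Lean 4 sketch (not from Software Foundations, which uses the Coq/Rocq prover): even a trivial arithmetic fact takes an explicit induction and the right lemma, and realistic proofs about programs are orders of magnitude larger and more domain-specific.

```lean
-- Even `0 + n = n` is not true by mere computation, because `Nat.add`
-- recurses on its second argument; it takes an induction and a rewrite.
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ n ih => rw [Nat.add_succ, ih]
```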

The rise of LLMs in programming vastly affects both of these points. It affects point number 1 because AI-assisted programming is a very natural fit for specification-driven development. AI-assisted programming pushes the limits of programming from what you can implement to what you can specify and what you can verify. This is a great incentive for writing executable specifications, because then you can put the LLM inside a loop until it achieves the stated objective, irrespective of how it gets there. I predict that this will give rise to program optimizers and translators that will be transformative for our work in those domains. However, tests are, infamously, incomplete. Tests are great at finding bugs (actually they are not so great most of the time, but that's another discussion), but they cannot prove the absence of bugs. SQLite famously has millions of tests, yet researchers still find bugs in SQLite; similar situations arise in almost all software.
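
As a rough illustration of the loop, here is a minimal sketch in Python; `llm_propose` and `run_checks` are hypothetical placeholders for a model call and whatever executable specification you have (tests, property checks, or a proof obligation).

```python
# A sketch of putting an LLM in a loop against an executable specification.
# A candidate is accepted only because the checks pass, not because we
# trust the model's reasoning about its own output.
from typing import Callable, Optional, Tuple

def synthesize(
    task: str,
    llm_propose: Callable[[str, str], str],         # (task, feedback) -> candidate program
    run_checks: Callable[[str], Tuple[bool, str]],  # candidate -> (passed?, feedback)
    max_attempts: int = 20,
) -> Optional[str]:
    feedback = ""
    for _ in range(max_attempts):
        candidate = llm_propose(task, feedback)
        ok, feedback = run_checks(candidate)
        if ok:
            return candidate
    return None  # no candidate met the specification within the budget
```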

The only way to prove the absence of bugs is formal verification, and industry has seen great examples of this. Highly cited formal verification projects include the CompCert C compiler: random testing against GCC and Clang found 79 bugs in GCC and 202 bugs in Clang, but only 2 bugs in CompCert, both in its unverified parser and none in its verified compilation pass, a striking win for formal verification. (Thanks to A. Jesse Jiryu Davis, I learned that researchers have studied the sources of failures in formally verified distributed systems and found that wrong assumptions about the unverified parts of the system are the likely culprit.)

This makes formal verification a prime target for AI-assisted programming. Given that we have a formal specification, we can just let the machine wander around for hours, days, even weeks. If it comes back with a solution, we trust not the probabilistic predictions of the so-called artificial intelligence, but the symbolic proof checker that verifies the solution. This idea has always been at the core of formal verification: the ability to couple unsound proof generation with sound proof checking has given rise to complex tactics, algorithms that produce proofs by searching for them, which greatly amplify what proof engineers can do.
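
A minimal sketch of that division of labor, assuming a `lean` binary on the PATH; the candidate generator is a hypothetical stand-in and can be as unsound as it likes, because only the checker's verdict is trusted.

```python
# Unsound proof generation, sound proof checking: candidates can come from an
# LLM, a tactic, or brute-force search; we believe only the checker's verdict.
import subprocess
import tempfile
from typing import Callable, Iterable, Optional

def kernel_accepts(lean_source: str) -> bool:
    # Write the candidate theorem and proof to a file and let Lean check it.
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(lean_source)
        path = f.name
    result = subprocess.run(["lean", path], capture_output=True)
    return result.returncode == 0  # trust the symbolic checker, not the generator

def search_for_proof(
    theorem_stmt: str,  # e.g. "theorem t : 2 + 2 = 4"
    generate_candidates: Callable[[str], Iterable[str]],  # hypothetical proof generator
) -> Optional[str]:
    for candidate in generate_candidates(theorem_stmt):
        if kernel_accepts(f"{theorem_stmt} := by\n{candidate}"):
            return candidate  # verified, regardless of how it was found
    return None
```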

This is not the end of the good news. AI is also very good at writing proofs, at least much better than the average software engineer. Given that we have a perfect oracle, we can also turn the problem into RLVR (Reinforcement Learning with Verifiable Rewards), which means we should expect the models to get even better at it over time, as they did in chess, Go, and many other similar problems. This is the promise behind the billion-dollar valuations: the companies started with impressive autoformalization techniques and automated provers for tackling competition problems and open conjectures, but the real industrial value lies in automating software engineering itself. The engineer writes a verbal description, it is autoformalized into a Lean theorem, the theorem is proven by the automated prover, and voilà, we have a program we can fully trust.
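
The key ingredient is that the reward comes from a verifier rather than from human preference or a learned reward model. A minimal sketch of such a reward, with `checker_accepts` as a hypothetical stand-in for something like the Lean kernel's verdict:

```python
# A verifiable reward for proof generation: binary, and impossible to game
# without actually producing a proof that the checker accepts.
from typing import Callable

def verifiable_reward(
    theorem: str,
    candidate_proof: str,
    checker_accepts: Callable[[str, str], bool],  # hypothetical proof-checker oracle
) -> float:
    return 1.0 if checker_accepts(theorem, candidate_proof) else 0.0
```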

or do we?
