Claude Can (Sometimes) Prove It

Let me get right to the point without any nonsense about aliens:

Claude Code, the new AI coding agent from Anthropic, is pretty good at interactive theorem proving (ITP). I find this very surprising, and you probably should too.

Interactive theorem proving tools such as Lean are the most powerful and trustworthy kind of formal methods tool. They have been used to formally verify important things such as cryptographic libraries, compilers, and operating systems. Unfortunately, even experts find ITP proofs time-consuming and error-prone. That’s why it’s exciting—and very surprising!—to find that Claude Code is so good at ITP. Today, Claude Code can complete many complex proof steps independently, but it still needs a ‘project manager’ (me) to guide it through the whole formalization. But I think Claude Code points to a world where experts aren’t necessary, and theorem provers can be used by many more people.

The rest of this post digs into what Claude Code can actually do. But if you’re interested in automated reasoning or formal verification, I recommend you stop reading, go sign up for Claude Code, Gemini CLI, Aider, Codex, or some other coding agent, and try it out on a problem you know well. It’ll cost about $20 / month for something useful, and maybe $100 / month for access to a state-of-the-art model. I reckon you’ll be able to get surprising successes (and interesting failures) with about two hours of work.

(If you actually do this, email me and tell me how it went).

Interactive Theorem Proving is Hard

There’s a deep tradeoff in automated reasoning between restricting the underlying math—which tends to make reasoning easier to automate—and making it more powerful and general. For example, an SMT solver can only solve queries expressed in a simple logical language. As a result, SMT solvers are predictable enough to run a billion times a day as part of AWS’s S3 bucket security controls. No-one has to write a proof on a whiteboard; instead, the solver solves each theorem quickly and cheaply.1

At the other end of the spectrum, interactive theorem proving tools are designed to be as general as possible so we can use them for Real Math™—for example, Fermat’s Last Theorem. Unfortunately, this power makes ITP tools notoriously difficult to use. Nick Benton memorably wrote in 2006 “I have rarely felt as stupid and frustrated as I did during my first few weeks using Coq.” I taught myself Lean this year—almost twenty years after Benton. The specific issues he pointed to are long solved, but I still felt stupid and frustrated, far beyond any other kind of tool I’ve used.

Why is ITP so hard? Some reasons are quite obvious: interfaces are confusing, libraries are sparse, documentation is poor, error messages are mysterious. But these aren’t interesting problems, and anyway they are getting fixed. Lean, in particular, is improving at astonishing pace thanks to focused work by the Lean FRO.

Deeper than this, and keenly felt if not always said, ITP is cognitively demanding in multiple ways. It requires the ability to juggle abstractions as in pen-and-paper mathematics, a willingness to wrangle complex constraints in the exotic languages used by ITP tools, and a tolerance for microscopic pedantry which most people find almost impossible to summon.

... continue reading