
Can LLMs model real-world systems in TLA+?

Why This Matters

This article highlights the growing potential and current limitations of large language models (LLMs) in modeling complex real-world systems using formal methods like TLA+. As LLMs improve, understanding their ability to accurately represent and verify system behaviors becomes crucial for advancing reliable system design and automation in the tech industry.

Key Takeaways

Editors’ note: AI has been actively pushing the frontier of applied formal methods for computing systems. In this article, the Specula team describes their experience evaluating LLMs on modeling system code, a basic capability for agentic model checking, using TLA+, a specification language for concurrent and distributed systems. This is the seventh post in The Next Horizon of System Intelligence series.

Several months ago, we asked Claude to write a TLA+ specification (spec) for Etcd’s Raft implementation. It passed syntax checks, ran through the TLC model checker, and at first glance looked like a polished formal model. Then we noticed something: the spec had very little to do with Etcd-specific details. What Claude produced was not a spec for Etcd; it was the spec from the appendix of the Raft paper. Later, we kept coming back to the question: how do we tell whether an AI is faithfully modeling a computing system, or just reciting the system’s reference paper?

This question gets harder to answer as large language models (LLMs) keep improving. LLMs have seen nearly every TLA+ example online. Asking one to “write a Raft spec” is almost the same as asking it to recall something from training. Asking it to “write Etcd’s spec”, a spec that reflects how Etcd actually decomposes its atomic actions and evolves its state, is a different problem entirely. It tests whether an LLM can abstract the logic out of a complex implementation and turn that abstraction into a correct formal model.
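To make the distinction concrete, here is a hypothetical sketch (the operator names and the particular decomposition are illustrative, not taken from Etcd’s code or from SysMoBench): a paper-style Raft spec can model handling an AppendEntries request as one big atomic action, while an implementation-faithful spec has to mirror the smaller steps the code actually performs, each of which is separately interleavable.

```tla
(* Paper-style: one atomic action, as in the Raft paper's reference spec *)
HandleAppendEntriesRequest(i, j, m) ==
    /\ CheckTerm(i, m)          \* reject stale-term messages
    /\ MatchAndAppend(i, m)     \* log consistency check + append, atomically
    /\ AdvanceCommitIndex(i, m)
    /\ Reply(i, j, m)

(* Implementation-faithful (illustrative): the same logic split into
   separate atomic actions, so other events can interleave between them *)
StepCheckTerm(i, m)     == ...  \* term check as its own step
StepTruncateConflict(i, m) == ...  \* truncate a conflicting log suffix
StepAppendEntries(i, m) == ...  \* append new entries
StepAdvanceCommit(i, m) == ...  \* move commitIndex, then reply
```

The second shape admits interleavings (and therefore bugs or invariant violations) that the first, coarser atomicity hides, which is precisely what makes modeling the implementation rather than the paper a harder task.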

SysMoBench is our attempt to differentiate the two via an automated benchmark.

What is SysMoBench?

SysMoBench provides eleven systems to an LLM and automatically evaluates the TLA+ specs it generates.

Table 1: Systems used as artifacts in SysMoBench

The eleven systems span concurrent synchronization and distributed protocols. For each task, we provide source code, a trace-collection harness, and an invariant template.
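The article does not show SysMoBench’s actual invariant templates, but a template for a leader-election artifact might look like the following hypothetical fragment: the harness fixes the shape of the property, and the generated spec must define the state variables (here `state` and `currentTerm`, both illustrative names) that the property quantifies over.

```tla
(* Hypothetical invariant template: at most one leader per term.
   The generated spec must define Servers, state, and currentTerm
   for this property to be checkable by TLC. *)
ElectionSafety ==
    \A i, j \in Servers :
        (/\ state[i] = "Leader"
         /\ state[j] = "Leader"
         /\ currentTerm[i] = currentTerm[j]) => i = j
```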

Figure 1: Overview of SysMoBench

Evaluation runs in four phases:

... continue reading