
lf-lean: The frontier of verified software engineering


We present lf-lean, a verified translation of all 1,276 statements of the Logical Foundations textbook from Rocq to Lean, produced by frontier AI with ~2 person-days of human effort versus an estimated ~2.75 person-years manually (a 350x speed-up). We achieve this through task-level specification generators: because many software transformations are semantics-preserving, correctness can be defined once for an entire task class and checked automatically across all instances and codebases. This scales human oversight from O(n) to O(1) regardless of program complexity. Placed on METR's time horizon graph, our result suggests verified software engineering is advancing faster than expected.

[Figure 1 — plot of release date vs. task duration (human equivalent); series: verified-software-engineering benchmarks AlphaVerus, FVAPPS, Verina, and lf-lean alongside unverified-coding results for o1, 3.7 Sonnet, o3, Grok 4, GPT-5, GPT-5.1 Codex Max, and Opus 4.5.]

Figure 1: Closing the Gap Between Verified and Unverified Software Engineering. Adapted from METR's Time Horizon plot, including software verification benchmarks where AIs write code and then prove it correct. We plot only the time horizon for software implementation (not verification) for an apples-to-apples comparison of how much functionality is implemented via each method of software development. lf-lean gives us an encouraging measurement of where verified software engineering capability is.

Introduction

As AIs automate increasingly complex software tasks, a fundamental tension emerges: how do we know the code they produce is correct? The standard approach of reviewing AI-generated code and its tests doesn't scale. Human review effort grows proportionally with code volume, while AI code generation capacity grows exponentially. If this trend continues, we're headed toward a world where deployment is bottlenecked on human review capacity, or code simply ships without review.

This post introduces a different approach: task-level specification generators. Instead of reviewing each piece of generated code individually, we review a correctness specification once for an entire class of tasks; that single specification then covers every instance of the task across any codebase. The key insight is that for many important software transformations (translation, optimization, refactoring), we can define what "correct" means generically, without reference to any specific codebase. Moreover, as codebase complexity increases, much of feature implementation is a semantics-preserving transformation of existing features, so the correctness specification for the new feature is a simple corollary of the correctness specification for the existing features. Thus, building task-level specification generators helps scale oversight in two ways: (1) you can easily work on many more codebases simultaneously, and (2) what works on simple codebases extends to arbitrarily complex ones.
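The oversight-scaling idea can be caricatured in a few lines of Python. This is a property-testing toy, not the proof-based mechanism the post describes; the names (`make_spec`, `check`) and the sampling strategy are hypothetical illustrations of reviewing one generator instead of n outputs:

```python
import random

def make_spec(reference_fn, num_trials=100, seed=0):
    """Task-level spec generator (toy): derive a machine-checkable
    correctness obligation for one instance of the task class
    "semantics-preserving rewrites of a function". Reviewing this
    generator once oversees every instance it is applied to."""
    rng = random.Random(seed)
    samples = [rng.randint(0, 1000) for _ in range(num_trials)]

    def check(candidate_fn):
        # A valid rewrite must agree with the original on all samples.
        return all(candidate_fn(x) == reference_fn(x) for x in samples)

    return check

def slow_sum(n):             # original implementation
    return sum(range(n + 1))

def fast_sum(n):             # candidate semantics-preserving rewrite
    return n * (n + 1) // 2

def broken_sum(n):           # incorrect rewrite, caught automatically
    return n * n // 2

check = make_spec(slow_sum)
print(check(fast_sum))       # True
print(check(broken_sum))     # False
```

The human reviews `make_spec` once; checking each of n candidate rewrites is then fully automatic, which is the O(n) to O(1) shift the post describes (with a proof assistant replacing sampling, so the check is sound rather than statistical).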

We prototype task-level specification generation with rocq-dove, building a scalable environment for verified translation from Rocq to Lean. Given any Rocq source code, rocq-dove automatically generates a correctness specification (a theorem that any valid translation must satisfy) and grades model-generated translations and proofs against it. Because the specification is derived from the source code itself, rocq-dove scales to arbitrary codebases without additional human effort.
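As a hypothetical illustration of what one such obligation could look like (the excerpt does not show rocq-dove's actual generated theorems): given a Rocq statement, a valid translation must state and prove the corresponding Lean theorem, and Lean's kernel checks the proof mechanically, so no human needs to review this instance:

```lean
-- Rocq source (statement to translate):
--   Theorem plus_0_r : forall n : nat, n + 0 = n.
-- Obligation: the translation must prove the corresponding
-- Lean statement. Here `n + 0` reduces definitionally to `n`
-- in Lean 4, so `rfl` suffices.
theorem plus_0_r (n : Nat) : n + 0 = n := rfl
```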

We use rocq-dove to generate an environment for a verified translation of all 1,276 statements in Logical Foundations [1]. Using our scaffold, frontier AI models autonomously produced verified translations of 97% of the statements. We solved 6 extreme-difficulty statements manually (~2 person-days), after which models completed the remaining 2.5%. For context, we estimate the full task would have taken 3 person-months for translation plus 2.5 person-years for verification if done entirely by humans.

Two key world-view updates:

1. Verified software engineering can scale without marginal human oversight. rocq-dove automatically generated correctness specifications for the 1,276 translations in lf-lean; no human judgment was needed to define "correct" for each case. Combined with rich feedback from the proof assistant, this creates a dense reward signal that lets AI iterate toward correct solutions autonomously.

2. AI capabilities on software verification are advancing faster than expected. We developed lf-lean, the first sufficiently complex evaluation to demonstrate this. Compared to METR's time horizon trend [2] on unverified software engineering, verified software engineering appears to be catching up.
