Human Judgment as a Specification

Posted on 09 June 2026.

The rise of GenAI in programming clearly requires an accompanying rise in formal methods, to confirm that AI systems running wild are producing the solutions we actually want. That in turn requires that we specify what we want. This specification is necessarily mathematical, to take advantage of the formal methods tools. But most programmers know far less about formal specification than they do about programming. What can they do?

Getting Specifications the Bad Way

The key problem we’re tackling is: how do we go from the informal (usually prose) to the formal. A natural solution is: use LLMs to translate prose into the formal specifications. On the one hand, this is not absurd: LLMs can do a fairly good job at generating terms in many contemporary formal notations. Here’s Ron Minsky, tongue-in-cheek:

I wonder if a more plausible model is, you go to your large language model and say, ‘Please write me a specification for a function that sorts a list.’ And then it, like, spits something out. And then you look at it and think, yeah, that seems about right.

Richard Eisenberg’s response to this gets to the heart of the matter: How can we be sure that the generated specification is the right one? The human may have just plain been wrong. They may have been obviously wrong, or they could have been wrong in subtle ways. They may have been been ambiguous, and the LLM may have taken the wrong interpretation. There may also be common misconceptions about the language (which may then also be embedded in the language models). Or they may have been referring to things for which there is no clear ground truth, or the only truth is in their head: what they mean depends on their context. None of these problems is fixable by an LLM alone.

Humans in the Loop

We therefore think it’s important for humans to be in the loop while formalizing specifications. A true vibe-coder, by definition, isn’t going to care. Instead, we want to target the responsible programmer: they care about their work quality, but they are also human, i.e., busy, lazy, and so on. What can we do to help them?

We believe any solution should have two key characteristics:

It must be meaningful. Asking humans to pass judgment on complex and abstract statements is unlikely to be effective. Laziness, automation bias, inability to form good judgments, and a desire to get things done will all lead to meaningless confirmation. It must be moderate. Asking lots of questions, no matter how simple, can be exhausting and will also lead to errors as the number of questions grows. We should try to make every human action be highly impactful and not ask users to perform too many actions.

... continue reading