
Taming LLMs: Using Executable Oracles to Prevent Bad Code

Zero-Degree-of-Freedom LLM Coding using Executable Oracles

John Regehr, March 26, 2026.

You Can’t Trust The Damn Things

By this point, most of us who have experimented with Claude, Codex, and other LLM-based coding agents have noticed that the current generation of these can sometimes do good work, at superhuman speed, when given some kinds of highly constrained tasks. For example, coding agents can eat a large, tricky API—such as the one for manipulating LLVM IR—for lunch, and they’ve also given me a number of fixes to non-trivial bugs in real software that could be applied as-is. On the other hand, these same tools frequently fall over in baffling ways, emitting tasteless or nonsensical code.

When an LLM has the option of doing something poorly, we simply can’t trust it to make the right choices. The solution, then, is clear: we need to take away the freedom to do the job badly. The software tools that can help us accomplish this are executable oracles. The simplest executable oracle is a test case—but test cases, even when there are a lot of them, are weak. Consider Claude’s C Compiler, which I wrote about earlier: even after passing GCC’s “torture test suite” and more, it still had 34 nasty miscompilation bugs that were within easy reach. But it wouldn’t have had those bugs if Csmith and YARPGen had been included in the testing loop that was used to bring up this compiler. These tools are better executable oracles because each of them implicitly encodes a vast collection of test cases.
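To make concrete why a generator like Csmith is a stronger oracle than any fixed test suite, here is a minimal sketch of a differential-testing loop. A hypothetical toy expression language stands in for C, a trusted reference evaluator stands in for GCC, and a seeded bug stands in for a real miscompilation; everything here is illustrative, not Csmith's actual machinery.

```python
import random

# An expression is an int literal or (op, left, right) with op in {+, -, *}.
def gen_expr(depth=3):
    """Randomly generate an expression tree (the Csmith role)."""
    if depth == 0 or random.random() < 0.3:
        return random.randint(0, 9)
    op = random.choice(['+', '-', '*'])
    return (op, gen_expr(depth - 1), gen_expr(depth - 1))

def eval_ref(e):
    """Reference evaluator (the trusted side of the oracle)."""
    if isinstance(e, int):
        return e
    op, l, r = e
    a, b = eval_ref(l), eval_ref(r)
    return a + b if op == '+' else a - b if op == '-' else a * b

def eval_buggy(e):
    """Hypothetical candidate with a seeded bug: '-' operands swapped."""
    if isinstance(e, int):
        return e
    op, l, r = e
    a, b = eval_buggy(l), eval_buggy(r)
    return a + b if op == '+' else b - a if op == '-' else a * b

def differential_test(candidate, trials=2000, seed=0):
    """Return every generated input where the candidate disagrees
    with the reference -- an implicit test suite of `trials` cases."""
    random.seed(seed)
    return [e for e in (gen_expr() for _ in range(trials))
            if eval_ref(e) != candidate(e)]
```

A correct candidate yields an empty list, while `differential_test(eval_buggy)` surfaces counterexamples that no hand-written suite is guaranteed to contain; this is the sense in which a generator implicitly encodes a vast collection of test cases.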

This piece is about collapsing as many failure-producing degrees of freedom as possible. Zero degrees of freedom is aspirational, but a good aspiration.

Some Example Scenarios

Besides the miscompilations, Claude’s C Compiler also fell over in terms of the quality of its generated code. The compiler contains a somewhat elaborate (and plausible-looking) set of optimization passes, but they appear to make very little difference in the quality of its output. But what if the human overseeing the creation of this compiler had incorporated an executable oracle for code quality into its testing loop? Well, I’m 100% speculating here, but my educated guess is that Claude would have been able to incorporate this feedback, and would have done a significantly better job optimizing the generated code. What would this oracle actually look like? I probably would have kept it simple—perhaps a count of the number of instructions that get executed when you run the compiler’s output. I’d also have given it a baseline such as gcc -O0, so that the LLM would know where there was the most room for improvement.
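A sketch of that kind of oracle, with a toy stack machine standing in for "run the compiler's output and count executed instructions"; the programs and the machine itself are hypothetical, but the shape of the check—same result, fewer dynamic instructions than the baseline—is the point.

```python
def run(program):
    """Execute a straight-line stack program; return (result, count of
    instructions executed)."""
    stack, count = [], 0
    for op, *args in program:
        count += 1
        if op == 'push':
            stack.append(args[0])
        elif op == 'add':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'mul':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop(), count

def quality_oracle(candidate, baseline):
    """Check that the candidate computes the same value as the baseline,
    then report dynamic instruction counts for both."""
    r1, n1 = run(candidate)
    r2, n2 = run(baseline)
    assert r1 == r2, "candidate miscomputes"
    return n1, n2

# Baseline (the gcc -O0 role): compute 2+2+2+2+2 the slow way.
baseline = [('push', 2)] * 5 + [('add',)] * 4
# Candidate: the strength reduction an LLM should be nudged toward.
candidate = [('push', 2), ('push', 5), ('mul',)]
```

Here `quality_oracle(candidate, candidate_baseline)` gives the feedback loop a single number to drive down, which is exactly the kind of signal an LLM can iterate against.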

Summary: The LLM was given a degree of freedom with respect to the quality of CCC’s output, and consequently it did a poor job there.

My group (in collaboration with other folks) is working towards automated synthesis of dataflow transfer functions, such as those used by LLVM’s “known bits” analysis. We wrote a paper about this, where we used randomized synthesis techniques, not LLMs. Recently I asked Codex to start writing transfer functions. By itself, it’s not bad at this, but not great. However, given access to our command-line tools for evaluating the precision and verifying the soundness of a transfer function, Codex produced results that are better than anything I’ve seen either in a real compiler like LLVM or in our own randomized synthesis results. The remaining degree of freedom that I left Codex—code size—allowed it to write pretty large transfer functions that explore some fairly deep case splits on the input structure, but capping the size of generated code would be an easy fix.
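For readers who haven't seen a known-bits domain, here is a minimal sketch of a transfer function plus the kind of soundness oracle described above, using exhaustive checking at a tiny bit width. The (zeros, ones) encoding and the AND transfer function follow the standard known-bits formulation; the oracle's interface and the bit width are my own illustrative choices, not the actual tools mentioned in the post.

```python
from itertools import product

WIDTH = 3
MASK = (1 << WIDTH) - 1

def concretize(zeros, ones):
    """All concrete values consistent with an abstract (zeros, ones)
    pair, where set bits of `zeros`/`ones` are known to be 0/1."""
    return [v for v in range(MASK + 1)
            if v & zeros == 0 and v & ones == ones]

def and_transfer(az, ao, bz, bo):
    """Candidate known-bits transfer function for bitwise AND."""
    # A result bit is known 0 if known 0 in either input,
    # and known 1 only if known 1 in both inputs.
    return az | bz, ao & bo

def sound(transfer, op):
    """Executable oracle: exhaustively verify soundness at WIDTH bits --
    every concrete result must land inside the abstract result."""
    for az, ao, bz, bo in product(range(MASK + 1), repeat=4):
        if az & ao or bz & bo:
            continue  # skip ill-formed abstract values
        rz, ro = transfer(az, ao, bz, bo)
        ok = set(concretize(rz, ro))
        for x in concretize(az, ao):
            for y in concretize(bz, bo):
                if op(x, y) & MASK not in ok:
                    return False
    return True
```

`sound(and_transfer, lambda x, y: x & y)` passes, while an over-eager variant (say, one claiming `ones = ao | bo`) is rejected immediately; a companion precision metric could count how many bits the transfer function leaves unknown versus the best achievable, which is the other half of the feedback Codex was given.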

... continue reading