Formal Verification Gates for AI Coding Loops

Some of the most serious software bugs are also the most boring. A user should not be able to read another tenant’s data. Nobody disagrees with this, nobody stands up in a design review to defend Alice reading Bob’s records, and yet broken access control remains the #1 category on the OWASP Top 10.

These bugs ship because the rule has been placed in the wrong part of the system. It lives in a prompt, in a review checklist, in the shared expectation that every future engineer, and now every future model invocation, will remember the invariant and reapply it correctly.

That assumption was already weak, and with AI generating most of the code, it fails outright. You can do all the obvious things: put rules in CLAUDE.md, write a careful system prompt, add “authorization IS VERY IMPORTANT” to the agent instructions. You should do all of that. But after the model has written sixteen thousand lines, the real question remains: how do you know the code does what you wanted? Tests help, but tests are empirical. They check the cases you and the model remembered to write, and they cannot speak for the handler someone adds next week.

I want to pull a different lever. My bet, stated plainly, is this: for a wide class of production software, structural backpressure beats incremental improvements in agent intelligence. Existing models can already write almost all of your code. The limiting factor is whether you can know that they did what you wanted, and that knowledge comes from the substrate they write against, not from waiting for a smarter model.

Shen-Backpressure is the tool and methodology I built to explore that bet. I will show what it does through a running demo, and then show how to wire the same loop into your own project.

Behavioral Gates And Structural Gates

Most prompt-level constraints are behavioral gates. We tell the model “do not skip authorization,” “validate inputs,” “use the shared helper.” Models follow these instructions often enough to be useful and fail often enough to make the whole arrangement unstable. A behavioral gate depends on the model remembering the rule, recognizing where it applies, resisting the gravitational pull of local context, and then on a human reviewer maintaining the same invariant across the whole codebase.

Structural gates are different: a compiler, a type checker, a test runner, a linter, a proof checker. Each produces a concrete answer about the artifact in front of it. The answer is not perfect, but it is real, and inside its scope it refuses when the code is wrong.

That refusal is the point. It lets us move work out of the model’s instruction space and into the substrate the model is building on. Instead of spending tokens begging the model to remember an invariant, we arrange the code so the invariant is hard to violate by accident: take the property you care about most, express it in a form a machine can check, project it into the implementation, and let the loop bounce off that check until the emergent artifact satisfies it.
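One way to picture "hard to violate by accident" is a guard type: a value that can only exist if the check has already passed, and that the handler demands as its argument. The sketch below is illustrative only. The names AuthorizedRead, AccessDenied, RECORD_OWNER, RECORD_BODY and read_record are hypothetical, and this is not presented as how any particular tool does it. Python enforces the check when the value is constructed; a statically typed language, or a type checker such as mypy run over the handler signature, would additionally reject call sites that pass anything else.

```python
from dataclasses import dataclass

# Hypothetical in-memory data standing in for a real datastore.
RECORD_OWNER = {"r1": "acme", "r2": "globex"}
RECORD_BODY = {"r1": "invoice 17", "r2": "invoice 42"}


class AccessDenied(Exception):
    """Raised when proof of access cannot be constructed."""


@dataclass(frozen=True)
class AuthorizedRead:
    """Evidence that `user_tenant` owns `record_id`.

    The invariant is checked in __post_init__, so every instance built
    through the normal constructor has already passed the tenant check.
    """

    user_tenant: str
    record_id: str

    def __post_init__(self) -> None:
        owner = RECORD_OWNER.get(self.record_id)
        if owner is None or owner != self.user_tenant:
            raise AccessDenied(
                f"tenant {self.user_tenant!r} may not read {self.record_id!r}"
            )


def read_record(proof: AuthorizedRead) -> str:
    # The handler never sees a raw (tenant, record_id) pair, so there is
    # no code path here on which the tenant check could be forgotten.
    return RECORD_BODY[proof.record_id]
```

A caller writes read_record(AuthorizedRead("acme", "r1")). Asking for "r2" as tenant "acme" fails while the proof is being built, before any handler code runs. A handler added next week inherits the same check simply by taking an AuthorizedRead parameter.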

This is what makes backpressure powerful, in the sense that Geoff Huntley’s Ralph and the essay Don’t Waste Your Backpressure use the term. When previous errors are piped into the next iteration, a deterministic gate gives the loop something firmer than vibes to push against. That loop is no longer a niche idea: Codex CLI now ships /goal, OpenAI’s own take on the Ralph loop, keeping a goal alive across turns and refusing to stop until it is met.
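The shape of such a loop can be sketched in a few lines. This is a toy under stated assumptions, not the Ralph loop or Codex's /goal as implemented: backpressure_loop, syntax_gate and stub_model are hypothetical names, the "model" is a stub, and the only gate is Python's built-in compile(), standing in for a real compiler, test runner, or proof checker.

```python
from typing import Callable, List, Optional, Tuple

# A gate inspects a candidate and returns error messages; empty means pass.
Gate = Callable[[str], List[str]]
# A generator takes the goal plus the previous errors and returns a candidate.
Generator = Callable[[str, List[str]], str]


def backpressure_loop(
    goal: str,
    generate: Generator,
    gates: List[Gate],
    max_iters: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Regenerate until every gate passes or the iteration budget runs out."""
    errors: List[str] = []
    for _ in range(max_iters):
        candidate = generate(goal, errors)  # previous errors are fed back in
        errors = [msg for gate in gates for msg in gate(candidate)]
        if not errors:
            return candidate, []
    return None, errors  # budget exhausted; report the last failures


def syntax_gate(source: str) -> List[str]:
    """Deterministic gate: does the candidate parse as Python?"""
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    return []


def stub_model(goal: str, errors: List[str]) -> str:
    """Stand-in for a model call: broken first, repaired once it sees errors."""
    if errors:
        return "def handler():\n    return 1\n"
    return "def handler(:\n    return 1\n"
```

Calling backpressure_loop("write a handler", stub_model, [syntax_gate]) rejects the first candidate, passes the gate's message to the second call, and returns the repaired source. The loop itself never judges quality; it only relays what the gates refuse.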
