Toward automated verification of unreviewed AI-generated code

2026-03-16

I've been wondering what it would take for me to use unreviewed AI-generated code in a production setting.

To that end, I ran an experiment that has changed my mindset from "I must always review AI-generated code" to "I must always verify AI-generated code." By "review" I mean reading the code line by line. By "verify" I mean confirming the code is correct, whether through review, machine-enforceable constraints, or both.

I had a coding agent generate a solution to a simplified FizzBuzz problem. Then, I had it iteratively check its solution against several predefined constraints:

(1) The code must pass property-based tests (see Appendix B for a primer). This constrains the solution space to ensure the requirements are met. The suite includes tests verifying that no exceptions are raised and tests verifying that latency is sufficiently low.

(2) The code must pass mutation testing (see Appendix C for a primer). Mutation testing is typically used to expand your test suite. However, if we assume our tests are correct, we can instead use it to restrict the code. This constrains the solution space to ensure that only the requirements are met.

(3) The code must have no side effects.

(4) Since I'm using Python, I also enforce type-checking and linting, but a different programming language might not need those checks.
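To make constraint (1) concrete, here is a minimal sketch of a property-based check. It is not the setup from the experiment: it assumes the classic FizzBuzz rules (the simplified variant isn't spelled out here), the names `fizzbuzz` and `check_properties` are hypothetical, and it uses the standard library's `random` module in place of a dedicated property-testing library such as Hypothesis, so there is no input shrinking.

```python
import random
from typing import Callable


def fizzbuzz(n: int) -> str:
    """Hypothetical stand-in for the generated solution (classic rules assumed)."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def check_properties(func: Callable[[int], str], trials: int = 1000, seed: int = 0) -> None:
    """Assert requirement-level properties on randomly generated inputs."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        n = rng.randint(1, 10**6)
        out = func(n)
        # The output is one of the three words or the number itself.
        assert out in {"Fizz", "Buzz", "FizzBuzz", str(n)}, n
        # "Fizz" leads the output exactly when n is divisible by 3.
        assert out.startswith("Fizz") == (n % 3 == 0), n
        # "Buzz" ends the output exactly when n is divisible by 5.
        assert out.endswith("Buzz") == (n % 5 == 0), n


check_properties(fizzbuzz)  # raises AssertionError if any property fails
```

The properties describe what any correct solution must do rather than listing input/output pairs, which is what lets them act as a constraint on code I haven't read.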

These checks seem sufficient for me to trust the generated code without looking at it. A space of invalid-but-passing programs remains, but it's small and hard to land in by accident.
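As an illustration of how mutation testing (constraint 2) shrinks that space, here is a hand-rolled sketch. Real tools such as mutmut or Cosmic Ray generate mutants automatically; the mutants below are written by hand, and `passes_tests`, `make_variant`, and the "zero" branch are all hypothetical, chosen only to show a surviving mutant.

```python
from typing import Callable


def passes_tests(func: Callable[[int], str]) -> bool:
    """Stand-in test suite: checks the assumed classic rules for 1..100."""
    for n in range(1, 101):
        expected = ("Fizz" if n % 3 == 0 else "") + ("Buzz" if n % 5 == 0 else "")
        if func(n) != (expected or str(n)):
            return False
    return True


def make_variant(fizz_divisor: int = 3, zero_label: str = "zero") -> Callable[[int], str]:
    """Build the candidate solution, or a mutant of it when a constant is changed."""
    def fizzbuzz(n: int) -> str:
        if n == 0:  # extra behaviour that no test requires
            return zero_label
        out = ("Fizz" if n % fizz_divisor == 0 else "") + ("Buzz" if n % 5 == 0 else "")
        return out or str(n)
    return fizzbuzz


original = make_variant()
mutants = {
    "fizz divisor 3 -> 4": make_variant(fizz_divisor=4),
    "zero label changed": make_variant(zero_label=""),
}

assert passes_tests(original)
# A mutant "survives" if the test suite still passes after the change.
survivors = [name for name, mutant in mutants.items() if passes_tests(mutant)]
print(survivors)  # ['zero label changed']
```

The surviving mutant points at code the tests never exercise. If the tests are taken as the full specification, the response is to reject or strip that branch rather than to write a new test for it.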

I was concerned that the generated code would be unmaintainable. However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code.
