When AI Writes the World’s Software, Who Verifies It?
AI Is Rewriting the World’s Software
Code Metal recently raised $125 million to rewrite defense industry code using AI. Google and Microsoft both report that 25–30% of their new code is AI-generated. AWS used AI to modernize 40 million lines of COBOL for Toyota. Microsoft’s CTO predicts that 95% of all code will be AI-generated by 2030. The rewriting of the world’s software is not coming. It is underway.
Anthropic recently built a 100,000-line C compiler using parallel AI agents in two weeks, for under $20,000. It boots Linux and compiles SQLite, PostgreSQL, Redis, and Lua. AI can now produce large-scale software at astonishing speed. But can it prove the compiler correct? Not yet.
No one is formally verifying the result.
Andrej Karpathy described the pattern: “I ‘Accept All’ always, I don’t read the diffs anymore.” When AI code is good enough most of the time, humans stop reviewing carefully. Nearly half of AI-generated code fails basic security tests, and newer, larger models do not generate significantly more secure code than their predecessors. The errors are there. The reviewers are not. Even Karpathy does not trust it: he later outlined a cautious workflow for “code [he] actually care[s] about,” and when he built his own serious project, he hand-coded it.
Consider what happens at scale. A single bug in OpenSSL — Heartbleed — exposed the private communications of millions of users, survived two years of code review, and cost the industry hundreds of millions of dollars to remediate. That was one bug, introduced by one human, in one library. AI is now generating code at a thousand times the speed, across every layer of the software stack, and the defenses we relied on (code review, testing, manual inspection) are the same ones that missed Heartbleed for two years.
The Harvard Business Review recently documented what it calls “workslop”: AI-generated work that looks polished but requires someone downstream to fix. When that work is a memo, it is annoying. When it is a cryptographic library, it is catastrophic. As AI accelerates the pace of software production, the verification gap does not shrink. It widens. Engineers stop understanding what their systems do. AI outsources not just the writing but the thinking.
The threat extends beyond accidental errors. When AI writes the software, the attack surface shifts: an adversary who can poison training data or compromise the model’s API can inject subtle vulnerabilities into every system that AI touches. These are not hypothetical risks. Supply chain attacks are already among the most damaging in cybersecurity, and AI-generated code creates a new supply chain at a scale that did not previously exist. Traditional code review cannot reliably detect deliberately subtle vulnerabilities, and a determined adversary can study the test suite and plant bugs specifically designed to evade it. A formal specification is the defense: it defines what “correct” means independently of the AI that produced the code. When something breaks, you know exactly which assumption failed, and so does the auditor.
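The idea can be sketched in miniature. Below is a toy executable specification for sorting (the names `sort_spec` and `untrusted_sort` are illustrative, not from any real verification tool): the spec states what "correct" means as two independent properties, so it catches a subtly wrong implementation regardless of who, or what, wrote it.

```python
from collections import Counter

def sort_spec(inp, out):
    """Executable specification for sorting: defines 'correct'
    independently of whoever (or whatever) wrote the implementation."""
    is_permutation = Counter(inp) == Counter(out)           # nothing added or lost
    is_ordered = all(a <= b for a, b in zip(out, out[1:]))  # non-decreasing
    return is_permutation, is_ordered

def untrusted_sort(xs):
    # Stand-in for generated code: looks plausible, silently drops duplicates.
    return sorted(set(xs))

inp = [3, 1, 2, 1]
perm_ok, order_ok = sort_spec(inp, untrusted_sort(inp))
# The ordering property holds, but the permutation property fails,
# so the spec pinpoints exactly which assumption broke.
print(perm_ok, order_ok)  # False True
```

A test suite might miss this bug if no test case happens to contain duplicates; the specification makes the requirement explicit, which is precisely the auditability the paragraph above describes.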
Poor software quality already costs the U.S. economy $2.41 trillion per year, according to a 2022 study by the Consortium for Information & Software Quality. That number was calculated before AI began writing a quarter or more of new code at leading companies. Chris Lattner, the creator of LLVM and Clang, put it bluntly: AI amplifies both good and bad structure. Bad code at AI speed becomes “incomprehensible nightmares.” As AI generates an increasing share of the world’s critical infrastructure (financial systems, medical devices, defense, transportation), unverified code becomes a systemic risk, not just a quality problem.