
OpenAI spills technical details about how its AI coding agent works


On Friday, OpenAI engineer Michael Bolin published a detailed technical breakdown of how the company’s Codex CLI coding agent works internally, offering developers insight into AI coding tools that can write code, run tests, and fix bugs with human supervision. It complements our article in December on how AI agents work by filling in technical details on how OpenAI implements its “agentic loop.”
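For readers unfamiliar with the term, an "agentic loop" is the cycle in which a model proposes an action (such as running a shell command), the harness executes it, and the output is fed back to the model until the task is done. The Python sketch below is a generic, hypothetical illustration of that pattern, not OpenAI's actual Codex code; `call_model`, `ModelReply`, and `agentic_loop` are invented names standing in for a real model API and its structured output.

```python
import subprocess
from dataclasses import dataclass

# Hypothetical sketch of an agentic loop, not OpenAI's Codex implementation.
# call_model() stands in for a request to a language model; a real agent would
# call a model API here and parse its structured response.

@dataclass
class ModelReply:
    text: str
    shell_command: str | None = None  # set when the model asks to run a command

def call_model(conversation: list[dict]) -> ModelReply:
    """Placeholder model call that returns canned replies for illustration."""
    if not any(m["role"] == "tool" for m in conversation):
        return ModelReply(text="Let me run the tests.",
                          shell_command="echo 'all tests pass'")
    return ModelReply(text="Tests pass; the task looks complete.")

def agentic_loop(task: str, max_turns: int = 10) -> str:
    conversation = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(conversation)
        conversation.append({"role": "assistant", "content": reply.text})
        if reply.shell_command is None:
            return reply.text  # no more actions requested; return the final answer
        # Execute the requested command and feed its output back to the model.
        result = subprocess.run(reply.shell_command, shell=True,
                                capture_output=True, text=True)
        conversation.append({"role": "tool", "content": result.stdout + result.stderr})
    return "Stopped after reaching the turn limit."

if __name__ == "__main__":
    print(agentic_loop("Fix the failing unit test in utils.py"))
```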

AI coding agents are having something of a "ChatGPT moment," with Claude Code running Opus 4.5 and Codex running GPT-5.2 reaching a new level of usefulness for rapidly coding up prototypes and interfaces and churning out boilerplate code. OpenAI's post details the design philosophy behind Codex just as AI agents are becoming more practical tools for everyday work.

These tools aren't perfect, and they remain controversial among some software developers. OpenAI has previously told Ars Technica that it uses Codex to help develop the Codex product itself, but in our own hands-on experience, these tools can be astonishingly fast at simple tasks yet remain brittle beyond their training data and require human oversight for production work. The rough framework of a project tends to come together fast and can feel magical, but filling in the details involves tedious debugging and workarounds for limitations the agent cannot overcome on its own.

Bolin's post doesn't shy away from these engineering challenges. He discusses the inefficiency of quadratic prompt growth, performance issues caused by cache misses, and bugs the team discovered and had to fix, such as MCP tools being enumerated inconsistently.
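"Quadratic prompt growth" describes a general property of chat-style agents: if every turn resends the entire conversation so far, the conversation itself grows linearly with the number of turns, but the cumulative number of tokens the model processes grows roughly with the square of the turn count. The sketch below illustrates that arithmetic with made-up numbers; the constants are assumptions for illustration, not figures from OpenAI's post.

```python
# Rough illustration of quadratic prompt growth (numbers are invented, not from
# OpenAI's post). If each turn resends the whole conversation, the per-turn
# context keeps growing, and the cumulative total grows roughly as n^2.

TOKENS_PER_TURN = 500  # assume each turn adds about this many tokens of new text

def cumulative_tokens(turns: int) -> int:
    total = 0
    context = 0
    for _ in range(turns):
        context += TOKENS_PER_TURN   # the conversation grows linearly...
        total += context             # ...but each turn reprocesses all of it
    return total

for n in (10, 20, 40):
    print(n, "turns ->", cumulative_tokens(n), "tokens processed")
# Doubling the number of turns roughly quadruples the total tokens processed,
# which is why prompt caching (and avoiding cache misses) matters so much.
```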

The level of technical detail is somewhat unusual for OpenAI, which has not published similar breakdowns of how other products like ChatGPT work internally (there's a lot going on under that hood that we'd like to know about). But we've already seen how OpenAI treats Codex differently: in our interview with the company in December, the team noted that programming tasks seem ideally suited for large language models.