When OpenAI launched GPT-5 last week, it told software engineers the model was designed to be a "true coding collaborator" that excels at generating high-quality code and performing agentic, or automated, software tasks. While the company didn't say so explicitly, OpenAI appeared to be taking direct aim at Anthropic's Claude Code, which has quickly become many developers' favored tool for AI-assisted coding.

But developers tell WIRED that GPT-5 has been a mixed bag so far. It shines at technical reasoning and at planning coding tasks, but some say that Anthropic's newest Opus and Sonnet reasoning models still produce better code. Depending on which verbosity setting developers choose (low, medium, or high), GPT-5 can also be more long-winded, which sometimes leads it to generate unnecessary or redundant lines of code.

Some software engineers have also criticized how OpenAI evaluated GPT-5's performance at coding, arguing that the benchmarks it used are misleading. One research firm called a graphic OpenAI published boasting about GPT-5's capabilities a "chart crime."

GPT-5 does stand out in at least one way: Several people noted that, compared with competing models, it is a much more cost-effective option.

"GPT-5 is mostly outperformed by other AI models in our tests, but it's really cheap," says Sayash Kapoor, a computer science doctoral student and researcher at Princeton University who co-wrote the book AI Snake Oil.

Kapoor says he and his team have been running benchmark tests to evaluate GPT-5's capabilities since the model was released to the public last week. He notes that the standard test his team uses, which measures how well a language model can write code that reproduces the results of 45 scientific papers, costs $30 to run with GPT-5 set to medium, or mid-range, verbosity. The same test using Anthropic's Opus 4.1 costs $400. In total, Kapoor says his team has spent around $20,000 testing GPT-5 so far.

Although GPT-5 is cheap, Kapoor's tests indicate the model is also less accurate than some of its competitors. Claude's premium model achieved a 51 percent accuracy rating, measured by how many of the scientific papers it accurately reproduced; the medium version of GPT-5 scored 27 percent. (Kapoor has not yet run the same test using GPT-5 high, so the comparison is indirect, given that Opus 4.1 is Anthropic's most powerful model.)

OpenAI spokesperson Lindsay McCallum referred WIRED to the company's blog, where OpenAI said it trained GPT-5 on "real-world coding tasks in collaboration with early testers across startups and enterprises." The company also highlighted some of its internal accuracy measurements for GPT-5, which showed that the GPT-5 "thinking" model, which does more deliberate reasoning, scored highest on accuracy among all of OpenAI's models. GPT-5 "main," however, still fell short of previously released models on OpenAI's own accuracy scale.

Anthropic spokesperson Amie Rotherham said in a statement that "performance claims and pricing models often look different once developers start using them in production environments. Since reasoning models can quickly use a lot of tokens while thinking, the industry is moving to a world where price per outcome matters more than price per token."
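
Rotherham's "price per outcome" framing can be made concrete with the numbers cited above. The back-of-envelope Python sketch below assumes each benchmark run covers all 45 papers and that the accuracy rating translates directly into papers reproduced; it is illustrative arithmetic, not a reproduction of Kapoor's methodology.

```python
# Back-of-envelope cost-per-outcome comparison using the figures in this
# story: 45 papers per run, $30 vs. $400 per run, 27% vs. 51% accuracy.
PAPERS = 45

runs = {
    "GPT-5 (medium verbosity)": {"run_cost": 30, "accuracy": 0.27},
    "Claude Opus 4.1": {"run_cost": 400, "accuracy": 0.51},
}

for model, r in runs.items():
    reproduced = PAPERS * r["accuracy"]            # expected papers reproduced
    cost_per_outcome = r["run_cost"] / reproduced  # dollars per reproduced paper
    print(f"{model}: ~{reproduced:.0f} of {PAPERS} papers, "
          f"${cost_per_outcome:.2f} per successful reproduction")
```

On those figures, GPT-5 works out to roughly $2.47 per successfully reproduced paper versus roughly $17.43 for Opus 4.1, so even measured per outcome rather than per token, GPT-5 remains the cheaper option in this particular test.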
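
For developers who want to experiment with the verbosity behavior described earlier, OpenAI exposes it as an API setting. The minimal sketch below assumes the OpenAI Python SDK's Responses API as documented at GPT-5's launch, where verbosity (low, medium, or high) is configured separately from reasoning effort; check the exact parameter shapes against the current documentation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Request terse output: lower verbosity tends to mean fewer extraneous
# lines of code, one of the complaints developers raised above.
response = client.responses.create(
    model="gpt-5",
    text={"verbosity": "low"},       # "low", "medium", or "high"
    reasoning={"effort": "medium"},  # a separate knob from verbosity
    input="Write a Python function that deduplicates a list while preserving order.",
)
print(response.output_text)
```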