The Percentage of Tasks AI Agents Are Currently Failing At May Spell Trouble for the Industry

It's safe to say there's a lot riding on "artificial intelligence," a buzzy and nebulous swath of the tech industry peddling all kinds of large language model (LLM) and similar software products.

Since ChatGPT emerged in November 2022, venture capital investment in AI has skyrocketed, rising to $131.5 billion in 2024, an increase of 52 percent compared to 2023. In the last three months of 2024, over half of all venture capital in the world went to AI companies.

One of the flashier bits of tech attracting investors is "AI agents," which are software products designed to complete multi-part tasks on behalf of their human taskmasters. Tech companies and big corporations have spilled tankers of ink hyping up these agents, insisting they will "replace knowledge work" and bring about a "fundamental shift in how businesses operate."

But despite these lofty promises and the money behind them, there's mounting evidence that AI agents are just the latest empty promise from the tech industry.

In May, researchers at Carnegie Mellon University released a paper showing that even the best-performing AI agent, Google's Gemini 2.5 Pro, failed to complete real-world office tasks 70 percent of the time. Factoring in partially completed tasks — which included work like responding to colleagues, web browsing, and coding — only brought Gemini's failure rate down to 61.7 percent.

And the vast majority of its competing agents did substantially worse.

OpenAI's GPT-4o, for example, had a failure rate of 91.4 percent, while Meta's Llama-3.1-405b had a failure rate of 92.6 percent. Amazon's Nova-Pro-v1 failed a ludicrous 98.3 percent of its office tasks.

Meanwhile, a recent report by Gartner, a tech consulting firm, predicts that over 40 percent of AI agent projects initiated by businesses will be cancelled by 2027 thanks to out-of-control costs, vague business value, and unpredictable security risks.

"Most agentic AI projects right now are early stage experiments or proof of concepts that are mostly driven by hype and are often misapplied," said Anushree Verma, a senior director analyst at Gartner.

The report notes an epidemic of "agent washing," where existing products are rebranded as AI agents to cash in on the current tech hype. Examples include Apple's "Intelligence" feature on the iPhone 16, over which the company currently faces a class action lawsuit, and investment firm Delphia's fake "AI financial analyst," for which it faced a $225,000 fine.
