
A New Paper Tested AI’s Ability to Do Actual Online Freelance Work, and the Results Are Damning


Your pesky remote freelancers demanding more money as inflation soars? You could try replacing them with AI agents instead — but it probably won’t work out well.

New research highlighted by Wired shows how these AI models designed to automate tasks — if not entire jobs — turn out to be incredibly unproductive compared to the humans they’re replacing.

Conducted by researchers at the nonprofit Center for AI Safety (CAIS) and the massive data annotation firm Scale AI, whose army of freelancers performs much of the grunt work underpinning the AI industry, the tests involved giving six leading AI agents various simulated freelance tasks.

The outcome of those tests, detailed in a new paper, was damning. Not a single AI agent was able to perform more than 3 percent of the work, making just $1,810 out of a possible $143,991.

“I should hope this gives much more accurate impressions as to what’s going on with AI capabilities,” CAIS director Dan Hendrycks told Wired.

For the tests, the researchers developed their own benchmark called the Remote Labor Index, which uses a wide range of real-world remote projects to evaluate the bots’ ability to perform economically valuable work in industries ranging from game development to data analysis.

The top performer, they found, was an AI agent from the Chinese startup Manus, with an automation rate of just 2.5 percent, meaning it completed only 2.5 percent of its assigned projects at a level the researchers judged acceptable as commissioned work in a real-world freelancing job.

Second place was a tie, at 2.1 percent, between Elon Musk’s Grok 4 and Anthropic’s Claude Sonnet 4.5, which Anthropic claims is the “best coding model in the world” and the “strongest model for building complex agents.”

OpenAI’s newest GPT-5 model, with its purported “PhD level” intelligence, came next at 1.7 percent. CEO Sam Altman has claimed that GPT-5 is a “significant step along the path to AGI,” or artificial general intelligence, a hypothetical AI system that most define as exceeding human cognitive abilities in virtually all aspects. (OpenAI considers AGI to be “highly autonomous systems that outperform humans at most economically valuable work,” something that the RLI benchmark shows GPT-5 is nowhere close to doing.)

Ironically, OpenAI’s actual AI agent, with the exciting brand name of ChatGPT Agent, was the second-worst performer of the whole bunch, barely cracking 1.3 percent. But the absolute bottom of the barrel turned out to be Google’s Gemini 2.5 Pro, with a dismal 0.8 percent showing.
