OpenAI is trying to make the case that AI can actually be useful at work, as some recent studies have shown that companies aren't getting much out of their AI investments. On Tuesday, the ChatGPT-maker released a report introducing a new benchmark for testing AI on "economically valuable, real-world tasks" across 44 different jobs. The evaluation is called GDPval, and OpenAI says it's meant to ground workplace AI debates in evidence rather than hype, and to track how models improve over time.

It comes on the heels of a recent MIT Media Lab study that found fewer than one in ten AI pilot projects delivered measurable revenue gains and warned that "95 percent of organizations are getting zero return" on their AI bets. And just last week, researchers from BetterUp Labs and Stanford's Social Media Lab, writing in Harvard Business Review, blamed "workslop" for the lackluster results. They define workslop as "AI-generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task."

OpenAI argues that GDPval fills a gap left by existing benchmarks, which typically test AI models on abstract academic problems rather than the kinds of day-to-day tasks people actually do at work.

What GDPval measures

"We call this evaluation GDPval because we started with the concept of Gross Domestic Product (GDP) as a key economic indicator and drew tasks from the key occupations in the industries that contribute most to GDP," OpenAI wrote in a blog post announcing the report.

The first version of the benchmark spans 44 jobs across the nine industries that make up the largest share of U.S. GDP, including real estate, government, manufacturing, and finance. Within each sector, OpenAI zeroed in on the roles with the highest wages and compensation, focusing on what it calls knowledge work.

To build the test set, OpenAI recruited professionals from those industries, averaging 14 years of experience, to design real-world tasks. Each expert also created a human-written example of how the task should be done. Example assignments include drafting a legal brief, producing an engineering blueprint, handling a customer support exchange, or writing a nursing care plan. The benchmark contains 30 fully reviewed tasks per occupation, plus a smaller "gold set" of five open-sourced tasks per occupation.

To measure performance, OpenAI used expert graders: professionals from the same fields represented in the dataset. These graders blindly compared the AI-generated deliverables against those produced by the task writers, offering critiques and rating each AI deliverable as better than, as good as, or worse than the human-written one.
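Those pairwise verdicts roll up into the "win and tie rate" OpenAI reports for each model. As a rough illustration of that arithmetic (this is not OpenAI's actual tooling; the records and function below are invented for the example), a minimal sketch might look like this:

```python
from collections import Counter

# Each blinded grader verdict says whether the AI deliverable was judged
# "better", "as_good", or "worse" than the human expert's version.
# These records are hypothetical, for illustration only.
verdicts = [
    {"task": "legal_brief_01", "model": "model_a", "judgment": "better"},
    {"task": "care_plan_07", "model": "model_a", "judgment": "as_good"},
    {"task": "blueprint_03", "model": "model_a", "judgment": "worse"},
    {"task": "support_ticket_12", "model": "model_a", "judgment": "as_good"},
]

def win_and_tie_rate(records):
    """Share of blinded comparisons where the AI deliverable was rated
    better than, or as good as, the human expert's deliverable."""
    counts = Counter(r["judgment"] for r in records)
    total = sum(counts.values())
    return (counts["better"] + counts["as_good"]) / total if total else 0.0

print(f"win and tie rate: {win_and_tie_rate(verdicts):.1%}")  # 75.0%
```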
What GDPval found

The report found that today's top AI models are already closing in on the quality of work produced by human experts. In tests on 220 tasks from the GDPval gold set, evaluators compared deliverables from seven leading models against those of industry professionals.

Claude Opus 4.1 came out on top, with a 47.6% win and tie rate against human-completed tasks. It was especially strong on aesthetics, such as document formatting and slide layout. GPT-5 high came in second with a win and tie rate of 38.8%; its strength was accuracy, such as carefully following instructions and performing correct calculations. GPT-4o came in last with a win and tie rate of only 12.4%.

The AI models performed particularly well on tasks from occupations like counter and rental clerks; shipping, receiving, and inventory clerks; sales managers; and software developers. They struggled more with tasks from occupations such as industrial engineers, medical engineers, pharmacists, financial managers, and video editors. For example, Claude Opus 4.1 had its highest win and tie rates on tasks done by counter and rental clerks (81%) and by shipping, receiving, and inventory clerks (76%). Its lowest scores came on tasks performed by industrial engineers and film and video editors (both 17%), and by audio and video technicians (2%).

OpenAI also claims these models can knock out GDPval tasks around 100 times faster and 100 times cheaper than human experts. Still, OpenAI stressed that even as AI reshapes the job market, it won't be able to completely replace humans. As the company put it, "most jobs are more than just a collection of tasks that can be written down."

"GDPval highlights where AI can handle routine tasks so people can spend more time on the creative, judgment-heavy parts of work," OpenAI wrote.