ZDNET's key takeaways

- AI's efficacy at work is still proving lukewarm at best.
- OpenAI's new evaluation measures its GDP impact in certain tasks.
- Companies are under pressure to justify their tools' existence.

Despite so many AI tools flooding the market, promising increased productivity and even fully automated work, their impact so far has been inconsistent at best. As a recent MIT report noted, 95% of enterprise AI projects have failed; elsewhere, bosses are getting unsatisfactory AI-generated "workslop" from their direct reports, creating hours of additional labor -- not the intended effect.

Also: AI helps strong dev teams and hurts weak ones, according to Google's 2025 DORA report

OpenAI's new evaluation, GDPval, aims to change that by "measuring how AI performs on real-world, economically valuable tasks," the company said in an announcement Thursday. Companies and third-party testers already use industry benchmarks and other evaluations to determine how capable models are at tasks like coding and math. However, these can lean more academic than realistic once models are deployed; GDPval aims to narrow that gap between theory and practice.

What GDPval measures

GDPval measures how models tackle 1,320 tasks associated with 44 occupations -- mostly knowledge-work jobs -- across the top nine industries that each contribute more than 5% to US gross domestic product (GDP). Using May 2024 data from the US Bureau of Labor Statistics (BLS) and the Department of Labor's O*NET database, OpenAI included some expected professions, like software engineers, lawyers, and video editors, as well as some less commonly associated with AI so far, including detectives, pharmacists, and social workers.

GDPval's 44 occupations.
According to OpenAI, the tasks were created by professionals with an average of 14 years of experience in relevant fields to reflect "real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan," the company wrote in a blog post.

Also: The fastest growing AI chatbot lately? It's not ChatGPT or Gemini

"Unlike other evaluations tied to economic value which concentrate on specific domains (e.g., SWE-Lancer), GDPval covers many tasks and occupations," OpenAI said. Rather than using text prompts alone, GDPval gives models reference files and specifies multimodal deliverables, like slides and documents, to simulate what users would expect in a work environment. "This realism makes GDPval a more realistic test of how models might support professionals," OpenAI added.

How models are performing

OpenAI had experienced professionals blindly grade outputs from OpenAI's GPT-4o, o4-mini, o3, and GPT-5 models, as well as Anthropic's Claude Opus 4.1, Google's Gemini 2.5 Pro, and xAI's Grok 4. Graders compared these outputs with human-generated ones without knowing which was which. OpenAI supplemented this with an "autograder," an AI system that predicts how humans will evaluate deliverables. The company said it will release the autograder as an experimental research tool for those who want to try it, though it cautions that the autograder is not as reliable as human graders and won't be replacing them anytime soon.

Also: How people actually use ChatGPT vs Claude - and what the differences tell us

"We found that today's best frontier models are already approaching the quality of work produced by industry experts," OpenAI wrote. "Claude Opus 4.1 was the best performing model in the set, excelling in particular on aesthetics (e.g., document formatting, slide layout), and GPT-5 excelled in particular on accuracy (e.g., finding domain-specific knowledge)."
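In outline, the blind grading protocol OpenAI describes amounts to tallying, for each model, how often experts preferred its deliverable over the human expert's. Here is a minimal sketch of that tally; the data format and field names are hypothetical illustrations, not OpenAI's actual schema:

```python
from collections import defaultdict

# Hypothetical blinded judgments: an expert grader saw a model deliverable
# and a human deliverable without knowing which was which, and picked a winner.
judgments = [
    {"model": "model_a", "winner": "model"},
    {"model": "model_a", "winner": "human"},
    {"model": "model_a", "winner": "tie"},
    {"model": "model_b", "winner": "human"},
    {"model": "model_b", "winner": "model"},
    {"model": "model_b", "winner": "model"},
]

def win_rates(judgments):
    """Win rate of each model against human experts; ties count as half a win."""
    score = defaultdict(float)
    total = defaultdict(int)
    for j in judgments:
        total[j["model"]] += 1
        if j["winner"] == "model":
            score[j["model"]] += 1.0
        elif j["winner"] == "tie":
            score[j["model"]] += 0.5
    return {m: score[m] / total[m] for m in total}

print(win_rates(judgments))
```

An autograder like the one OpenAI mentions would plug into the same tally by predicting the "winner" field instead of asking a human expert for it.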
The research also showed performance "more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025)," OpenAI added, indicating that model capabilities are improving rapidly. The kicker, of course, is cost.

Also: Meet Asana's new 'AI Teammates,' designed to collaborate with you

"We found that frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts," OpenAI wrote. "However, these figures reflect pure model inference time and API billing rates, and therefore do not capture the human oversight, iteration, and integration steps required in real workplace settings to use our models."

Caveats

In the blog post, OpenAI noted that GDPval is "an early step that doesn't reflect the full nuance of many economic tasks." It only conducts one-off evaluations, meaning it can't measure whether a model could complete multiple drafts of a project or successfully absorb context for an ongoing task. For example, GDPval currently can't assess whether a model could successfully edit a brief based on client feedback or redo a data analysis around an anomaly.

Also: I tested ChatGPT's Deep Research against Gemini, Perplexity, and Grok AI to see which is best

OpenAI added the important caveat that work in the real world isn't always cut and dried -- not every task comes with an organized set of files or a clear directive. The human -- and deeply contextual -- work of exploring a problem through conversation and dealing with ambiguity or shifting circumstances can't be captured by something like GDPval at this stage. "Most jobs are more than just a collection of tasks that can be written down," OpenAI said. The company added that future iterations will try, though, by spanning more industries and harder-to-automate tasks, like those that involve interactive workflows or lots of prior context (something AI agents, for example, currently struggle with).
OpenAI said it will release a subset of GDPval tasks for researchers to use in their own work and to expand the project.

What comes next

OpenAI's conclusion from these results is more of what we've become used to hearing: AI will inevitably continue to disrupt the job market, as it already has, and can theoretically take on busywork to free up workers' time for more complex tasks. "Especially on the subset of tasks where models are particularly strong, we expect that giving a task to a model before trying it with a human would save time and money," OpenAI said, perhaps unsurprisingly.

Also: Forget quiet quitting - AI 'workslop' is the new office morale killer

Despite noting how competitive models have become with human experts, OpenAI reiterated its familiar line: that it plans to democratize access to AI tools in order to keep "supporting workers through change, and building systems that reward broad contribution." "Our goal is to keep everyone on the 'up elevator' of AI," the company wrote -- which, contradicting recent surveys, assumes everyone is having that experience to begin with.