OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied
Published on: 2025-04-20 21:19:26
A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away — the next-best model managed to answer only around 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched last week.
Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Frid