
AI benchmarks are a bad joke – and LLM makers are the ones laughing


AI companies regularly tout their models' performance on benchmark tests as a sign of technological and intellectual superiority. But those results, widely used in marketing, may not be meaningful.

A study [PDF] from researchers at the Oxford Internet Institute (OII) and several other universities and organizations has found that only 16 percent of 445 LLM benchmarks for natural language processing and machine learning use rigorous scientific methods to compare model performance.

What's more, about half the benchmarks claim to measure abstract ideas like reasoning or harmlessness without offering a clear definition of those terms or how to measure them.

In a statement, Andrew Bean, lead author of the study, said, "Benchmarks underpin nearly all claims about advances in AI. But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."

When OpenAI released GPT-5 earlier this year, the company's pitch rested on a foundation of benchmark scores, such as those from AIME 2025, SWE-bench Verified, Aider Polyglot, MMMU, and HealthBench Hard.

These tests present AI models with a series of questions, and model makers strive to have their bots answer as many of them correctly as possible. The questions or challenges vary depending on the focus of the test. For a math-oriented benchmark like AIME 2025, AI models are asked to answer questions like:

Find the sum of all positive integers $n$ such that $n+2$ divides the product $3(n+3)(n^2+9)$.
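That particular problem has a compact answer: the qualifying values of $n$ are 1, 11, and 37, which sum to 49. As a rough illustration of what the benchmark is grading, here is a minimal brute-force check in Python. It is our own sketch, not part of AIME or the study, and a model scoring on the benchmark would of course have to reach the result by reasoning rather than by search:

```python
# Brute-force check of the AIME-style question above (illustrative only).
# The product 3*(n + 3)*(n**2 + 9) is congruent to 39 modulo (n + 2), so any
# solution satisfies (n + 2) | 39, meaning n <= 37 and a small search bound is safe.
solutions = [n for n in range(1, 1000) if (3 * (n + 3) * (n**2 + 9)) % (n + 2) == 0]
print(solutions, sum(solutions))  # [1, 11, 37] 49
```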

"[GPT-5] sets a new state of the art across math (94.6 percent on AIME 2025 without tools), real-world coding (74.9 percent on SWE-bench Verified, 88 percent on Aider Polyglot), multimodal understanding (84.2 percent on MMMU), and health (46.2 percent on HealthBench Hard)—and those gains show up in everyday use," OpenAI said at the time. "With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4 percent without tools."

But, as noted in the OII study, "Measuring what Matters: Construct Validity in Large Language Model Benchmarks," 27 percent of the reviewed benchmarks rely on convenience sampling, meaning that the sample data is chosen for the sake of convenience rather than using methods like random sampling or stratified sampling.

"For example, if a benchmark reuses questions from a calculator-free exam such as AIME," the study says, "numbers in each problem will have been chosen to facilitate basic arithmetic. Testing only on these problems would not predict performance on larger numbers, where LLMs struggle."
