
Study identifies weaknesses in how AI systems are evaluated


A new study led by the Oxford Internet Institute (OII) at the University of Oxford, involving a team of 42 researchers from leading global institutions including EPFL, Stanford University, the Technical University of Munich, UC Berkeley, the UK AI Security Institute, the Weizenbaum Institute, and Yale University, has found that many of the tests used to measure the capabilities and safety of large language models (LLMs) lack scientific rigour.

In Measuring What Matters: Construct Validity in Large Language Model Benchmarks, accepted for publication in the proceedings of the upcoming NeurIPS conference, the researchers review 445 AI benchmarks – the standardised evaluations used to compare and rank AI systems.

The researchers found that many of these benchmarks are built on unclear definitions or weak analytical methods, making it difficult to draw reliable conclusions about AI progress, capabilities or safety.

“Benchmarks underpin nearly all claims about advances in AI,” says Andrew Bean, lead author of the study. “But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to.”

Benchmarks play a central role in how AI systems are designed, deployed, and regulated. They guide research priorities, shape competition between models, and are increasingly referenced in policy and regulatory frameworks, including the EU AI Act, which calls for risk assessments based on “appropriate technical tools and benchmarks.”

The study warns that if benchmarks are not scientifically sound, they may give developers and regulators a misleading picture of how capable or safe AI systems really are.

“This work reflects the kind of large-scale collaboration the field needs,” adds Dr. Adam Mahdi. “By bringing together leading AI labs, we’re starting to tackle one of the most fundamental gaps in current AI evaluation.”

Key findings

Lack of statistical rigour

Only 16% of the reviewed studies used statistical methods when comparing model performance. This means that reported differences between systems or claims of superiority could be due to chance rather than genuine improvement.
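The paper does not prescribe a single remedy, but a common way to add this kind of rigour is to report uncertainty around benchmark scores rather than a bare accuracy number. The sketch below is a minimal, illustrative example and is not taken from the study: a paired bootstrap over benchmark items that yields a confidence interval for the accuracy gap between two models. The score arrays are synthetic stand-ins for per-question results.

```python
# Illustrative sketch only (not from the paper): a paired bootstrap test of
# whether model A's benchmark accuracy genuinely exceeds model B's, or
# whether the observed gap could plausibly be explained by chance.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-question results on the same 500-item benchmark:
# 1 = correct, 0 = incorrect. A real evaluation would load these from logs.
scores_a = rng.binomial(1, 0.78, size=500)   # stand-in for model A
scores_b = rng.binomial(1, 0.74, size=500)   # stand-in for model B

observed_gap = scores_a.mean() - scores_b.mean()

# Resample benchmark items with replacement, keeping the A/B pairing intact,
# and record the accuracy gap on each resample.
n_items = len(scores_a)
gaps = []
for _ in range(10_000):
    idx = rng.integers(0, n_items, size=n_items)
    gaps.append(scores_a[idx].mean() - scores_b[idx].mean())

low, high = np.percentile(gaps, [2.5, 97.5])
print(f"Observed gap: {observed_gap:+.3f}")
print(f"95% bootstrap confidence interval: [{low:+.3f}, {high:+.3f}]")
# If the interval includes 0, this benchmark run alone does not support the
# claim that one model is better than the other.
```

A paired bootstrap is only one of many possible checks; the broader point of the study is that some quantification of uncertainty is needed before a leaderboard gap is read as a genuine capability difference.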
