Machine learning turns on one simple trick: Split your data into training and test sets. Anything goes on the training set; rank the models on the test set. Let the model builders compete. Call this a benchmark.
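The protocol in that paragraph can be sketched in a few lines of code. This is a minimal illustration, not any particular benchmark's implementation; the function names, the toy parity task, and the two competing models are all invented for the example.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of test examples the model labels correctly."""
    return sum(model(x) == y for x, y in test_set) / len(test_set)

def benchmark(models, test_set):
    """Rank competing models by test-set accuracy, best first."""
    scores = {name: accuracy(m, test_set) for name, m in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy task: decide whether an integer is even.
data = [(x, x % 2 == 0) for x in range(100)]
train, test = train_test_split(data)

# Anything goes on the training set; here two (hypothetical) entrants
# skip training entirely and just submit fixed rules.
models = {
    "always_even": lambda x: True,
    "parity": lambda x: x % 2 == 0,
}
leaderboard = benchmark(models, test)
```

Whatever the entrants do with the training set, the leaderboard only sees their test-set scores, which is the entire contract of a benchmark.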
Machine learning researchers cherish a good tradition of lamenting the shortcomings of machine learning benchmarks. Critics argue that static test sets and metrics promote narrow research objectives, stifling more creative scientific pursuits. Benchmarks also incentivize gaming the metrics, leading to inflated scores. Goodhart’s law cautions against competing over statistical measurements, but benchmarking ignores the warning. Over time, critics say, researchers overfit to benchmark datasets, building models that exploit artifacts. As a result, test set performance draws a skewed picture of model capabilities, deceiving us especially when comparing humans and machines. Add to this a slew of reasons why things don’t transfer from benchmarks to the real world.
These scorching critiques go hand in hand with ethical objections. Benchmarks reinforce and perpetuate biases in our representation of people, social relationships, culture, and society. Worse, the creation of massive human-annotated datasets extracts labor from a marginalized workforce excluded from the economic gains it enables.
All of this is true.
Many have said it well. The critics have argued it convincingly. I’m particularly drawn to the claim that benchmarks serve industry objectives, giving big tech labs a structural advantage. The case against benchmarks is clear, in my view.
What’s far less clear is the scientific case for benchmarks.
It’s undeniable that benchmarks have been successful as a driver of progress in the field. ImageNet was inseparable from the deep learning revolution of the 2010s, with companies competing fiercely over the best dog breed classifiers. The difference between a Blenheim Spaniel and a Welsh Springer became a matter of serious rivalry. A decade later, language model benchmarks reached geopolitical significance in the global competition over artificial intelligence. Tech CEOs recite their company’s number on MMLU—a set of college-level multiple-choice questions—in presentations to shareholders. News that DeepSeek’s R1 beat OpenAI’s o1 on some challenging reasoning benchmarks launched a frenzy that shook global stock markets.
Benchmarks come and go, but their centrality hasn’t changed. Competitive leaderboard climbing has been the main way machine learning advances.
If we accept that progress in artificial intelligence is real, we must also accept that benchmarks have, in some sense, worked. But the fact that benchmarks worked is more of a hindsight observation than a scientific lesson. Benchmarks emerged in the early days of pattern recognition. They followed no scientific principles. To the extent that benchmarks had any theoretical support, that theory was readily invalidated by how people used benchmarks in practice. Statistics prescribed locking test sets in a vault, but machine learning practitioners did the opposite. They put them on the internet for everyone to use freely. Popular benchmarks draw millions of downloads and evaluations as model builders incrementally compete over better numbers.
Benchmarks are the mistake that made machine learning. They shouldn’t have worked and, yet, they did. In this book, my goal is to shed light on why benchmarks work and what for.