Crowdsourced AI benchmarks have serious flaws, some experts say
Published on: 2025-04-23 03:30:00
AI labs are increasingly relying on crowdsourced benchmarking platforms such as Chatbot Arena to probe the strengths and weaknesses of their latest models. But some experts say this approach has serious ethical and academic problems.
Over the past few years, labs including OpenAI, Google, and Meta have turned to platforms that recruit users to help evaluate upcoming models’ capabilities. When a model scores favorably, the lab behind it will often tout that score as evidence of a meaningful improvement.
It’s a flawed approach, however, according to Emily Bender, a University of Washington linguistics professor and co-author of the book “The AI Con.” Bender takes particular issue with Chatbot Arena, which tasks volunteers with prompting two anonymous models and selecting the response they prefer.
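For context on what those head-to-head votes actually produce: leaderboards built on this kind of pairwise voting typically reduce the raw preferences to a single score per model. Chatbot Arena's published methodology describes fitting a Bradley-Terry model over all votes; the sketch below is not the Arena's actual code, just a minimal Elo-style online version of the same idea, with made-up model names and an assumed step size.

```python
# Illustrative sketch (not Chatbot Arena's actual implementation):
# an Elo-style update that turns pairwise "which response do you
# prefer?" votes into per-model scores. The real Arena leaderboard
# fits a Bradley-Terry model over the full vote set instead.
from collections import defaultdict

K = 32  # update step size (assumed value, conventional in Elo systems)

def expected(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed vote outcome."""
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

# Hypothetical votes: (preferred model, rejected model)
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]

ratings = defaultdict(lambda: 1000.0)  # all models start at the same score
for winner, loser in votes:
    update(ratings, winner, loser)

print(dict(ratings))
```

A score computed this way only summarizes which outputs voters happened to prefer; it says nothing about why they preferred them, which is the gap Bender's construct-validity critique points at.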
“To be valid, a benchmark needs to measure something specific, and it needs to have construct validity — that is, there has to be evidence that the construct …”