LMArena is a cancer on AI

Would you trust a medical system measured by: which doctor would the average Internet user vote for?

No?

Yet that malpractice is LMArena.

The AI community treats this popular online leaderboard as gospel. Researchers cite it. Companies optimize for it and set it as their North Star. But beneath the sheen of legitimacy lies a broken system that rewards superficiality over accuracy.

It's like going to the grocery store and buying tabloids, pretending they're scientific journals.

The Problem: Beauty Over Substance

Here's how LMArena is supposed to work: enter a prompt, evaluate two responses, and mark the best. What actually happens: random Internet users spend two seconds skimming, then click their favorite.

They're not reading carefully. They're not fact-checking, or even trying.

This creates a perverse reward structure. The easiest way to climb the leaderboard isn't to be smarter; it’s to hack human attention span. We’ve seen over and over again in the data, both from datasets that LMArena has released and the performance of models over time, that the easiest way to boost your ranking is by:

Being verbose. Longer responses look more authoritative!

... continue reading