
Amazon’s bet that AI benchmarks don’t matter



This is an excerpt of Sources by Alex Heath, a newsletter about AI and the tech industry, syndicated just for The Verge subscribers once a week.

Amazon’s AI chief has a message for the model benchmark obsessives: Stop looking at the leaderboards.

“I want real-world utility. None of these benchmarks are real,” Rohit Prasad, Amazon’s SVP of AGI, told me ahead of today’s announcements at AWS re:Invent in Las Vegas. “The only way to do real benchmarking is if everyone conforms to the same training data and the evals are completely held out. That’s not what’s happening. The evals are frankly getting noisy, and they’re not showing the real power of these models.”

It’s a contrarian stance when every other AI lab is quick to boast about how its new models climb the leaderboards. It’s also convenient for Amazon, given that the previous version of Nova, its flagship model, was sitting at spot 79 on LMArena when Prasad and I spoke last week. Still, dismissing benchmarks only works if Amazon can offer a different story about what progress looks like.

“They’re not showing the real power of these models.”

The centerpiece of today’s re:Invent announcements is Nova Forge, a service that Amazon claims lets companies train custom AI models in ways previously impossible without spending billions of dollars. The problem Forge addresses is real. Most companies trying to customize AI models face three bad options: fine-tune a closed model (but only at the edges), train on open-weight models (but without the original training data and risking capability regression, where the AI becomes an expert on new data but forgets original, broader skills), or build a model from scratch at enormous cost.

Forge offers something else: access to Amazon’s Nova model checkpoints at the pre-training, mid-training, and post-training stages. Companies can inject their proprietary data early in the process, when the model’s “learning capacity is highest,” as Prasad put it, rather than just tweaking model behavior at the end.
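
For a concrete sense of what “injecting data at a checkpoint” means, here is a minimal, generic sketch using the open-source Hugging Face Transformers library. Nova Forge’s actual API isn’t shown in this piece, so the checkpoint name, dataset, and hyperparameters below are placeholders, not Amazon’s implementation; the point is simply that training continues from an intermediate checkpoint on new data rather than tweaking a finished model at the end.

```python
# Illustrative only: continuing causal-LM training from a mid-training
# checkpoint with Hugging Face Transformers. All names here are hypothetical
# stand-ins; Nova Forge's real interface is not described in the article.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical open checkpoint standing in for a "mid-training" model snapshot.
CHECKPOINT = "EleutherAI/pythia-160m"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)

# A company's proprietary domain data would go here; a public corpus slice
# is used purely as a placeholder.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Resume language-model training on the injected data, rather than only
# fine-tuning behavior at the very end of the pipeline.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="domain-continued-pretraining",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=1e-5,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```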

“What we have done is democratize AI and frontier model development for your use cases at fractions of what it would cost [before],” Prasad said. Forge was created because Amazon’s internal teams wanted a tool to inject their domain expertise into a base model without having to build from scratch.

“We built Forge because our internal teams wanted Forge,” he said. It’s a familiar Amazon pattern. AWS itself famously began as infrastructure built for Amazon’s own retail operation before becoming the company’s profit engine.
