A robot is sprinting towards you. Do you want it running on Claude or Grok?

Jacky Liang · 6/4/2026

A robot is running at you. Do you want it running on Anthropic’s Claude or xAI’s Grok?

I dropped eleven LLMs into a 2D battle royale and made them play 30 games. One won 43% of the matches. Three never won a single game. The cheapest model in the lineup beat the most expensive one by 27x on cost per win.

The model that won is Grok 4.1 Fast. The model that kept asking everyone else to team up, telling them where it was, and trying to make friends is Claude Sonnet 4.6. The first one is the one that wins a battle royale. The second one is the one you actually want in most of the places we’re about to put these models.

Both of those things are true. That’s the part most benchmarks can’t see, and it’s what this post is about.

I’m Jacky, and I’ll admit it: I used to play a lot of video games like Apex Legends and PUBG. Twelve-hour days sometimes. I don’t know how I had the time, but those years shaped how I think about problems.

When I started working in AI, one question kept coming back: what happens if you drop large language models into a video game? Then I joined OpenRouter as Dev Rel Lead, which gave me the token budget and access to 600+ models to actually try it.

This is the experiment I ran in my first week at OpenRouter.

And it’s changing how I pick models and how I read benchmarks and evaluations.

Three quick facts
