
Google, OpenAI, and Anthropic are competing to see whose AI can play Pokémon the best — Twitch streams of beloved RPG game test the models' true might


While innumerable benchmarks and tests exist to measure the capabilities of AI, one more obscure benchmark appears to be making waves in the AI community. Companies like Google, OpenAI, and Anthropic are now making their models play old-school Pokémon to evaluate performance, as reported by the Wall Street Journal.

"The thing that has made Pokémon fun and that has captured the [machine learning] community’s interest is that it’s a lot less constrained than Pong or some of the other games that people have historically done this on. It’s a pretty hard problem for a computer program to be able to do," Anthropic AI lead David Hershey told the outlet.

Visual explainer on how Claude plays Pokémon (Image credit: ClaudePlaysPokémon on Twitch)

It all started last year when Claude — Anthropic's frontier LLM — was put on a Twitch stream by Hershey, dubbed "Claude Plays Pokémon." Hershey is the applied AI lead at Anthropic, responsible for helping customers deploy the company's models, so the stream doubles as another way of testing them. Claude's gaming efforts have inspired freelance developers to launch similar "Gemini Plays Pokémon" and "GPT Plays Pokémon" streams, too.

These projects have received official recognition from Google and OpenAI, whose labs have even stepped in to tweak the models at times. That involvement has helped both Gemini and GPT beat Pokémon Blue, so they've moved on to the sequels, while no version of Claude has finished the game yet. The latest Opus 4.5 model is currently tackling the challenge on stream.

(Image credit: ClaudePlaysPokémon on Twitch)

Hershey says that using Pokémon to test these AI models is quite beneficial, as "it provides [us] with, like, this great way to just see how a model is doing and to evaluate it in a quantitative way." In the game, you have to level up and train your existing roster, capture new Pokémon, and defeat gym leaders to progress. It's not a simple linear progression, but one that requires judgment.

You're often faced with a choice: take a risk by challenging a powerful trainer, or sharpen the skills of the Pokémon you already have. Humans excel at decisions like these; they're part of the fun. For an AI, though, they're a test of logical reasoning, risk assessment, and long-term planning that affects overall progress. How a model chooses to play the game therefore helps researchers understand it better.

Hershey shares his findings with customers to improve the "harness" built around the AI for specific tasks. A harness is essentially the software framework that wraps a model, validating its output and directing its resources toward the requirements of a particular task. He applies what he learns from the Pokémon streams to real-world clients looking to improve their compute efficiency.
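To make the idea concrete, here is a minimal, hypothetical sketch of what such a harness loop looks like in principle: serialize the game state into an observation, ask the model for an action, validate it, and apply it to the environment. The `mock_model` function, the `GameState` fields, and the action set are all invented for illustration and stand in for a real LLM call and a real game; this is not Anthropic's actual harness.

```python
# Hypothetical harness sketch: a loop that feeds game state to a "model",
# validates the chosen action, and applies it to a toy environment.

from dataclasses import dataclass, field

@dataclass
class GameState:
    position: int = 0
    badges: int = 0
    history: list = field(default_factory=list)

def mock_model(observation: str) -> str:
    """Stand-in for an LLM: walk right until a gym appears, then fight."""
    return "fight" if "gym" in observation else "right"

VALID_ACTIONS = {"right", "left", "fight"}

def harness_step(state: GameState, model) -> GameState:
    # 1. Serialize the state into an observation the model can read.
    observation = f"position={state.position}"
    if state.position >= 3:
        observation += " gym ahead"
    # 2. Ask the model for an action and validate it -- real harnesses
    #    must guard against malformed model output.
    action = model(observation)
    if action not in VALID_ACTIONS:
        action = "right"  # safe fallback
    # 3. Apply the action to the environment and log it.
    if action == "right":
        state.position += 1
    elif action == "left":
        state.position = max(0, state.position - 1)
    elif action == "fight":
        state.badges += 1
    state.history.append(action)
    return state

state = GameState()
for _ in range(5):
    state = harness_step(state, mock_model)
print(state.badges, state.position)  # 2 badges earned at position 3
```

The validation step is the point: the harness, not the model, owns the game loop, so a bad completion degrades to a fallback action instead of crashing the run.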

