Gemini 3 Pro vs. 2.5 Pro in Pokemon Crystal

A little over two weeks ago I wrote up my first impressions of Gemini 3 Pro Preview inside the Gemini Plays Pokemon harness. The very next day, I spun up a head-to-head race on stream: Gemini 3 Pro vs. Gemini 2.5 Pro, both playing Pokemon Crystal inside the exact same setup.

Fast forward two weeks. Gemini 3 Pro became the Johto Champion without losing a single battle. Gemini 2.5 Pro inched toward the 4th badge, but spent a significant amount of time looping, effectively trapped in the Olivine Lighthouse, before finally breaking out.

On paper, this was a fair fight. In practice, Gemini 3 Pro behaved like a different species of agent.

This post walks through how the race was set up, what actually happened moment to moment, and what it suggests about the gap between Gemini 3 Pro and 2.5 Pro as long-horizon game-playing agents.

Setup: same harness, same rules

Both models ran in the same Gemini Plays Pokemon harness. There was no special casing and no hidden helper available to one model but not the other. The harness exposes a set of tools that any LLM running inside it can choose to use:

Mental Map: automatically tracks where the agent has explored, filling in fog of war as new tiles are revealed. It does not read map layout directly from RAM; it only updates based on tiles that have actually been visible on screen.

Notepad: a scratchpad for objectives, future plans, and puzzle progress, including hypotheses, failures, and successes.

Map Markers: persistent markers for points of interest such as NPCs or building entrances.

Code Execution: a way to run one-off snippets such as a pathfinding routine.
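
To make the Mental Map and Code Execution tools more concrete, here is a minimal Python sketch of how a fog-of-war map and a one-off pathfinding snippet could fit together. This is not the harness's actual code: the names MentalMap, reveal, and find_path, the tile symbols, and the grid coordinates are all hypothetical, chosen only for illustration. The one property taken from the description above is that the map learns tiles only when they have been visible on screen, so the pathfinder cannot route through unexplored territory.

```python
from collections import deque

# Hypothetical tile symbols; the real harness may encode tiles differently.
UNKNOWN, WALKABLE, BLOCKED = "?", ".", "#"


class MentalMap:
    """Illustrative fog-of-war grid: every tile starts unknown."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.tiles = [[UNKNOWN] * width for _ in range(height)]

    def reveal(self, visible):
        """Record tiles that were visible on screen.

        `visible` maps (x, y) to WALKABLE or BLOCKED. Nothing is read
        from game memory; only observed tiles are updated.
        """
        for (x, y), kind in visible.items():
            if 0 <= x < self.width and 0 <= y < self.height:
                self.tiles[y][x] = kind

    def is_walkable(self, x, y):
        in_bounds = 0 <= x < self.width and 0 <= y < self.height
        return in_bounds and self.tiles[y][x] == WALKABLE


def find_path(mental_map, start, goal):
    """Breadth-first search over explored, walkable tiles.

    Returns the shortest list of (x, y) steps from start to goal,
    or None if the goal cannot be reached through known tiles.
    """
    queue = deque([start])
    came_from = {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        x, y = current
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in came_from and mental_map.is_walkable(*nxt):
                came_from[nxt] = current
                queue.append(nxt)
    return None


# Usage: a 3x3 area where only the top two rows have been seen.
demo = MentalMap(3, 3)
demo.reveal({
    (0, 0): WALKABLE, (1, 0): BLOCKED, (2, 0): WALKABLE,
    (0, 1): WALKABLE, (1, 1): WALKABLE, (2, 1): WALKABLE,
})
route = find_path(demo, (0, 0), (2, 0))      # detours around the wall
unexplored = find_path(demo, (0, 0), (2, 2))  # None: bottom row is still fog
```

Breadth-first search is used here only because it is the simplest way to get a shortest path on an unweighted grid; the routines the models actually wrote during the race may have looked quite different.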
