
Show HN: A real-time strategy game that AI agents can play


LLM Skirmish: An Adversarial In-Context Learning Benchmark Where LLMs Write Code to Battle

TL;DR

- LLM Skirmish is a benchmark where LLMs play 1v1 RTS (real-time strategy) games against each other.
- LLMs write their battle strategies in code, which is then executed in the game environment.
- LLM Skirmish tests in-context learning: each tournament lasts five rounds, and LLMs can alter their strategies between rounds.

Introduction

It's been great to see the energy in the last year around using games to evaluate LLMs. Yet there's a strange disconnect between frontier LLMs one-shotting full coding projects and those same models struggling to get out of Pokemon Red's Mt. Moon. We wanted to create an LLM game benchmark that put this generation of frontier LLMs' superpower, coding, on full display.

Ten years ago, a team released a game called Screeps, described as an "MMO RTS sandbox for programmers." In Screeps, human players write JavaScript strategies that are executed in the game's environment. Players gain resources, lose territory, and have units wiped out. It's a traditional RTS, but controlled entirely through code.

The Screeps paradigm of writing code and having it execute in a real-time game environment is well suited to an LLM benchmark. Drawing on a version of the Screeps open source API, LLM Skirmish pits LLMs head-to-head in a series of 1v1 real-time strategy games.

Overall Standings

#  Model            Wins  Losses  Win %   ELO
1  Claude Opus 4.5    85      15    85%  1778
2  GPT 5.2            68      32    68%  1625
3  Grok 4.1 Fast      39      61    39%  1427
4  GLM 4.7            32      68    32%  1372
5  Gemini 3 Pro       26      74    26%  1297

Objective

In LLM Skirmish, each player begins with a "spawn" (a building that can create units), one military unit, and three economic units. The objective of each match is to eliminate your opponent's spawn. If neither spawn is eliminated within 2,000 game frames (each player is allowed up to one second of runtime computation per frame), the game ends and the victor is determined by score.
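The match-resolution rule above can be sketched as a small decision function. This is only an illustration: the player-object shape, the `FRAME_LIMIT` constant, and the tie handling are assumptions, not the actual LLM Skirmish engine API.

```javascript
// A minimal sketch of the match-resolution rule described above.
// Player objects and tie handling are assumed for illustration.
const FRAME_LIMIT = 2000;

function decideWinner(frame, a, b) {
  // A player loses the moment its spawn is eliminated.
  if (!a.spawnAlive && b.spawnAlive) return b.name;
  if (!b.spawnAlive && a.spawnAlive) return a.name;
  // At the frame limit, the victor is determined by score.
  if (frame >= FRAME_LIMIT) {
    if (a.score === b.score) return "draw"; // assumed tie handling
    return a.score > b.score ? a.name : b.name;
  }
  return null; // match still in progress
}
```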

Tournament Setup

Every LLM Skirmish tournament consists of five rounds. In each round, each LLM is asked to write a script implementing its strategy. For every round after the first, each LLM can see the results of all its matches from the previous round and use that information to revise the script it submits next. In every round, every player plays every other player once, which means there are 10 matches per round and 50 matches per tournament.
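The match counts above follow directly from round-robin arithmetic; a quick check:

```javascript
// Sanity-checking the tournament arithmetic: a single round is an
// all-play-all among the 5 evaluated models, i.e. C(5, 2) pairings,
// and a tournament is five such rounds.
function roundRobinPairings(n) {
  return (n * (n - 1)) / 2;
}

const matchesPerRound = roundRobinPairings(5);    // 10
const matchesPerTournament = 5 * matchesPerRound; // 50
```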

Agent Setup

LLM Skirmish was conducted using OpenCode, an open source, general-purpose agentic coding harness. OpenCode was selected because it was not designed around any of the evaluated models and is fully open source, which aids replicability. Each LLM agent runs in an isolated Docker container, with OpenCode providing the coding environment. The orchestrator coordinates the tournament by sending prompts to each agent, which then uses OpenCode's tools (file editing, shell commands, etc.) to write and submit its game script.

Prompt Structure

At the start of each round, agents receive OBJECTIVE.md (the game rules, API documentation, and instructions for writing a game script) and, in rounds 2-5 only, NEXT_ROUND.md (instructions for reviewing match logs from the previous round). Agents are also provided with two example strategies as reference.

Script Validation

After each agent creates its strategy, the orchestrator validates the script. If validation fails, the agent receives the error message and has up to 3 attempts to fix the issue before the round proceeds.
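The validate-and-retry flow can be sketched as a small control loop. Only the "up to 3 fix attempts" behavior is taken from the text; `validateScript` and `askAgentToFix` are hypothetical stand-ins for whatever the real orchestrator does.

```javascript
// Sketch of the orchestrator's validate-and-retry loop. The callback
// names are illustrative; only the retry limit comes from the article.
function submitWithRetries(script, validateScript, askAgentToFix, maxAttempts = 3) {
  let current = script;
  let error = validateScript(current); // null means the script is valid
  let attempts = 0;
  while (error !== null && attempts < maxAttempts) {
    current = askAgentToFix(current, error); // agent sees the error message
    error = validateScript(current);
    attempts++;
  }
  // The round proceeds either way; an unfixed script simply stays invalid.
  return { script: current, valid: error === null, attempts };
}
```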

In-context Learning

LLM Skirmish tests in-context learning: each tournament lasts five rounds, and models are able to alter their strategies between rounds. If a model is successfully learning in context, scripts written after seeing previous results (as in rounds 2-5) should be of higher quality than scripts written in round 1. Across all tournaments, each model submits 25 scripts, for a total of 250 matches. Within a tournament, we consider each model to be a player. If we instead treat each script as a player and have all scripts play against each other, we can simulate 7,750 matches and obtain a robust per-round average win rate (a proxy for script quality).

Script Round vs Performance

Four of the five models evaluated show notable increases in average win rate between round 1 and round 5 (Claude Opus 4.5 +20%, GLM 4.7 +16%, GPT 5.2 +7%, Grok 4.1 Fast +6%).

Gemini 3 Pro Performance

Gemini 3 Pro's performance presents an anomaly. Its round 1 average win rate was 70% (higher than all four other evaluated models), while its round 2-5 average win rate was 15% (lower than all four). Gemini 3 Pro's round 1 scripts are approximately four times shorter than those of the top-performing models, Claude Opus 4.5 and GPT 5.2, and a qualitative review suggests it had success with simplistic strategies in round 1. In rounds 2-5, compared to the other four models, Gemini 3 Pro most aggressively populated its context with previous-round results before submitting its script, suggesting that context rot was a notable contributor to the performance variance. Whether this context rot reflects other models being better at planning tool use, or OpenCode being a uniquely inhospitable harness for Gemini 3 Pro, is worth investigating in future versions of LLM Skirmish.
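The 7,750 simulated-match figure is plain combinatorics: 5 models each submitting 25 scripts yields 125 distinct scripts, and an all-play-all among them is C(125, 2) pairings.

```javascript
// Sanity-checking the simulated-match figure from the section above:
// 5 models x 25 scripts each = 125 scripts; all-play-all = C(125, 2).
const totalScripts = 5 * 25;                                      // 125
const simulatedMatches = (totalScripts * (totalScripts - 1)) / 2; // 7750
```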
