MTG Bench: Testing how well LLMs can play Magic

How the benchmark works

The main idea is that if an LLM is smart enough to play good Magic, then it is also smart enough to not need a rules engine. A rules engine that enforces legal actions would raise the performance floor, but I don't think it would improve the overall quality of the simulation.

Each LLM call has access to an MCP server with primitive library operations. It can do things like draw a card from the top of the deck, return a card to the bottom of the deck, and shuffle. To simulate more advanced operations, like scry or surveil, it can chain multiple library tool calls. Everything other than the library is managed by the LLM.

Legality checks and scoring for the benchmarks were all done with gpt-5.5 (medium). From my testing, LLMs were much better at evaluating whether a simulated turn was legal than they were at actually performing a legal turn simulation.
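To make the idea of composing advanced operations from library primitives concrete, here is a minimal sketch. It is not the benchmark's actual server: the `Library` class, its method names, and the `put_on_top` primitive are assumptions made for illustration, and the real operations are exposed as MCP tools rather than Python methods.

```python
import random
from collections import deque


class Library:
    """Hypothetical stand-in for the library state behind the MCP server."""

    def __init__(self, cards, seed=None):
        self._cards = deque(cards)          # index 0 is the top of the deck
        self._rng = random.Random(seed)

    def draw_top(self):
        """Remove and return the top card."""
        if not self._cards:
            raise ValueError("cannot draw from an empty library")
        return self._cards.popleft()

    def put_on_top(self, card):
        """Assumed primitive: place a card on top of the deck."""
        self._cards.appendleft(card)

    def put_on_bottom(self, card):
        """Place a card on the bottom of the deck."""
        self._cards.append(card)

    def shuffle(self):
        """Randomize the order of the remaining cards."""
        cards = list(self._cards)
        self._rng.shuffle(cards)
        self._cards = deque(cards)

    def size(self):
        return len(self._cards)


def scry_one(library, keep_on_top):
    """Scry 1 built from primitives: look at the top card, then keep it or bottom it."""
    card = library.draw_top()
    if keep_on_top(card):
        library.put_on_top(card)
    else:
        library.put_on_bottom(card)
    return card


lib = Library(["Forest", "Lightning Bolt", "Island"])
scry_one(lib, keep_on_top=lambda card: card != "Forest")
print(lib.draw_top())  # Lightning Bolt
```

The server only needs to know about card order. Whether a scry was allowed at that point in the turn is left to the model, which matches the design described above.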

Why I chose to use an MCP server

I have full control over all of the data and the LLM API calls, so why use MCP instead of basic function/tool calling? The main reason is that OpenAI and Anthropic allow you to provide a remote MCP server URL in an API request, which means that OpenAI or Anthropic handles the agent loop. This has two major benefits:

1. Since it is one API call, you don't pay the cached input token cost after each tool use (at least with OpenAI; more on that later).
2. You can use the batch API for 50% savings without having to submit a new batch after every tool call.
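As a rough illustration of what "providing a remote MCP server URL in an API request" might look like, the sketch below only builds request bodies and does not send anything. The field names reflect my understanding of OpenAI's Responses API (`tools` entries of type `mcp`) and Batch API input lines, so check them against the current documentation. The server URL, server label, and prompts are placeholders, not the benchmark's real values.

```python
import json

# Placeholder values: the real server URL and label are not given in this post.
MCP_SERVER_URL = "https://example.com/mcp"


def build_turn_request(system_prompt, turn_prompt, model="gpt-5.5"):
    """Body for one API call; the provider runs the tool loop against the MCP server."""
    return {
        "model": model,
        "instructions": system_prompt,
        "input": turn_prompt,
        "tools": [
            {
                "type": "mcp",
                "server_label": "mtg_library",   # hypothetical label
                "server_url": MCP_SERVER_URL,
                "require_approval": "never",     # let tool calls run without a pause
            }
        ],
    }


def build_batch_line(custom_id, body):
    """One JSONL line for a batch input file that targets the same endpoint."""
    return json.dumps(
        {"custom_id": custom_id, "method": "POST", "url": "/v1/responses", "body": body}
    )


body = build_turn_request("You are simulating a game of Magic.", "Take turn 1.")
print(build_batch_line("turn-1", body))
```

Because the whole agent loop lives inside one request, each simulated turn becomes a single batch line rather than a chain of dependent calls.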

Input token caching

In my opinion, the way cached input tokens are charged does not make sense for agent loops. The pricing makes sense for independent requests: if multiple independent API calls start with the same large system prompt, input caching gets you a discount for free, or for a small caching fee. With an agent loop, however, you are charged the cached input cost for the large system prompt again after every tool call.

Consider an example. Assume the system prompt is already cached and tool calls result in negligible token use.

Large system prompt = 10k tokens

Agent calls 10 tool functions (not parallel)

Billed cached input tokens = 10k + 10k * 10 = 110k tokens

I don't think it makes sense to charge for the system prompt after every agent turn if the LLM is only pausing for a fraction of a second while waiting for a tool function result. This overlooks some details: calling a tool takes output tokens, and the tool function result still needs to be processed as input tokens. But in my case, the API cost is dominated by the large system prompt being charged as cached input tokens after every agent turn.

The pricing for an agent loop is understandable when your application code runs the agent loop and makes a new API call after each tool call. But it makes even less sense when you provide a remote MCP server and do not handle the agent loop yourself. OpenAI handles it correctly: a single API call to OpenAI with a remote MCP server only ever charges you for the input prompt once. An Anthropic API call with a remote MCP server, however, is billed like the previous example.

Some real numbers: the gpt-5.5 (medium) benchmark averaged 11,386 input tokens per Magic turn. The average for claude-fable-5 (medium) was 51,610.
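The 10k-token example from this section can be written as a small function. This is a simplified model under the same assumptions as the example (tool calls and results add negligible tokens); the function and parameter names are mine, and it is not a reproduction of either provider's actual billing logic.

```python
def billed_input_tokens(prompt_tokens, tool_calls, rebill_per_tool_call):
    """Input tokens billed for one agent run under the simplified model above."""
    if rebill_per_tool_call:
        # One initial pass, plus the prompt charged again after every tool call.
        return prompt_tokens * (1 + tool_calls)
    # The prompt is charged once for the whole loop.
    return prompt_tokens


print(billed_input_tokens(10_000, 10, rebill_per_tool_call=True))   # 110000
print(billed_input_tokens(10_000, 10, rebill_per_tool_call=False))  # 10000
```

Under this model the billed input grows linearly with the number of sequential tool calls when the prompt is re-billed, and stays flat when it is not.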

Over-eager tool calling

This benchmark punishes over-eager tool calling more than most benchmarks do. In many cases, tool calls only retrieve information, so if a model calls too many tools, the only downside is wasted input tokens and context window for the tool results. Even if the tool mutates state, the change can usually be undone so the final result is correct. This is not the case when simulating Magic. If you draw a card, then realize that was a mistake, you can't just put it back. Even if you do return the card to the deck, you now know what that card is, so the simulation is illegal.

A common failure mode was the model starting a tool call, then realizing it was a mistake and having no way to correct it. All the library MCP functions have a required reason field. If you look at this example from Opus 4.8, you can see that it draws a card for turn with reason "Draw for turn", then returns the card to the deck with reason "No-op check not needed; cancel". It then proceeds to return a card named "x" to the deck with reason "noop", then again with reason "stop".
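A small sketch of why a mistaken draw cannot be undone. The `LoggedLibrary` class, its method names, and the `seen` set are hypothetical and only illustrate the point; the only detail taken from the benchmark is that every library call carries a required reason.

```python
from collections import deque


class LoggedLibrary:
    """Hypothetical library that records every call together with its reason."""

    def __init__(self, cards):
        self.cards = deque(cards)   # index 0 is the top of the deck
        self.seen = set()           # cards whose identity the player has learned
        self.log = []               # (operation, card, reason) tuples

    def draw_top(self, reason):
        if not reason:
            raise ValueError("reason is required")
        card = self.cards.popleft()
        self.seen.add(card)
        self.log.append(("draw_top", card, reason))
        return card

    def return_to_bottom(self, card, reason):
        if not reason:
            raise ValueError("reason is required")
        self.cards.append(card)
        self.log.append(("return_to_bottom", card, reason))


lib = LoggedLibrary(["Forest", "Island", "Swamp"])
before = list(lib.cards)
card = lib.draw_top(reason="Draw for turn")
lib.return_to_bottom(card, reason="cancel")   # attempted undo

print(list(lib.cards) == before)  # False: the card is now on the bottom
print(lib.seen)                   # {'Forest'}: the knowledge cannot be returned
```

Even if the card went back to its original position, the `seen` set would still have grown, which is the part no tool call can reverse.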