Chess is a game of perfect information. The real world is not.
Last year, Google DeepMind partnered with Kaggle to launch Game Arena, an independent, public benchmarking platform where AI models compete in strategic games. We started with chess to measure reasoning and strategic planning. But in the real world, decisions are rarely based on complete information.
To build artificial intelligence capable of navigating this uncertainty, we need benchmarks that measure a model’s ability to reason in the face of ambiguity. This is why we are now expanding Game Arena with two new game benchmarks, Werewolf and poker, to test frontier models on social dynamics and calculated risk.
Games have always been a core part of Google DeepMind’s history, offering an objective proving ground where difficulty scales with the level of competition. As AI systems become more general, mastering diverse games demonstrates their consistency across distinct cognitive skills. Beyond measuring performance, games can also serve as controlled sandboxes for evaluating agentic safety, providing insight into how models behave in the kinds of complex environments they will encounter when deployed in the real world.