Introduction
Recent progress in the mathematical capabilities of LLMs has created a need for increasingly challenging benchmarks. With MathArena, we address this need by evaluating models on difficult and recent mathematical competitions, offering benchmarks that are both uncontaminated and interpretable. Among these competitions, the International Mathematical Olympiad (IMO) stands out as the most well-known and prestigious. As such, an evaluation of the IMO 2025, which took place just a few days ago, is a necessary addition to the MathArena leaderboard. In this post, we present our methodology and results from evaluating several state-of-the-art models on the IMO 2025 problems. Our goal was to determine whether these models could reach key milestones corresponding to medal-level performance: bronze (top 50%), silver (top 25%), or even gold (top 8%). To probe the true limits of current LLMs, we used a best-of-n selection method to scale inference-time compute as much as possible in an attempt to reach one of these milestones.
The best-performing model is Gemini 2.5 Pro, achieving a score of 31% (13 points), which is well below the 19/42 score necessary for a bronze medal. Other models lagged significantly behind, with Grok-4 and DeepSeek-R1 in particular underperforming relative to their earlier results on other MathArena benchmarks. We also share some initial qualitative observations in this post, but we invite the community to conduct their own analyses. Visit our website to explore the raw model outputs and dive deeper into the results.
Update 19/07: OpenAI has announced that it achieved a gold medal with a currently unreleased model. The IMO organizers have confirmed that they validated the proofs generated by the model, but could not verify how these proofs were produced. We are happy to see the steep progress in this field, and look forward to the release of the model so that independent evaluations become possible using public benchmarks like MathArena.
Methodology
Setup We followed a methodology similar to our evaluation of the 2025 USA Math Olympiad [1]. In particular, four experienced human judges, each with IMO-level mathematical expertise, were recruited to evaluate the responses. Evaluation began immediately after the 2025 IMO problems were released to prevent contamination. Judges reviewed the problems and developed grading schemes, with each problem scored out of 7 points. To ensure fairness, each response was anonymized and graded independently by two judges. Grading was conducted using the same interface developed for our Open Proof Corpus project [2].
Models We evaluated five state-of-the-art models: o3, o4-mini, Gemini 2.5 Pro, Grok-4, and DeepSeek-R1 (05/28). These were selected based on their prior performance on MathArena competitions. Each model was run with its recommended hyperparameters and a maximum token limit of 64,000; no model needed more than this. We used the same prompting strategy as in our Open Proof Corpus evaluation (provided at the bottom of this post). For each problem, each model generated four distinct responses.
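To make the setup concrete, here is a minimal sketch of the generation loop under the configuration described above. The names are illustrative: `generate_response` stands in for whichever provider API each model is served through, and the problem dictionary is a hypothetical input format.

```python
# Illustrative sketch only; generate_response is a placeholder for the
# provider-specific API call used for each model.
MODELS = ["o3", "o4-mini", "gemini-2.5-pro", "grok-4", "deepseek-r1-0528"]
MAX_TOKENS = 64_000          # no model needed more than this
RESPONSES_PER_PROBLEM = 4    # four distinct responses per problem


def collect_responses(problems: dict[str, str], generate_response) -> dict:
    """Generate RESPONSES_PER_PROBLEM candidate proofs per model and problem."""
    responses = {}
    for model in MODELS:
        for problem_id, statement in problems.items():
            responses[(model, problem_id)] = [
                generate_response(
                    model=model, prompt=statement, max_tokens=MAX_TOKENS
                )
                for _ in range(RESPONSES_PER_PROBLEM)
            ]
    return responses
```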
Best-of-n Selection A key critique of our USAMO evaluation was that models should not be expected to solve extremely difficult problems in a single attempt. This critique applies even more strongly to the harder IMO problems. To mitigate this limitation, we applied a best-of-32 selection strategy based on previous work [3]. In our prior work [2], we found that this method works very well for proof generation tasks, almost doubling model performance on the data we had at hand. Specifically, for each final model answer, we first generated 32 candidate responses. These responses were then evaluated in a bracket-style tournament using an LLM-as-a-judge: the model itself compared pairs of its own responses and selected the stronger one in each head-to-head matchup. This process was repeated until a single best response remained, which was then presented to the human judges for evaluation. We used the same judging prompt as in our prior work and repeat it at the bottom of this post for completeness. This selection process was computationally and financially intensive: on average, each final model answer cost at least $3 to generate, with Grok-4 costing over $20 per answer. As such, the performance reported here represents the models' best achievable output within a reasonable resource budget.
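For clarity, below is a minimal sketch of the bracket-style tournament, assuming a hypothetical `judge_pair(a, b)` helper that wraps the LLM-as-a-judge call and returns whichever of the two responses the model deems stronger.

```python
import random


def best_of_n(responses, judge_pair):
    """Single-elimination (bracket-style) tournament over candidate responses.

    judge_pair(a, b) is a stand-in for the LLM-as-a-judge call: the model
    compares two of its own responses and returns the stronger one.
    """
    pool = list(responses)
    random.shuffle(pool)                 # random initial bracket seeding
    while len(pool) > 1:
        winners = []
        # Pair up responses; an unpaired response gets a bye to the next round.
        for a, b in zip(pool[0::2], pool[1::2]):
            winners.append(judge_pair(a, b))
        if len(pool) % 2 == 1:
            winners.append(pool[-1])
        pool = winners
    return pool[0]                       # the single surviving response


# With 32 candidates this runs 5 rounds and 16 + 8 + 4 + 2 + 1 = 31 comparisons.
```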
Results
As mentioned above, Gemini 2.5 Pro achieved the highest score, with an average of 31% (13 points). While this may seem low, especially considering the $400 spent on generating just 24 answers, it nonetheless represents a strong performance given the extreme difficulty of the IMO. However, these 13 points are not enough for a bronze medal (19/42). In contrast, the other models trail significantly behind, and we can already safely say that none of them will achieve a bronze medal. Full results are available on our leaderboard, where everyone can explore and analyze individual responses and judge feedback in detail.
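For readers converting between points and the percentages shown on the leaderboard, the mapping is simply points out of the 42 available (six problems, each graded out of 7). A tiny sketch, using only the numbers quoted above:

```python
MAX_SCORE = 6 * 7        # six problems, each graded out of 7 points
BRONZE_CUTOFF = 19       # bronze medal threshold at IMO 2025


def percent(points: float) -> float:
    """Convert a point total to the percentage shown on the leaderboard."""
    return 100 * points / MAX_SCORE


print(round(percent(13)))        # 31   -> Gemini 2.5 Pro's average score
print(13 >= BRONZE_CUTOFF)       # False -> short of a bronze medal
```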