I think ARC-AGI is still the most important benchmark we have today. It’s surprising that LLMs can win the math olympiad but struggle with simple puzzles that humans can solve easily.
This highlights a core limitation of current LLMs: they struggle to reason about things they weren't trained on. They struggle to generalize. But they are getting better, fast.
Last December, I got first place on ARC-AGI v1 with a score of 53.6%. A lot has changed since then. Thinking models had just come out, and they were not yet very good at thinking. o1 was still in preview, and DeepSeek's R1 was not out yet.
Two weeks after my score was released, o3 preview beat it handily, scoring 75.7% while spending $200 per task.
But today I got my revenge. My latest program achieves a new high score of 79.6% on ARC v1 at $8.42 per task (roughly 24× more efficient than o3) and, more importantly, sets a new state-of-the-art (SoTA) of 29.4% on ARC v2 (previous best: 25%). I used the same Evolutionary Test-Time Compute architecture as my v1 solution but replaced Python functions with plain English instructions.
The system works by having Grok-4 generate natural language instructions for solving each task. Grok-4 subagents test these instructions against training examples, scoring their accuracy. The best-performing instructions spawn new generations of refined solutions. Through multiple evolutionary cycles, the system generates up to 40 candidate instructions using 36 dynamic prompts per task. You can find the code here.
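To make the loop concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in: `ask_grok`, `propose`, `fitness`, and the prompt wording are my placeholders rather than the actual implementation (that's in the linked repo), and the real system's 36 dynamic prompt variants and parallel subagents are collapsed here into a single sequential loop.

```python
from dataclasses import dataclass

@dataclass
class Example:
    input: str    # grid serialized as text
    output: str

@dataclass
class Task:
    train: list   # list[Example]

def ask_grok(prompt: str) -> str:
    """Placeholder for a Grok-4 API call (plumbing omitted)."""
    raise NotImplementedError

def propose(task: Task, n: int) -> list:
    """Generate n candidate instruction sets, each from a
    differently phrased prompt over the training examples."""
    return [
        ask_grok(
            f"(Prompt variant {i}) Describe, as step-by-step English "
            f"instructions, the transformation shown in these "
            f"input/output examples:\n{task.train}"
        )
        for i in range(n)
    ]

def fitness(instructions: str, task: Task) -> float:
    """Fraction of training examples a subagent solves when it
    follows the instructions literally."""
    hits = 0
    for ex in task.train:
        predicted = ask_grok(
            f"Follow these instructions exactly:\n{instructions}\n\n"
            f"Input grid:\n{ex.input}"
        )
        hits += predicted.strip() == ex.output.strip()
    return hits / len(task.train)

def evolve(task: Task, generations: int = 4, pool_size: int = 10) -> str:
    """Evolutionary test-time compute: score a pool of candidate
    instructions, keep the fittest, and breed refinements until one
    solves every training example (or the budget runs out)."""
    pool = {ins: fitness(ins, task) for ins in propose(task, pool_size)}
    for _ in range(generations):
        best, best_fit = max(pool.items(), key=lambda kv: kv[1])
        if best_fit == 1.0:
            break
        # Show the model its best attempt and ask for revisions.
        children = [
            ask_grok(
                f"These instructions solved only {best_fit:.0%} of the "
                f"training examples:\n{best}\nRevise them to fix the rest."
            )
            for _ in range(pool_size // 2)
        ]
        pool.update({ins: fitness(ins, task) for ins in children})
    return max(pool, key=pool.get)
```

The key design choice is that candidates are plain-English instructions rather than Python programs, so fitness itself is judged by a model following the instructions instead of by executing code.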
When I built my v1 solution 10 months ago, it was my first AI research project. Since then, I’ve been training large models with reinforcement learning, which has changed how I think about reasoning and intelligence. In this post, I’ll describe my latest solution, how I’ve updated my thinking, and how I think we can get to general intelligence.
This post has the following sections:
What is ARC-AGI
My Method