I think ARC-AGI is still the most important benchmark we have today. It’s surprising that LLMs can win the math olympiad but struggle with simple puzzles that humans can solve easily.
This highlights a core limitation of current LLMs: they struggle to reason about things they weren't trained on. They struggle to generalize. But they are getting better, fast.
Last December, I got first place on ARC-AGI v1 with a score of 53.6%. A lot has changed since then. Thinking models had just come out, and they were not yet very good at thinking. o1 was still in preview. DeepSeek's R1 was not out yet.
Two weeks after my score was released, o3 preview beat it handily, scoring 75.7% while spending $200 per task.
But today I got my revenge. My latest program achieves a new high score of 79.6% on ARC v1 at $8.42 per task (25× more efficient than o3) and, more importantly, sets a new state-of-the-art (SoTA) of 29.4% on ARC v2 (previous best: 25%). I used the same Evolutionary Test-Time Compute architecture as my v1 solution but replaced Python functions with plain English instructions.
The system works by having Grok-4 generate natural language instructions for solving each task. Grok-4 subagents test these instructions against training examples, scoring their accuracy. The best-performing instructions spawn new generations of refined solutions. Through multiple evolutionary cycles, the system generates up to 40 candidate instructions using 36 dynamic prompts per task. You can find the code here.
When I built my v1 solution 10 months ago, it was my first AI research project. Since then, I’ve been training large models with reinforcement learning, which has changed how I think about reasoning and intelligence. In this post, I’ll describe my latest solution, how I’ve updated my thinking, and how I think we can get to general intelligence.
This post has the following sections:
What is ARC-AGI
My Method
ARC and AGI
What is ARC-AGI
ARC-AGI is an intelligence test designed to measure abstract pattern recognition, similar to an IQ test. What makes it notable is the stark performance gap between humans and AI: while humans can readily solve these puzzles, LLMs struggle significantly. The test presents novel patterns through a few examples and then challenges the test-taker to continue the sequence, measuring their ability to identify and generalize underlying rules they've never encountered before.
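Concretely, each task is distributed as a small JSON object containing a few training input/output grid pairs plus one or more test inputs, where every cell is an integer from 0 to 9 representing a color. Below is a toy task in that shape (not one of the real puzzles), with a deliberately trivial hidden rule: recolor every 1 to 2.

```python
# A toy task in the ARC JSON shape (not a real puzzle). The hidden rule here
# is trivial: every 1 in the input becomes a 2 in the output.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [
        {"input": [[0, 0], [0, 1]]},  # the solver must produce [[0, 0], [0, 2]]
    ],
}
```

Real tasks follow the same structure, but the rules are far less obvious and often involve objects, symmetry, counting, or spatial relationships.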
Let’s look at a real challenge. Here, you are given two examples of input/output grids and you must fill in the test output grid with the correct colors.
The solution is:
ARC-AGI v1 had tasks like this, but slightly harder on average. ARC-AGI v2, created in early 2025, has much harder tasks that require more multi-step reasoning to solve. For example:
And the solution:
v2 challenges are still well within human reach: smart humans can get 100% accuracy on a batch of 100 challenges. In contrast, the best LLMs get only 16%. You can find the leaderboard here.
My Method
It’s worth reading my v1 solution first, which goes into deeper technical details about how the architecture works.
My original solution used language models to generate Python functions to solve tasks. This approach had a key advantage: functions are deterministic and testable. I could generate hundreds of candidate functions, rank them by their performance on training examples, and evolve better solutions from the highest-scoring ones.
This strategy hits a wall with ARC v2. The transformations are often too complex to express elegantly in Python—they require nuanced pattern recognition and contextual understanding that would result in unwieldy, brittle code. So I turned to a language much older than Python: English.
My v2 solution is essentially the same evolutionary architecture, but it evolves natural language instructions instead of code.
The Core Loop
For each task, I use a language model to generate plain-English instructions describing how to transform input grids to output grids. To evaluate these instructions, I have a sub-agent model apply them to the training examples—treating each training grid as if it were the test grid and generating what it believes is the correct output. This gives me a fitness score for each instruction based on how many training examples it solves correctly (or partially, counting the percentage of correct cells).
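To make the scoring concrete, here is a minimal sketch of that fitness computation, assuming grids are lists of lists of ints. The sub-agent model call is passed in as a callable (`apply_instruction`), which is a hypothetical stand-in for the actual prompt to the model; the names and structure are illustrative, not the real implementation.

```python
from typing import Callable

Grid = list[list[int]]

def cell_accuracy(predicted: Grid, expected: Grid) -> float:
    """Fraction of cells that match the ground truth; 0.0 if the shapes differ."""
    if len(predicted) != len(expected) or any(
        len(p) != len(e) for p, e in zip(predicted, expected)
    ):
        return 0.0
    total = sum(len(row) for row in expected)
    correct = sum(
        p == e
        for pred_row, exp_row in zip(predicted, expected)
        for p, e in zip(pred_row, exp_row)
    )
    return correct / total

def fitness(
    instruction: str,
    train_pairs: list[tuple[Grid, Grid]],
    apply_instruction: Callable[[str, Grid], Grid],  # stand-in for the sub-agent model call
) -> float:
    """Average cell accuracy of one instruction across all training examples."""
    scores = [
        cell_accuracy(apply_instruction(instruction, inp), out)
        for inp, out in train_pairs
    ]
    return sum(scores) / len(scores)
```

A fully correct instruction scores 1.0; partially correct instructions still get credit, which is what lets the evolutionary loop rank and improve imperfect candidates.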
Once I have a population of scored instructions, the evolution begins through two distinct revision strategies: individual and pooled.
Individual vs. Pooled Revisions
Individual revisions take a single instruction along with its generated outputs and the ground truth. The model sees both the raw grids and an ASCII diff highlighting the discrepancies. Armed with this feedback, it refines the instruction to correct its mistakes.
Pooled revisions follow the same principle but combine multiple instructions into a single context. The model is prompted to synthesize a new instruction that incorporates the successful elements from each parent instruction.
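As an illustration, here is a minimal sketch of what a grid diff like the one used in individual revisions could look like. The mismatch marker ('X') and the layout are my assumptions for illustration; the exact diff format in the actual prompts may differ.

```python
Grid = list[list[int]]

def ascii_diff(predicted: Grid, expected: Grid) -> str:
    """Render the expected grid, marking every cell the prediction got wrong with 'X'."""
    if len(predicted) != len(expected) or any(
        len(p) != len(e) for p, e in zip(predicted, expected)
    ):
        pred_cols = len(predicted[0]) if predicted else 0
        exp_cols = len(expected[0]) if expected else 0
        return f"shape mismatch: predicted {len(predicted)}x{pred_cols}, expected {len(expected)}x{exp_cols}"
    rows = []
    for pred_row, exp_row in zip(predicted, expected):
        rows.append(" ".join(str(e) if p == e else "X" for p, e in zip(pred_row, exp_row)))
    return "\n".join(rows)

print(ascii_diff([[1, 2], [3, 4]], [[1, 2], [3, 5]]))
# 1 2
# 3 X
```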
The full prompts are in the footnotes.
You might expect pooled revisions to always outperform individual ones—after all, more context should mean better results. In practice, it's not that simple. Thinking models generate extensive reasoning tokens, and for many ARC tasks, including more than two instructions causes the context to exceed token limits, stalling the response. Moreover, like humans, language models can lose focus when overwhelmed with information. Sometimes the additional context from multiple instructions actually degrades reasoning performance rather than enhancing it.
The Final Architecture
After extensive experimentation, I converged on this design:
Initial generation: Use Grok-4 to generate 30 candidate instructions.
Individual revision phase: If no perfect solutions emerge, take the top 5 instructions and run each through individual revision.
Pooled revision phase: If still no perfect solutions, take the 5 highest-scoring instructions, create a pooled revision prompt, and generate 5 new candidates from it.
In the worst case (when solutions aren't found until the final step), this produces 40 total instruction attempts per task: 30 initial + 5 individual revisions + 5 pooled revisions. This balance provides enough exploration in the initial phase, focused refinement in the individual revision phase, and creative recombination in the pooled phase—all while staying within computational constraints.
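For concreteness, here is a rough sketch of that three-phase loop. The model calls are passed in as callables (`generate_candidates`, `revise_individually`, and `revise_pooled` are hypothetical names standing in for the Grok-4 prompts), and `score` is the training-example fitness described earlier; this outlines the control flow, not the actual implementation.

```python
from typing import Callable

def solve_task(
    generate_candidates: Callable[[int], list[str]],      # hypothetical: n initial instructions
    revise_individually: Callable[[str], str],             # hypothetical: refine one instruction from its diff feedback
    revise_pooled: Callable[[list[str], int], list[str]],  # hypothetical: n new instructions synthesized from parents
    score: Callable[[str], float],                         # fitness on training examples (e.g. mean cell accuracy)
) -> str:
    scores: dict[str, float] = {}

    def ranked(new_candidates: list[str]) -> list[str]:
        # Score any unseen candidates, then rank the whole population so far.
        for c in new_candidates:
            scores.setdefault(c, score(c))
        return sorted(scores, key=scores.get, reverse=True)

    # Phase 1: broad exploration with 30 initial instructions.
    population = ranked(generate_candidates(30))
    if scores[population[0]] == 1.0:
        return population[0]

    # Phase 2: individually revise each of the top 5 instructions.
    population = ranked([revise_individually(parent) for parent in population[:5]])
    if scores[population[0]] == 1.0:
        return population[0]

    # Phase 3: pool the 5 best into one prompt and generate 5 new candidates.
    population = ranked(revise_pooled(population[:5], 5))
    return population[0]
```

Because every new candidate is ranked against the whole population, a revision only wins if it actually beats its parent on the training examples.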
ARC and AGI
I agree with François Chollet's definition of AGI: a system that can efficiently acquire new skills outside of its training data. In practical terms, we'll know we've achieved AGI when we can no longer create tasks that are easy for humans but hard for AI.
ARC-AGI exemplifies this gap well. LLMs are PhD-level reasoners in math and science, yet they fail at children's puzzles. How is this possible?
Dead Reasoning Zones
When LLMs try to solve ARC tasks, they don't just fail; they fail in ways that violate basic logic. I have 100k+ traces of thinking models generating obviously false instructions. They'll spend 20 minutes "thinking" and then confidently claim an object is symmetrical when it obviously isn't. When corrected, they still can't see the error.
This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce.
But LLMs have "dead reasoning zones" — areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model.
The Fused Circuit Problem
Neural networks learn the distribution they're trained on. Many see this as a fundamental blocker to generalization: their skills are ossified at training time.
I think that's half right. Neural networks do only learn their training distribution. But reasoning itself can be part of that distribution. Reasoning is the generalization engine — the skill that enables acquiring all other skills.
The problem is how current LLMs learn reasoning. When they train on math, they learn math reasoning. When they train on code, they learn coding reasoning. But these reasoning circuits are fused with domain-specific circuits. There's some transfer (learning math reasoning helps with economics), but it's incomplete. The model doesn't fundamentally master logic itself; it masters math-logic, code-logic, and writing-logic as separate skills.
It's as if humans stored a compressed kernel of deduction and logic that we call upon for everything, while LLMs store this kernel fragmented across domain-specific embeddings. They're overfitting to domain-specific reasoning patterns.
Bringing Reasoning into Distribution
When I wrote my original ARC solution, I said "LLMs are trained with induction. They learn by predicting the next word... This makes them great at outputting words that sound correct." Looking back, it's obvious that RL over chain-of-thought was the next step.
With RL, models no longer just learn what sounds correct based on patterns they've seen. They learn what words to output to be correct. RL is the process of forcing the pre-trained weights to be logically consistent.
We don't need models to escape their training distribution. We need to bring reasoning itself fully into that distribution. Not domain-specific reasoning, but the pure skill of logical deduction and consistency that humans apply universally. When models have consistent, transferable reasoning, we'll have AGI.