Apple’s recent AI research paper, “The Illusion of Thinking”, has been making waves for its blunt conclusion: even the most advanced Large Reasoning Models (LRMs) collapse on complex tasks. But not everyone agrees with that framing.
Today, Alex Lawsen, a researcher at Open Philanthropy, published a detailed rebuttal arguing that many of Apple’s most headline-grabbing findings boil down to experimental design flaws, not fundamental reasoning limits. The rebuttal also credits Anthropic’s Claude Opus model as a co-author.
The rebuttal: Less “illusion of thinking,” more “illusion of evaluation”
Lawsen’s critique, aptly titled “The Illusion of the Illusion of Thinking,” doesn’t deny that today’s LRMs struggle with complex planning puzzles. But he argues that Apple’s paper confuses practical output constraints and flawed evaluation setups with actual reasoning failure.
Here are the three main issues Lawsen raises:
Token budget limits were ignored in Apple’s interpretation:
At the point where Apple claims models “collapse” on Tower of Hanoi puzzles with 8+ disks, models like Claude were already bumping up against their token output ceilings. Lawsen points to real outputs where the models explicitly state: “The pattern continues, but I’ll stop here to save tokens.”
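To see why output length alone can be the bottleneck, here is a back-of-the-envelope sketch (not from either paper): the optimal Tower of Hanoi solution for n disks requires 2^n − 1 moves, so an answer that enumerates every move grows exponentially. The tokens-per-move figure below is purely an illustrative assumption.

```python
# Back-of-the-envelope sketch (not from either paper): the optimal Tower of
# Hanoi solution for n disks takes 2**n - 1 moves, so an answer that lists
# every move grows exponentially. TOKENS_PER_MOVE is an assumed figure for
# illustration only.
TOKENS_PER_MOVE = 10  # e.g. "move disk 3 from peg A to peg C"

for n in (8, 10, 12, 15):
    moves = 2**n - 1
    print(f"{n} disks: {moves:>6,} moves ≈ {moves * TOKENS_PER_MOVE:>8,} output tokens")
```

Even with a generous per-move estimate, the larger instances run into a model’s output ceiling well before any reasoning limit is actually tested.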
Impossible puzzles were counted as failures:
Apple’s River Crossing test reportedly included unsolvable puzzle instances (for example, six or more actor/agent pairs with a boat capacity that mathematically cannot carry everyone across the river under the given constraints). Lawsen points out that models were penalized for recognizing this and refusing to attempt a solution.
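To make the impossibility concrete, the brute-force search below (a sketch, not Apple’s benchmark harness or Lawsen’s analysis) exhaustively explores the actor/agent River Crossing state space. The safety rule it encodes, that an actor may never be with another agent unless its own agent is present, is the classic “jealous husbands” formulation and is assumed here to match the benchmark’s.

```python
# Exhaustive BFS over River Crossing states (a sketch, not either paper's code).
# Safety rule assumed: an actor may not be with another agent unless its own
# agent is also present, on either bank or in the boat.
from collections import deque
from itertools import combinations

def safe(actors, agents):
    # A group is safe if no agents are present, or every actor's own agent is there.
    return not agents or actors <= agents

def solvable(n_pairs, boat_capacity):
    everyone = frozenset(range(n_pairs))
    start = (everyone, everyone, True)       # actors on left, agents on left, boat on left
    goal = (frozenset(), frozenset(), False)
    seen, queue = {start}, deque([start])
    while queue:
        actors_l, agents_l, boat_left = queue.popleft()
        if (actors_l, agents_l, boat_left) == goal:
            return True
        # Everyone standing on the boat's current side can board.
        actors_here = actors_l if boat_left else everyone - actors_l
        agents_here = agents_l if boat_left else everyone - agents_l
        passengers = [("actor", i) for i in actors_here] + [("agent", i) for i in agents_here]
        for size in range(1, boat_capacity + 1):
            for group in combinations(passengers, size):
                boat_actors = {i for role, i in group if role == "actor"}
                boat_agents = {i for role, i in group if role == "agent"}
                if not safe(boat_actors, boat_agents):
                    continue  # unsafe mix on the boat itself
                if boat_left:
                    new_actors_l, new_agents_l = actors_l - boat_actors, agents_l - boat_agents
                else:
                    new_actors_l, new_agents_l = actors_l | boat_actors, agents_l | boat_agents
                # Both banks must remain safe after the crossing.
                if not safe(new_actors_l, new_agents_l):
                    continue
                if not safe(everyone - new_actors_l, everyone - new_agents_l):
                    continue
                state = (new_actors_l, new_agents_l, not boat_left)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

for pairs in (3, 5, 6):
    verdict = "solvable" if solvable(pairs, 3) else "unsolvable"
    print(f"{pairs} pairs, boat capacity 3: {verdict}")
```

Run with a boat capacity of 3, the search reproduces the classical result: with six or more pairs there is no valid sequence of crossings, so a model that declines to produce one is behaving correctly.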
Evaluation scripts didn’t distinguish between reasoning failure and output truncation:
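A hypothetical grader sketch (not the actual scripts either paper used) shows how the distinction could be drawn:

```python
# Hypothetical illustration only: a strict grader that accepts nothing short of
# the complete move list scores a truncated-but-correct-so-far answer the same
# as a genuinely wrong one, while a truncation-aware grader separates the cases.

def strict_grade(output_moves, solution_moves):
    """Pass/fail on exact match of the full move list."""
    return output_moves == solution_moves

def truncation_aware_grade(output_moves, solution_moves, hit_token_limit):
    """Distinguish 'made a wrong move' from 'ran out of output budget'."""
    prefix_ok = output_moves == solution_moves[:len(output_moves)]
    if prefix_ok and hit_token_limit and len(output_moves) < len(solution_moves):
        return "truncated"   # correct so far, but cut off by the output ceiling
    return "correct" if output_moves == solution_moves else "incorrect"
```

Under the first grader, a model that produced the correct opening moves and then hit its output cap is indistinguishable from one that reasoned incorrectly, which is exactly the conflation Lawsen objects to.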