
Do reasoning models really “think” or not? Apple research sparks lively debate, response



Apple’s machine-learning group set off a rhetorical firestorm earlier this month with the release of “The Illusion of Thinking,” a 53-page research paper arguing that so-called large reasoning models (LRMs), or reasoning large language models (reasoning LLMs), such as OpenAI’s “o” series and Google’s Gemini 2.5 Pro and Flash Thinking, don’t actually engage in independent “thinking” or “reasoning” from generalized first principles learned from their training data.

Instead, the authors contend, these reasoning LLMs are actually performing a kind of “pattern matching,” and their apparent reasoning ability seems to fall apart once a task becomes too complex. That, they suggest, means this architecture is not a viable path to improving generative AI to the point of artificial general intelligence (AGI), which OpenAI defines as a model that outperforms humans at most economically valuable work, or of superintelligence, AI smarter than human beings can comprehend.


Unsurprisingly, the paper immediately circulated widely among the machine learning community on X, and many readers’ initial reaction was to declare that Apple had effectively disproven much of the hype around this class of AI: “Apple just proved AI ‘reasoning’ models like Claude, DeepSeek-R1, and o3-mini don’t actually reason at all,” declared Ruben Hassid, creator of EasyGen, an LLM-driven tool for auto-writing LinkedIn posts. “They just memorize patterns really well.”

Now a new paper has emerged, cheekily titled “The Illusion of The Illusion of Thinking” and, notably, co-authored by a reasoning LLM itself, Claude Opus 4, alongside Alex Lawsen, a human independent AI researcher and technical writer. It compiles many criticisms raised by the larger ML community and argues that the methodologies and experimental designs the Apple research team used in its initial work are fundamentally flawed.

While we at VentureBeat are not ML researchers ourselves and are not prepared to say the Apple researchers are wrong, the debate has certainly been a lively one, and the question of how the capabilities of LRMs, or reasoning LLMs, compare to human thinking seems far from settled.

How the Apple Research study was designed — and what it found

Using four classic planning problems — Tower of Hanoi, Blocks World, River Crossing and Checker Jumping — Apple’s researchers designed a battery of tasks that forced reasoning models to plan multiple moves ahead and generate complete solutions.

These games were chosen for their long history in cognitive science and AI research and their ability to scale in complexity as more steps or constraints are added. Each puzzle required the models to not just produce a correct final answer, but to explain their thinking along the way using chain-of-thought prompting.
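To make that setup concrete, here is a minimal sketch in Python — not Apple’s actual evaluation code, and with illustrative function names and move formats assumed for this example — showing how a puzzle like Tower of Hanoi grows harder with each added disk (an optimal solution needs 2**n − 1 moves) and how a model’s proposed move list could be checked for validity.

```python
# Illustrative sketch only: how Tower of Hanoi difficulty scales with disk count,
# and how a candidate move sequence might be verified. Not Apple's harness.

def is_valid_solution(n_disks, moves):
    """Check whether a list of (source, target) peg moves solves Tower of Hanoi
    with n_disks, starting on peg 0 and ending on peg 2."""
    pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}  # largest disk at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                      # tried to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                      # placed a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n_disks, 0, -1))


def optimal_moves(n_disks, src=0, dst=2, aux=1):
    """Generate the optimal move sequence (2**n_disks - 1 moves) for reference."""
    if n_disks == 0:
        return []
    return (optimal_moves(n_disks - 1, src, aux, dst)
            + [(src, dst)]
            + optimal_moves(n_disks - 1, aux, dst, src))


if __name__ == "__main__":
    for n in range(3, 9):                     # scale complexity by adding disks
        moves = optimal_moves(n)              # stand-in for a model's answer
        print(f"{n} disks: {len(moves)} moves, valid={is_valid_solution(n, moves)}")
```

Because each added disk doubles the length of the shortest solution, this kind of puzzle lets researchers dial task complexity up smoothly and observe exactly where a model’s multi-step planning breaks down.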
