Generative AI models quickly proved they were capable of performing technical tasks well. Adding reasoning unlocked new capabilities, enabling the models to work through more complex questions and produce better-quality, more accurate responses -- or so we thought.
Last week, Apple released a research report called "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity." As the title reveals, the 30-page paper dives into whether large reasoning models (LRMs), such as OpenAI's o1 models, Anthropic's Claude 3.7 Sonnet Thinking (which is the reasoning version of the base model, Claude 3.7 Sonnet), and DeepSeek R1, are capable of delivering the advanced "thinking" they advertise.
(Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)
Apple carried out the investigation through a series of experiments built around diverse puzzles that tested models beyond the scope of traditional math and coding benchmarks. The results showed that even the smartest models hit a point of diminishing returns: they scale up their reasoning effort as problems grow more complex, but only up to a limit, beyond which accuracy falls apart.
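To make that concrete, here is a minimal Python sketch (not Apple's actual test harness) of Tower of Hanoi, a classic puzzle of the kind the paper describes, where a single knob -- the number of disks -- dials up complexity. The optimal solution length grows exponentially, which is what makes this style of puzzle useful for stress-testing reasoning.

```python
def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, dst, aux)      # move the top n-1 disks out of the way
        + [(src, dst)]                   # move the largest disk to its target
        + hanoi(n - 1, aux, src, dst)    # restack the n-1 disks on top of it
    )

for n in range(1, 11):
    moves = hanoi(n)
    assert len(moves) == 2**n - 1        # optimal length is exactly 2^n - 1
    print(f"{n} disks -> {len(moves)} moves")
```

Each added disk doubles the length of the required solution, so researchers can ratchet difficulty up smoothly and watch exactly where a model's reasoning breaks down.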
I encourage you to read the paper if you are even remotely interested in the subject. However, if you don't have the time and just want the bigger themes, I unpack them for you below.
What are large reasoning models (LRMs)?
In the research paper, Apple uses "large reasoning models" to refer to what we would typically just call reasoning models. This type of large language model (LLM) was first popularized by OpenAI's o1 model, which OpenAI later followed with o3.
The concept behind LRMs is simple. Humans are encouraged to think before they speak so that what they say carries more value; similarly, when a model is encouraged to spend more time working through a prompt, its answer quality should be higher, and that extra processing should enable the model to handle more complex prompts well.
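In practice, that extra "thinking" is often exposed as a knob on the API. As a rough sketch -- assuming the OpenAI Python SDK and access to an o-series reasoning model; the model ID here is illustrative, and parameter availability can vary by SDK and model version -- the same prompt can be run with a larger reasoning budget:

```python
# Rough sketch, not a definitive integration: assumes `pip install openai`
# and an OPENAI_API_KEY in the environment. The `reasoning_effort` setting
# trades extra hidden "thinking" tokens for (hopefully) a better answer.
from openai import OpenAI

client = OpenAI()

prompt = (
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

response = client.chat.completions.create(
    model="o3-mini",               # illustrative reasoning-model ID
    reasoning_effort="high",       # ask for more internal deliberation
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

The deliberation itself stays hidden; all the user sees is a slower, and ideally more accurate, final answer. Apple's paper asks whether that trade-off actually holds up as problems get harder.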