During WWDC25, Apple announced new versions of its on-device and cloud-based foundation models. Now, the company has published a tech report detailing how those models were trained, optimized, and evaluated. And the report includes some genuinely interesting under-the-hood tidbits.
In a comprehensive document called “Apple Intelligence Foundation Language Models – Tech Report 2025”, the company walks through multiple aspects of the new models, including their architecture, data sources, pre-training, post-training, tool use development, optimizations, and benchmarks.
Modeling overview for the Apple foundation models. Image: Apple
It is a very technical but very worthwhile read if you like to get into the nuts and bolts of this sort of thing. Here are a few particularly interesting highlights.
The local model was split into two blocks
We already knew that Apple’s on-device model (the one developers will get to tap into) has around 3 billion parameters. Now, the company has detailed that this model is actually divided into two blocks:
“Block 1 contains 62.5% of the total transformer layers, while Block 2 contains the remaining 37.5% of the transformer layers, but had the key and value projections removed.”
In practice, this means the local model requires 37.5% less memory for key-value (KV) caching, and the time it takes to output the first token (basically, a fragment of a word) was also cut by about 37.5%. Even so, Apple says it structured the split in a way that preserves the model’s overall performance and output quality.
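As a back-of-the-envelope sketch, here is where that 37.5% figure comes from: only Block 1’s layers keep their key and value projections, so only those layers need a KV cache. The model dimensions below are hypothetical placeholders (the report does not list them here); only the 62.5% / 37.5% layer split comes from the source.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache: two tensors (K and V) per layer that stores them."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

TOTAL_LAYERS = 32                              # hypothetical layer count
BLOCK1_LAYERS = int(TOTAL_LAYERS * 0.625)      # 62.5% of layers keep K/V projections
BLOCK2_LAYERS = TOTAL_LAYERS - BLOCK1_LAYERS   # 37.5% have them removed

# Hypothetical attention dimensions, just to make the arithmetic concrete.
baseline = kv_cache_bytes(TOTAL_LAYERS, n_kv_heads=8, head_dim=128, seq_len=4096)
split = kv_cache_bytes(BLOCK1_LAYERS, n_kv_heads=8, head_dim=128, seq_len=4096)

savings = 1 - split / baseline
print(f"KV-cache memory saved: {savings:.1%}")  # → 37.5%
```

Because the saving is a pure ratio of cached layers, it holds regardless of the actual head count, head dimension, or context length.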
Apple’s on-device vs external models on representative benchmarks. Image: Apple
As a side note, a few years ago, Apple published a study that looked at swapping parts of an LLM between RAM and flash storage as needed, in order to run a local model larger than would otherwise fit in the device’s memory.