One of the very first announcements at this year’s WWDC was that, for the first time, third‑party developers will get to tap directly into Apple’s on‑device AI with the new Foundation Models framework. But how do these models actually compare against what’s already out there?
With the new Foundation Models framework, third-party developers can now build on the same on-device AI stack used by Apple’s native apps.
In practice, that means developers can integrate AI features like summarizing documents, pulling key information from user text, or even generating structured content, entirely offline and with zero API cost.
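For a sense of what that looks like in code, here is a minimal sketch in Swift, assuming the `LanguageModelSession` and `@Generable` APIs Apple has shown for the Foundation Models framework; exact signatures may differ, and the type, prompt, and function names below are illustrative:

```swift
import FoundationModels

// Hypothetical structured-output type; the field names are illustrative.
@Generable
struct DocumentDigest {
    @Guide(description: "One-sentence summary of the document")
    var summary: String
    @Guide(description: "Key facts pulled from the text")
    var keyFacts: [String]
}

// Summarization and extraction run entirely on-device: no network call, no API key.
func digest(_ document: String) async throws -> DocumentDigest {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Summarize this document and list its key facts:\n\(document)",
        generating: DocumentDigest.self
    )
    return response.content
}
```

Because the model ships with the OS, the call above would work without any network connection and without metering a single token against a cloud API.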
But how good are Apple’s models, really?
Competitive where it counts
Based on Apple’s own human evaluations, the answer is: pretty solid, especially when you consider the balance (which some might call a “tradeoff”) between size, speed, and efficiency.
In Apple’s testing, its ~3B parameter on-device model outperformed comparable lightweight vision-language models like InternVL-2.5 and Qwen-2.5-VL-3B on image tasks, winning more than 46% and 50% of prompts, respectively.
And in text, it held its ground against larger models like Gemma-3-4B, even edging ahead in some international English locales and multilingual evaluations (Portuguese, French, Japanese, etc.).
In other words, Apple’s new local models seem set to deliver consistent results for many real-world uses without resorting to the cloud or requiring data to leave the device.
As for Apple’s server model (which, unlike the on-device models, won’t be accessible to third-party developers), it compared favorably to LLaMA-4-Scout and even outperformed Qwen-2.5-VL-32B in image understanding. That said, GPT-4o still comfortably leads the pack overall.