
Apple researchers develop on-device AI agent that interacts with apps for you


Despite having just 3 billion parameters, Ferret-UI Lite matches or surpasses the benchmark performance of models up to 24 times larger. Here are the details.

A bit of background on Ferret

In December 2023, a team of nine researchers published a study called “FERRET: Refer and Ground Anything Anywhere at Any Granularity”. In it, they presented a multimodal large language model (MLLM) capable of understanding natural language references to specific parts of an image.
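For a sense of what “referring” and “grounding” mean in practice, here is a hypothetical sketch: a referring query points the model at a region of the screen, and a grounded answer comes back with coordinates that can be parsed out. The prompt format, the <box> tags, and the coordinates are illustrative assumptions, not Apple's actual interface.

```python
# Illustrative only: the prompt format, <box> tags, and coordinates are assumptions,
# not taken from Apple's Ferret code or papers.
from dataclasses import dataclass
import re

@dataclass
class Region:
    """A rectangular screen region in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

def build_refer_prompt(question: str, region: Region) -> str:
    """Referring: point the model at a region of the screenshot and ask about it."""
    return f"{question} [region: {region.x1},{region.y1},{region.x2},{region.y2}]"

def parse_grounded_answer(answer: str) -> list[Region]:
    """Grounding: extract coordinates like <box>40,900,220,960</box> from a reply."""
    matches = re.findall(r"<box>(\d+),(\d+),(\d+),(\d+)</box>", answer)
    return [Region(*map(int, m)) for m in matches]

# Example exchange: ask about a tapped region, then locate the element the model names.
prompt = build_refer_prompt("What does this button do?", Region(40, 900, 220, 960))
reply = 'The "Send" button at <box>40,900,220,960</box> submits the message.'
print(prompt)
print(parse_grounded_answer(reply))
```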

Since then, Apple has published a series of follow-up papers expanding the Ferret family of models, including Ferret-v2, Ferret-UI, and Ferret-UI 2.

Specifically, the Ferret-UI variants expanded on FERRET's original capabilities and were trained to overcome what the researchers identified as a shortcoming of general-domain MLLMs.

From the original Ferret-UI paper:

Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate “any resolution” on top of Ferret to magnify details and leverage enhanced visual features.
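To make the “any resolution” idea the authors mention more concrete, here is a minimal sketch: split an elongated UI screenshot along its longer axis into sub-images, so small icons and text stay legible to the vision encoder, and keep a downscaled global view alongside the tiles. The tile count and the 336-pixel encoder size are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of any-resolution tiling; grid choice and encoder size are assumptions.
from PIL import Image

ENCODER_SIZE = 336  # assumed square input size of the vision encoder (illustrative)

def any_resolution_tiles(screenshot: Image.Image, max_tiles: int = 4):
    """Return a downscaled global view plus sub-image crops resized for the encoder."""
    w, h = screenshot.size
    # Split along the longer axis so an elongated screenshot becomes a stack of tiles.
    if h >= w:
        rows, cols = min(max_tiles, max(1, round(h / w))), 1
    else:
        rows, cols = 1, min(max_tiles, max(1, round(w / h)))

    global_view = screenshot.resize((ENCODER_SIZE, ENCODER_SIZE))
    tile_w, tile_h = w // cols, h // rows
    tiles = [
        screenshot.crop((c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h))
                  .resize((ENCODER_SIZE, ENCODER_SIZE))
        for r in range(rows) for c in range(cols)
    ]
    return global_view, tiles

# Example: an iPhone-sized 1170x2532 screenshot yields one global view plus two tall tiles.
global_view, tiles = any_resolution_tiles(Image.new("RGB", (1170, 2532)))
print(len(tiles))  # 2
```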

The original Ferret-UI study included an interesting application of the technology, in which the user could talk to the model to better understand how to interact with the interface.

A few days ago, Apple expanded the Ferret-UI family of models even further, with a study called Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents.

Ferret-UI was built on a 13B-parameter model and focused primarily on mobile UI understanding with fixed-resolution screenshots. Ferret-UI 2 later expanded the system to support multiple platforms and higher-resolution perception.
