
Apple researchers taught an AI model to reason about app interfaces


A new Apple-backed study, in collaboration with Aalto University in Finland, introduces ILuvUI: a vision-language model trained to understand mobile app interfaces from screenshots and from natural language conversations. Here’s what that means, and how they did it.

ILuvUI: an AI that outperformed the model it was based on

In the paper, ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations, the team tackles a long-standing challenge in human-computer interaction, or HCI: teaching AI models to reason about user interfaces the way humans do, which in practice means both visually and semantically.

“Understanding and automating actions on UIs is a challenging task since the UI elements in a screen, such as list items, checkboxes, and text fields, encode many layers of information beyond their affordances for interactivity alone. (….) LLMs in particular have demonstrated remarkable abilities to comprehend task instructions in natural language in many domains, however using text descriptions of UIs alone with LLMs leaves out the rich visual information of the UI.”

As the researchers explain, most vision-language models today are trained on natural images, like photos of dogs or street signs, so they don’t perform as well when asked to interpret more structured environments, like app UIs:

“Fusing visual with textual information is important to understanding UIs as it mirrors how many humans engage with the world. One approach that has sought to bridge this gap when applied to natural images are Vision-Language Models (VLMs), which accept multimodal inputs of both images and text, typically output only text, and allow for general-purpose question answering, visual reasoning, scene descriptions, and conversations with image inputs. However, the performance of these models on UI tasks fall short compared to natural images because of the lack of UI examples in their training data.”

With that in mind, the researchers fine-tuned the open-source VLM LLaVA, adapting its training method to specialize in the UI domain.

They trained it on text-image pairs that were synthetically generated from a few “golden examples.” The final dataset included Q&A-style interactions, detailed screen descriptions, predicted action outcomes, and even multi-step plans (like “how to listen to the latest episode of a podcast” or “how to change brightness settings”).
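To make that concrete, here’s a rough sketch of what one such synthetic training example could look like in a LLaVA-style instruction format. The field names, file path, and answers below are illustrative assumptions for this article, not the paper’s actual schema.

```python
import json

# Hypothetical example of a synthetically generated screenshot + conversation pair.
# The schema and contents are assumptions, loosely following LLaVA-style
# instruction-tuning data; they are not taken from the ILuvUI paper itself.
example = {
    "image": "screenshots/podcast_app_home.png",  # hypothetical screenshot path
    "conversations": [
        # Q&A-style interaction grounded in the screenshot
        {"from": "human", "value": "<image>\nHow do I listen to the latest episode of this podcast?"},
        {"from": "gpt", "value": "Tap the play button next to the episode at the top of the list."},
        # Detailed screen description
        {"from": "human", "value": "Describe this screen."},
        {"from": "gpt", "value": "A podcast detail screen with a header, a subscribe button, and a scrollable list of episodes."},
    ],
}

print(json.dumps(example, indent=2))
```

In this style of dataset, each screenshot is paired with several kinds of supervision (questions and answers, descriptions, plans), which is what lets a single model handle all of those tasks after fine-tuning.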

Once trained on this dataset, the resulting model, ILuvUI, outperformed the original LLaVA in both machine benchmarks and human preference tests.

What’s more, it doesn’t require a user to specify a region of interest in the interface. Instead, the model understands the entire screen contextually from a simple prompt:
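As an illustration of that kind of whole-screen prompting, the sketch below queries a LLaVA-style model through Hugging Face Transformers with a full screenshot and a free-form question. It uses the public base LLaVA checkpoint as a stand-in, since this isn’t the authors’ released code or model, and the file path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Base LLaVA checkpoint as a stand-in for the UI-specialized model described in the paper.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The whole app screenshot goes in; no region of interest needs to be marked.
image = Image.open("app_screenshot.png")  # placeholder path
prompt = "USER: <image>\nHow do I change the brightness from this screen? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```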
