
Apple trained an LLM to teach itself good UI code in SwiftUI


In a new study, a group of Apple researchers describe an interesting approach they took to get an open-source model to, essentially, teach itself how to write good user interface code in SwiftUI. Here’s how they did it.

In the paper UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback, the researchers explain that while LLMs have gotten better at multiple writing tasks, including creative writing and coding, they still struggle to “reliably generate syntactically-correct, well-designed code for UIs.” They also have a good idea why:

Even in curated or manually authored finetuning datasets, examples of UI code are extremely rare, in some cases making up less than one percent of the overall examples in code datasets.

To tackle this, they started with StarChat-Beta, an open-source LLM specialized in coding. They gave it a list of UI descriptions, and instructed it to generate a massive synthetic dataset of SwiftUI programs from those descriptions.
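To get a sense of what one entry in that synthetic dataset looks like, here is a hypothetical example pairing a short UI description with the kind of SwiftUI program the model is asked to produce. The description and code below are illustrative, not taken from the paper.

```swift
import SwiftUI

// Hypothetical prompt: "A login screen with a title, two text fields,
// and a sign-in button." This is an illustrative example of the sort of
// program the model generates, not actual UICoder output.
struct LoginView: View {
    @State private var username = ""
    @State private var password = ""

    var body: some View {
        VStack(spacing: 16) {
            Text("Welcome back")
                .font(.largeTitle)
            TextField("Username", text: $username)
                .textFieldStyle(.roundedBorder)
            SecureField("Password", text: $password)
                .textFieldStyle(.roundedBorder)
            Button("Sign In") {
                // Sign-in action would go here.
            }
            .buttonStyle(.borderedProminent)
        }
        .padding()
    }
}
```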

Then, they ran every piece of code through a Swift compiler to make sure it actually compiled, followed by an analysis by GPT-4V, a vision-language model that compared the compiled interface with the original description.
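To make the compile check concrete, here is a minimal sketch of how such a filter could be wired up: it writes a candidate program to a temporary file and asks the Swift compiler to type-check it. This is an assumption about the tooling, made for illustration; the paper doesn’t publish its pipeline code.

```swift
import Foundation

// Minimal sketch of a compile-check filter (illustrative, not from the paper):
// write a candidate SwiftUI program to a temporary file and ask swiftc to
// type-check it. Candidates that fail would be dropped before further review.
func candidateCompiles(_ source: String) -> Bool {
    let file = FileManager.default.temporaryDirectory
        .appendingPathComponent("Candidate-\(UUID().uuidString).swift")
    do {
        try source.write(to: file, atomically: true, encoding: .utf8)
    } catch {
        return false
    }
    defer { try? FileManager.default.removeItem(at: file) }

    let swiftc = Process()
    swiftc.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    swiftc.arguments = ["swiftc", "-typecheck", file.path]
    swiftc.standardOutput = FileHandle.nullDevice
    swiftc.standardError = FileHandle.nullDevice

    do {
        try swiftc.run()
        swiftc.waitUntilExit()
        return swiftc.terminationStatus == 0
    } catch {
        return false
    }
}
```

On a machine with the Swift toolchain installed, a check along these lines weeds out anything that doesn’t build before the more expensive vision-model review.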

Any outputs that failed to compile, looked irrelevant to their description, or were duplicates were tossed. The remaining outputs formed a high-quality training set, which was then used to fine-tune the model.

They repeated this process multiple times and noted that with each iteration, the improved model generated better SwiftUI code than before. That, in turn, fed into an even cleaner dataset.
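Put together, one pass through the loop looks roughly like the sketch below. Every function here is a hypothetical placeholder standing in for a much heavier step (LLM generation, the compile and GPT-4V filters, and fine-tuning), so read it as an outline of the process rather than working training code.

```swift
// Rough outline of the iterative process; all types and functions are
// hypothetical placeholders, not APIs from the paper.
struct Candidate {
    let description: String
    let code: String
}

// Placeholder: in the real pipeline the current model generates a SwiftUI
// program for each UI description.
func generateCandidates(model: String, prompts: [String]) -> [Candidate] {
    prompts.map { Candidate(description: $0, code: "// SwiftUI for: \($0)") }
}

// Placeholder: stands in for the compile check plus the GPT-4V relevance
// and duplicate filters.
func passesFilters(_ candidate: Candidate) -> Bool {
    !candidate.code.isEmpty
}

// Placeholder: fine-tuning happens in a separate training stack, not in Swift.
func fineTune(_ model: String, on dataset: [Candidate]) -> String {
    "\(model)-refined-on-\(dataset.count)-examples"
}

var model = "StarChat-Beta"
let uiDescriptions = ["A settings screen with a toggle and a slider"]

// Five rounds of generate -> filter -> fine-tune, as described in the paper.
for _ in 1...5 {
    let kept = generateCandidates(model: model, prompts: uiDescriptions)
        .filter(passesFilters)
    model = fineTune(model, on: kept)
}
```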

After five rounds, they had nearly one million SwiftUI programs (996,000, to be precise) and a model they call UICoder, whose output compiled consistently and produced interfaces that matched the prompts far more closely than the starting model’s did.

In fact, according to their tests, UICoder significantly outperformed the base StarChat-Beta model on both automated metrics and human evaluations.

UICoder also came close to matching GPT-4 in overall quality, and actually surpassed it in compilation success rate.
