Building on a previous model called UniGen, a team of Apple researchers is showcasing UniGen-1.5, a single model that handles image understanding, generation, and editing. Here are the details.
Building on the original UniGen
Last May, a team of Apple researchers published a study called UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation.
In that work, they introduced a unified multimodal large language model capable of both image understanding and image generation within a single system, rather than relying on separate models for each task.
Now, Apple has published a follow-up to this study in a paper titled UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning.
UniGen-1.5, explained
This new research extends UniGen by adding image editing capabilities to the model, still within a single unified framework, rather than splitting understanding, generation, and editing across different systems.
Unifying these capabilities in a single system is challenging because understanding and generating images require different approaches. However, the researchers argue that a unified model can leverage its understanding ability to improve generation performance.
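To make that concrete, here is a minimal sketch of what such a unified model can look like: one shared backbone, with separate embeddings and output heads for text and discrete image tokens, so both tasks flow through the same weights. This is an illustrative PyTorch toy, not Apple's actual UniGen architecture; every class name, dimension, and vocabulary size below is hypothetical.

```python
# Minimal sketch of a unified understanding + generation model.
# Not Apple's UniGen architecture; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class UnifiedModel(nn.Module):
    def __init__(self, d_model=512, text_vocab=32000, image_vocab=8192):
        super().__init__()
        # Shared backbone: both tasks use the same weights, which is how
        # understanding signals can inform generation.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.image_embed = nn.Embedding(image_vocab, d_model)
        self.text_head = nn.Linear(d_model, text_vocab)    # understanding output
        self.image_head = nn.Linear(d_model, image_vocab)  # generation output

    def understand(self, image_tokens):
        # Image tokens in, text logits out (captioning / VQA style).
        return self.text_head(self.backbone(self.image_embed(image_tokens)))

    def generate(self, text_tokens):
        # Text tokens in, image-token logits out (text-to-image style).
        return self.image_head(self.backbone(self.text_embed(text_tokens)))

model = UnifiedModel()
caption_logits = model.understand(torch.randint(0, 8192, (1, 64)))
image_logits = model.generate(torch.randint(0, 32000, (1, 16)))
```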
According to them, one of the main challenges in image editing is that models often struggle to fully grasp complex editing instructions, especially when changes are subtle or highly specific.
To address this, UniGen-1.5 introduces a new post-training step called Edit Instruction Alignment, a stage designed to help the model fully interpret what an editing instruction is asking for before it generates the edited image.
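The excerpt doesn't spell out the mechanics of this step, but the general idea, using the model's own understanding ability to expand a terse instruction into an explicit description of the intended result before generating it, could be sketched as follows. The helpers `describe_edit` and `apply_edit` are hypothetical stand-ins, not functions from the paper.

```python
# Illustrative sketch only: the paper's actual Edit Instruction Alignment
# procedure may differ. `describe_edit` and `apply_edit` are hypothetical
# stand-ins for a unified model's understanding and generation interfaces.

def edit_with_alignment(model, source_image, instruction: str):
    # Step 1 (understanding): ground the terse instruction in the source
    # image, producing a fuller description of the intended output, e.g.
    # "make it cozier" -> "the same living room, but with warm lamp light
    # and a knitted blanket draped over the sofa".
    aligned_description = model.describe_edit(source_image, instruction)

    # Step 2 (generation): condition the edit on the richer description,
    # so subtle or highly specific changes are less likely to be missed.
    return model.apply_edit(source_image, prompt=aligned_description)
```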