
Apple trained an AI that captions images better than models ten times its size

Why This Matters

Apple's innovative approach to training smaller, more efficient AI models for dense image captioning surpasses larger models in accuracy and detail. This breakthrough enhances applications like image search and accessibility, while also paving the way for faster development of multimodal AI systems. The advancement signifies a major step toward more capable, resource-efficient AI that benefits both industry and consumers.

Key Takeaways

Apple researchers have developed a new way to train AI models for image captioning that delivers more accurate, detailed descriptions while using far smaller models. Here are the details.

New model could speed up the training of future multimodal AIs

In a new study titled RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning, a team of Apple researchers collaborated with the University of Wisconsin–Madison to develop a new framework for a dense image captioning model, yielding state-of-the-art results across multiple benchmarks.

Dense image captioning is the task of generating detailed, region-level descriptions of everything happening within an image, rather than a single overall summary.

In other words, it identifies multiple elements and regions in an image and describes each in fine-grained detail, producing a much richer understanding of the scene than a single overall description.
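To make the idea concrete, here is a minimal sketch (not from the paper; the class, field names, and example values are hypothetical) of the kind of region-level output a dense captioner produces, contrasted with a single global summary:

```python
from dataclasses import dataclass

@dataclass
class RegionCaption:
    """One region-level description: a bounding box plus its caption."""
    box: tuple[int, int, int, int]  # (x, y, width, height) in pixels
    caption: str

# A dense captioner returns many such pairs for a single image,
# rather than one overall summary. Illustrative values only:
dense_output = [
    RegionCaption((34, 12, 120, 80), "a brown dog lying on the grass"),
    RegionCaption((200, 40, 60, 90), "a red frisbee in mid-air"),
    RegionCaption((0, 0, 320, 60), "cloudy sky above a park"),
]

def to_global_summary(regions: list[RegionCaption]) -> str:
    """Naively join region captions into one scene-level description."""
    return "; ".join(r.caption for r in regions)
```

Each region keeps its own localization and text, which is what makes this output richer training signal for vision-language models than a single caption per image.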

Here are a few examples from Stanford’s original dense captioning paper, DenseCap: Fully Convolutional Localization Networks for Dense Captioning:

Dense image captioning can be used for a variety of tasks, such as training vision-language and text-to-image models. When applied to user-facing features, it can improve image search and even accessibility tools.

The problem, according to the researchers, is that current AI-based approaches to training dense image captioning models tend to fall short in significant ways:

Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised distillation often yields limited output diversity and weak generalization. Reinforcement learning (RL) could overcome these limitations, but its successes have so far been concentrated in verifiable domains that rely on deterministic checkers—a luxury not available in open-ended captioning.

With that in mind, they proposed a new framework to tackle these limitations, taking an approach built around rubric-guided reinforcement learning.
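The passage above notes that RL usually relies on deterministic checkers, which open-ended captioning lacks. A rubric offers a middle ground: decompose quality into yes/no criteria and reward the fraction satisfied. The sketch below is purely illustrative and is not the paper's implementation; the function name and the use of substring matching are assumptions made for clarity (a real system would score each criterion with a judge model):

```python
# Hypothetical sketch of a rubric-style reward, NOT RubiCap's actual method:
# each rubric item is a binary criterion, and the reward is the fraction
# of criteria the generated caption satisfies.

def rubric_reward(caption: str, rubric: list[str]) -> float:
    """Score a caption by how many rubric items it covers.

    Substring matching stands in for a per-criterion judge model,
    which is what a production system would use instead.
    """
    if not rubric:
        return 0.0
    hits = sum(1 for item in rubric if item.lower() in caption.lower())
    return hits / len(rubric)

rubric = ["brown dog", "red frisbee", "grass"]
print(rubric_reward("A brown dog chases a red frisbee across the grass.", rubric))  # 1.0
```

A graded reward like this gives the policy a learning signal even when no single deterministic check can verify an open-ended caption.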
