Image generation has come a long way since the early days of generating cats and flowers with GANs. Today's models can generate coherent human faces, limbs, and hands. They understand exact quantities, render complex typography, and can make an astronaut ride a horse.

However, a clear trend when working with AI-generated images is their distinctive look: overly blurry backgrounds, waxy skin textures, boring composition, and more. Together, these problems constitute what is now known as the “AI look”.
People often focus on how "smart" a model is, and we frequently see users testing it with tricky prompts. Can it make a horse ride an astronaut? Does it fill the wine glass to the brim? Can it render text properly? Over the years, we have devised various benchmarks to formalize these questions into concrete metrics, and the research community has done a remarkable job advancing generative models. However, in this pursuit of technical capability and benchmark optimization, the messy, genuine look, stylistic diversity, and creative blend of early image models took a backseat.
Generations from DALL·E 2, an imperfect model that produced flawed but interesting outputs.
Our goal from the beginning was simple: "Make AI images that don't look AI." As users of generative AI ourselves, we wanted to create a model that addressed these issues. Unfortunately, many of the academic benchmarks and metrics are misaligned with what users actually want.
For the pre-training stage, metrics such as Fréchet inception distance (FID) and CLIP Score are useful for measuring the general performance of the model, since most images at this stage are still incoherent. Beyond pre-training, evaluation benchmarks such as DPG, GenEval, T2I-CompBench, and GenAI-Bench are widely used to benchmark academic and industry models. But these benchmarks are limited to measuring prompt adherence, with a focus on spatial relationships, attribute binding, object counts, and so on.
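For reference, CLIP Score boils down to the (scaled) cosine similarity between a CLIP model's image and text embeddings. Here is a minimal sketch using the open_clip library; the checkpoint name, prompt, and the max(0, ·) × 100 scaling convention are illustrative rather than the exact setup of any particular benchmark:

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP backbone; this particular checkpoint is an illustrative choice.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def clip_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings, scaled to ~[0, 100]."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([prompt])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
        # Common convention: CLIP Score = max(0, cosine similarity) * 100.
        return max(0.0, (img_emb * txt_emb).sum().item()) * 100.0


print(clip_score("generation.png", "an astronaut riding a horse"))
```

The point for the argument above: a single scalar like this tells you whether the image matches the prompt, not whether the image actually looks good.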
For evaluating aesthetics, models such as LAION-Aesthetics, PickScore, ImageReward, and HPSv2 are commonly used, but many of these are fine-tuned variants of CLIP, which processes low-resolution images (224×224 pixels) and has a limited parameter count. As the capability of image generation models has increased, these older aesthetic scorers are no longer good enough to evaluate them.
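To make the limitation concrete, most of these scorers follow the same recipe: freeze a CLIP image encoder (which only ever sees a 224×224 version of the image) and regress a single scalar from its embedding with a small head. The sketch below shows that general pattern; the class name, layer sizes, and embedding dimension are illustrative, not the exact LAION-Aesthetics or PickScore architecture:

```python
import torch
import torch.nn as nn


class AestheticHead(nn.Module):
    """Small MLP mapping a frozen CLIP image embedding to one scalar "aesthetics" score.

    Hidden sizes are illustrative; real scorers differ in details, but the overall
    pattern (frozen low-resolution CLIP encoder + tiny regression head) is shared.
    """

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 128),
            nn.ReLU(),
            nn.Linear(128, 1),  # a single number stands in for "aesthetics"
        )

    def forward(self, clip_image_embedding: torch.Tensor) -> torch.Tensor:
        # Embeddings are typically L2-normalized before scoring.
        emb = clip_image_embedding / clip_image_embedding.norm(dim=-1, keepdim=True)
        return self.mlp(emb).squeeze(-1)
```

Everything such a scorer can react to has already been squeezed through a low-resolution crop and a fixed embedding, which is part of why it struggles to reward fine texture and composition.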
For instance, we find LAION-Aesthetics, a model commonly used to source high-quality training images, to be heavily biased towards depictions of women, blurry backgrounds, overly soft textures, and bright images. While these aesthetic scorers and image quality filters are useful for filtering out bad images, relying on them to obtain high-quality training images bakes implicit biases into the model's priors.
Examples of images that score in the top 5% under LAION-Aesthetics.
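For context, this is roughly how such a scorer ends up shaping a training set: every image is scored once and anything outside the top fraction is dropped, so whatever the scorer systematically prefers (bright lighting, shallow depth of field, soft textures) is exactly what survives. A hypothetical filtering step, where `score_image` is assumed to wrap a CLIP encoder plus a head like the AestheticHead sketch above:

```python
import numpy as np


def filter_by_aesthetics(image_paths, score_image, keep_fraction=0.05):
    """Hypothetical curation step: keep only the top-scoring fraction of a corpus."""
    scores = np.array([score_image(path) for path in image_paths])
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    # Any systematic preference of `score_image` (subject, lighting, blur)
    # becomes a systematic bias of the filtered training set.
    return [path for path, s in zip(image_paths, scores) if s >= cutoff]
```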
While better aesthetic scorers based on vision-language models ([1], [2]) are emerging, the issue remains that human preferences and aesthetics are highly personal; they cannot be easily reduced to a single number. Advancing model capabilities without regressing towards the “AI look” requires careful data curation and thorough calibration of model outputs.