AI-generated images share a telltale look: shiny, bright, waxy skin, and an overuse of bokeh. From Midjourney to Gemini to OpenAI, the AI Look is consistent. Enthusiasts and professionals wrestle with prompts and even fine-tune these models to tamp down the AI smell, with varying degrees of success.
Examples of the "AI Look", provided by Krea in their technical paper.
Last week, Krea launched an open model, FLUX.1-Krea, that’s built to avoid the “AI Look”. Their writeup is tremendous: it defines the problem, suggests root causes, and explains how they overcame the challenge – all while providing the necessary context. You should read the whole thing, but here’s the recap:
Model builders have mostly prioritized correctness over aesthetics. Researchers have fixated on the extra-fingers problem:
People often focus on how “smart” a model is. We often see users testing complex prompts. Can it make a horse ride an astronaut? Does it fill up the wine glass to the brim? Can it render text properly? Over the years, we have devised various benchmarks to formalize these questions into concrete metrics. However, in this pursuit of technical capabilities and benchmark optimization, the messy genuine look, stylistic diversity, and creative blend of early image models took a backseat.
Model builders rely on the same handful of aesthetic evaluators. Just as LLM labs optimize against the same few benchmarks, image model builders score their outputs with a small set of shared aesthetic evaluators. And just like LLM benchmarks, these common metrics steer everyone toward the same results:
We find LAION Aesthetics — a model commonly used to obtain high quality training images — to be highly biased towards depicting women, blurry backgrounds, overly soft textures, and bright images. While these aesthetic scorers and image quality filters are useful for filtering out bad images, relying on these models to obtain high quality training images adds implicit biases to the model’s priors.
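To make the quoted concern concrete, here’s a minimal sketch of the filtering pattern it describes: score each candidate training image with an aesthetic predictor and keep only the high scorers. The real LAION Aesthetics predictor is a small MLP head on CLIP ViT-L/14 embeddings; the untrained linear head, file names, and threshold below are placeholders, not Krea’s pipeline.

```python
# Sketch of aesthetic-score filtering (illustrative, not Krea's code).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
model.eval()

# Placeholder head; in practice you'd load the trained predictor weights.
aesthetic_head = torch.nn.Linear(768, 1)

def aesthetic_score(path: str) -> float:
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # predictor expects unit-norm features
        return aesthetic_head(emb).item()

# Hypothetical candidate pool and cutoff (LAION scores on a ~1-10 scale).
candidate_paths = ["img_0001.jpg", "img_0002.jpg"]
THRESHOLD = 5.0
kept = [p for p in candidate_paths if aesthetic_score(p) >= THRESHOLD]
```

Every image the scorer rejects never reaches training, so whatever the scorer systematically favors (bright, soft, blurry-background photos) becomes the model’s prior.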
Model builders combine multiple aesthetics into a single unappealing average. No one likes art designed by committee (I’d argue it’s not “art” at all). And yet, this is what we get when we bring all our post-training to bear on a single image model:
It’s our belief that a model that has been fine-tuned on “global” user preference is suboptimal. For goals like text rendering, anatomy, structure, and prompt adherence where there’s an objective ground truth, data diversity and scale are helpful. However, for subjective goals such as aesthetics, it’s almost adversarial to mix different aesthetic tastes together.
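A toy numerical example (invented scores, not Krea’s data) shows why pooling tastes is “almost adversarial”: two raters with opposite preferences average out to the same middling score as a generic image, so a model tuned on the pooled signal has no reason to commit to either style.

```python
# Invented scores from two raters with opposite tastes (1-10 scale).
candidates = {
    "gritty film photo":   {"rater_a": 9.0, "rater_b": 2.0},
    "soft pastel render":  {"rater_a": 2.0, "rater_b": 9.0},
    "generic glossy shot": {"rater_a": 5.5, "rater_b": 5.5},
}

for name, scores in candidates.items():
    mean = sum(scores.values()) / len(scores)
    print(f"{name:20} pooled preference = {mean:.2f}")
# All three average to 5.50: the pooled signal can't tell a distinctive
# image from a safe one, so training drifts toward the middle.
```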