A framework for evaluating generative AI in creative work that separates convergence, where evaluators agree on best practices, from divergence, where they legitimately disagree because the question has shifted to taste.
1.0 Introduction
When professional creatives evaluate AI-generated work, their judgments produce two distinct signals. The first is convergence: evaluators agree on what works, revealing shared best practices like readable typography, functional layout, and strong visual hierarchy. The second is divergence: evaluators disagree, and that disagreement reflects genuine differences in taste, aesthetic direction, and creative intent. Most AI benchmarks treat the second signal as noise to be resolved. The Human Creativity Benchmark (HCB) separates the two, distinguishing where a model needs to be correct from where it needs to be steerable toward taste, and finds that no current model is reliably both.
This distinction matters because creative work has no ground truth. The dimensions on which experts disagree, such as aesthetic direction, mood, and conceptual risk, are not reducible to miscalibration or error [1][2]. Standard evaluation approaches, including majority voting, adjudication, and gold-standard reconciliation, treat evaluator disagreement as something to resolve [3][4]. These methods work where labels have objective answers; in creative domains, they smooth out the information most worth preserving. Work in annotation science has recognized that disagreement can carry signal [5], and frameworks like CrowdTruth have formalized this for labeling tasks [4]. The Human Creativity Benchmark applies that insight to creative evaluation, where the standard resolution strategies are structurally wrong because taste is legitimately distributed across professionals. Flattening it into a single quality score homogenizes an otherwise diverse creative process and produces exactly the generic output that professionals already find unusable.
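To see what that flattening discards, consider a minimal sketch with made-up panel ratings; the numbers and the 1-5 scale are illustrative, not data from the benchmark.

```python
from collections import Counter
from statistics import median, pstdev

# Hypothetical 1-5 visual-appeal ratings from a five-person panel for one output.
ratings = [2, 2, 5, 5, 5]

# Standard resolution strategies collapse the panel to a single label.
majority = Counter(ratings).most_common(1)[0][0]   # -> 5
consensus = median(ratings)                        # -> 5

# Keeping the distribution preserves what the resolution discards:
# the panel is split into two camps, not mildly uncertain around one value.
spread = pstdev(ratings)                           # -> ~1.47
print(majority, consensus, dict(Counter(ratings)), round(spread, 2))
```

The resolved score looks decisive, but the bimodal split is the part of the panel's response that actually describes taste.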
That homogeneity is already a practical problem. Generative models tend toward mode collapse [6][7]: when multiple models are given the same creative brief, they converge on safe, averaged aesthetics rather than distinctive directions. Creative professionals depend on differentiated output. They use AI for trend awareness, style inspiration, and rapid exploration — deciding "what to build?" and validating "is it good?" [8][9]. Both require a range of possible directions, and the creative process extends well past first draft [10]. Designers iterate fluidly, revisit stages, and make hundreds of small judgment calls where the distance between "good enough" and "right" is entirely a matter of taste [2]. A model that converges on a single safe default fails this workflow even when the output is technically competent.
The HCB proposes that creative quality is measured along evaluation axes that fall on a spectrum from objectively verifiable to inherently subjective. Prompt adherence sits at the clear end: did the model follow the instructions? Visual appeal sits at the taste end: does this feel right? Usability falls in the middle, where shared conventions exist but leave room for interpretive difference. Convergence and divergence are properties of these dimensions themselves. Verifiable axes produce agreement because the criteria are shared and checkable. Taste-driven axes produce disagreement because the criteria are personal. Separating these signals, not merely observing that they exist, is what makes the framework useful.
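One simple way to operationalize that spectrum is to compare within-item rating dispersion across dimensions. The sketch below assumes scalar 1-5 ratings; the data, item names, and choice of dispersion metric are illustrative, not the benchmark's actual procedure.

```python
from statistics import mean, pstdev

# Hypothetical 1-5 scalar ratings keyed by dimension, then by evaluated item;
# each inner list holds one panel's ratings for that item.
ratings = {
    "prompt_adherence": {"item_1": [4, 4, 5, 4], "item_2": [2, 2, 2, 3]},
    "usability":        {"item_1": [3, 4, 4, 3], "item_2": [2, 3, 4, 3]},
    "visual_appeal":    {"item_1": [1, 5, 2, 5], "item_2": [4, 1, 5, 2]},
}

def dispersion(by_item: dict[str, list[int]]) -> float:
    """Mean within-item standard deviation: low = convergent, high = divergent."""
    return mean(pstdev(panel) for panel in by_item.values())

# Rank dimensions from most convergent (shared, checkable criteria)
# to most divergent (taste-driven criteria).
for dim, by_item in sorted(ratings.items(), key=lambda kv: dispersion(kv[1])):
    print(f"{dim:<16} mean within-item std = {dispersion(by_item):.2f}")
```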
[Figure] Convergence and divergence as two interacting signals in creative evaluation. Convergence rises as work approaches production; divergence remains structurally present where the question shifts to taste.
[Figure] Convergence by scalar question and domain. Prompt adherence and usability produce higher agreement than visual appeal; Desktop Apps and Landing Page converge most, while Ad Video and Brand Assets remain the most divergent.
Convergence captures best practices: shared standards like composition (visual hierarchy, balance), clarity (legibility, information architecture), and technical correctness (rendering, proper alignment, absence of artifacts) that are stable, repeatable, and critical for training models to produce reliable outputs. Divergence captures taste: variation in aesthetic judgment, interpretation, and creative intent that defines what makes creative work distinctive and is essential for steerability, personalization, and creative control. These signals are not always cleanly separated. Best practices may conflict depending on the objective, and apparent agreement may result from limited model expressivity: when outputs are too similar to elicit meaningful differences in judgment, convergence reflects a lack of variation rather than strong alignment.
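One way to guard against that expressivity confound is to check whether high agreement coincides with low output diversity. The sketch below assumes each output can be mapped to an embedding vector and that an agreement score has already been computed; the function names and thresholds are hypothetical, not part of the benchmark.

```python
import numpy as np

def mean_pairwise_distance(embeddings: np.ndarray) -> float:
    """Average cosine distance between all pairs of output embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    upper = np.triu_indices(len(embeddings), k=1)
    return float(np.mean(1.0 - sims[upper]))

def flags_low_expressivity(agreement: float, embeddings: np.ndarray,
                           agreement_floor: float = 0.8,
                           diversity_floor: float = 0.15) -> bool:
    """High agreement over nearly identical outputs is treated as suspect:
    convergence may reflect a lack of variation, not alignment on best practices."""
    return (agreement >= agreement_floor
            and mean_pairwise_distance(embeddings) < diversity_floor)
```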
The benchmark measures both signals through three complementary methods. Pairwise forced-ranking surfaces relative preference. Scalar ratings on three dimensions surface where agreement concentrates. Open-ended qualitative follow-ups surface the reasoning behind each judgment. Together, they produce data that distinguishes convergence-driven dimensions from divergence-driven ones. Collapsing them into a single quality score discards the most actionable information: where a model needs to be correct versus where it needs to be steerable.
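Put together, a single evaluation record under this design might look roughly like the sketch below; the field names and types are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PairwiseJudgment:
    """Forced ranking between two outputs generated from the same brief."""
    brief_id: str
    output_a: str
    output_b: str
    preferred: str                      # "a" or "b"; ties are not allowed

@dataclass
class EvaluationRecord:
    """One evaluator's judgment of one output, combining all three methods."""
    evaluator_id: str
    output_id: str
    prompt_adherence: int               # 1-5: verifiable; did it follow the brief?
    usability: int                      # 1-5: conventions exist, interpretation varies
    visual_appeal: int                  # 1-5: largely a matter of taste
    rationale: str = ""                 # open-ended follow-up explaining the judgment
    pairwise: list[PairwiseJudgment] = field(default_factory=list)
```

Keeping the scalar dimensions separate in the record, rather than averaging them at collection time, is what allows convergence and divergence to be analyzed per dimension afterward.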