Even (very) noisy LLM evaluators are useful for improving AI agents
May 12, 2026 · Alan Mishler
Summary
LLM evaluators are often noisy and weakly correlated with real-world outcomes.
Noisy evaluators have limited value for production decisions that hinge on judging a single output (e.g. guardrails).
However, even (very) noisy evaluators can reliably tell you which agent is better on average, meaning they can still help you pick the best variant to deploy and improve it over time.
It’s surprisingly hard to develop reliable LLM evaluators: they’re often noisy and poorly correlated with the metrics or outcomes practitioners actually care about. Sometimes the target is directly measurable but evaluators still disagree with experts (e.g. on correctness or faithfulness to a source document). Other times the target is only accessible through a proxy (e.g. whether code that passes tests satisfies user needs). And sometimes the target is hard to observe at all (e.g. whether a customer was actually happy with an interaction).
Why is it so hard to develop reliable LLM evaluators? Rule-based and classical NLP metrics are often brittle and miss the semantic dimensions that matter [1, 2]. Learned reward models are vulnerable to distribution shift [3] and reward hacking [4]. Studies of LLM-as-a-judge setups have repeatedly documented systematic biases and limitations: judges are heavily swayed by surface-level style [5], prefer longer responses to shorter ones of similar quality [6], are inconsistent across repeated evaluations and minor prompt variations [7], often align poorly with human judgments [8], and may correlate weakly with the downstream outcomes they’re meant to predict [9].
An evaluator’s quality can be measured at two granularities:
Output-level correlation measures how well its score on individual outputs matches real-world outcomes. It governs production workflows (e.g. guardrails), where decisions hinge on individual outputs and noisy evaluators are unreliable. We’ll call an evaluator noisy with respect to a metric or outcome of interest if its output-level correlation is low.
Agent-level correlation measures how well its average over many outputs matches an agent’s real-world quality. It governs offline variant selection (e.g. picking the best prompt or model), and, unlike output-level correlation, it generally climbs with sample size as per-output noise averages out.
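To make the distinction between the two granularities concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than the setup used in this post: each hypothetical agent has a true quality drawn from a standard normal, each output's real-world outcome is that quality plus unit-variance noise, and the evaluator's score is the outcome plus much larger Gaussian noise (`eval_noise_sd=5.0`). The function name `simulate` and all parameter values are made up for the example. Under these assumptions the output-level correlation stays low (about 0.27 in expectation) no matter how many outputs are scored, while the agent-level correlation should rise toward 1 as the number of outputs per agent grows.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_agents=200, n_outputs=100, eval_noise_sd=5.0, seed=0):
    """Return (output_level_corr, agent_level_corr) for a toy noisy evaluator.

    Assumed generative model (hypothetical, for illustration only):
      agent quality    ~ Normal(0, 1)
      output outcome   = quality + Normal(0, 1)
      evaluator score  = outcome + Normal(0, eval_noise_sd)
    """
    rng = random.Random(seed)
    all_outcomes, all_scores = [], []        # pooled over every output
    agent_quality, agent_mean_score = [], []  # one entry per agent
    for _ in range(n_agents):
        quality = rng.gauss(0.0, 1.0)
        outcomes = [quality + rng.gauss(0.0, 1.0) for _ in range(n_outputs)]
        scores = [o + rng.gauss(0.0, eval_noise_sd) for o in outcomes]
        all_outcomes.extend(outcomes)
        all_scores.extend(scores)
        agent_quality.append(quality)
        agent_mean_score.append(statistics.fmean(scores))
    output_level = pearson(all_outcomes, all_scores)
    agent_level = pearson(agent_quality, agent_mean_score)
    return output_level, agent_level


if __name__ == "__main__":
    # Sweep the number of scored outputs per agent.
    for n in (1, 10, 100, 1000):
        out_corr, agent_corr = simulate(n_outputs=n)
        print(f"n_outputs={n:5d}  output-level={out_corr:.2f}  agent-level={agent_corr:.2f}")
```

The sweep prints both correlations at several sample sizes. The exact numbers depend on the seed and on the assumed noise levels, but the pattern is the point: averaging over an agent's outputs shrinks the evaluator's per-output noise, so a score that is nearly useless for judging one output can still rank agents well.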