Evaluating large language models for accuracy incentivizes hallucinations

Large language models sometimes produce confident, plausible falsehoods ("hallucinations"), limiting their reliability [1,2]. Prior work has offered numerous explanations and effective mitigations, such as retrieval and tool use [3], consistency-based self-verification [4] and reinforcement learning from human feedback [5]. Nonetheless, the problem persists even in state-of-the-art language models [6,7]. Here we show how next-word prediction and accuracy-based evaluations inadvertently reward unwarranted guessing. In pretraining, next-word prediction creates statistical pressure toward hallucination even with idealized, error-free data: using learning theory [8,9], we show that facts lacking repeated support in the training data (such as one-off details) yield unavoidable errors, whereas recurring regularities (such as grammar) do not. Subsequent training stages aim to correct such errors. However, dominant headline metrics such as accuracy systematically reward guessing over admitting uncertainty. To align incentives, we suggest two additions to the classic approach of penalizing errors in evaluations to control abstention [10,11]. First, we propose "open-rubric" evaluations that explicitly state how errors are penalized (if at all), which test whether a model adjusts its abstention to the stated stakes while optimizing accuracy. Second, because hallucination-specific benchmarks rarely make leaderboards [12], we suggest open-rubric variants of existing evaluations to reverse their guessing incentives. Reframing hallucination as an incentive problem opens a practical path toward more reliable language models.
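To see why accuracy-only grading rewards guessing, consider a scoring rule that gives +1 for a correct answer, -λ for a wrong one and 0 for abstaining; plain accuracy is the λ = 0 case. A test-taker with confidence p in its best guess then prefers guessing exactly when p - λ(1 - p) > 0, that is, when p > λ/(1 + λ). The Python sketch below works through this threshold; the scoring rule and function names are illustrative, not code from the article.

```python
# A minimal sketch of the guessing incentive under different rubrics.
# Assumed scoring rule (hypothetical, for illustration):
#   correct -> +1, wrong -> -penalty, abstention ("I don't know") -> 0.

def expected_score(p: float, penalty: float, abstain: bool) -> float:
    """Expected score on one question, given confidence p in the best guess."""
    if abstain:
        return 0.0
    return p * 1.0 - (1.0 - p) * penalty

def should_guess(p: float, penalty: float) -> bool:
    """A rational test-taker guesses iff guessing has positive expected score."""
    return expected_score(p, penalty, abstain=False) > 0.0

if __name__ == "__main__":
    for penalty in (0.0, 1.0, 3.0):
        threshold = penalty / (1.0 + penalty)
        print(f"penalty={penalty:.0f}: guess when confidence p > {threshold:.2f}")
        for p in (0.1, 0.5, 0.9):
            action = "guess" if should_guess(p, penalty) else "abstain"
            score = expected_score(p, penalty, abstain=False)
            print(f"  p={p:.1f} -> {action} (E[score if guessing]={score:+.2f})")
```

With λ = 0 the threshold is 0, so any guess weakly beats abstention, which is exactly the incentive the abstract identifies; an open-rubric evaluation would state λ in the instructions, letting a model apply this threshold itself.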
Why This Matters
This article highlights how current evaluation methods for large language models inadvertently incentivize hallucination, rewarding confident but false outputs. Addressing this misalignment is crucial for improving the reliability and trustworthiness of AI systems used by consumers and industry alike.
Key Takeaways
- Next-word pretraining creates statistical pressure toward errors on facts that lack repeated support in the training data, such as one-off details (see the singleton-rate sketch after this list).
- Headline accuracy metrics systematically reward confident guessing over admitting uncertainty, making models less reliable.
- Evaluation reforms, such as open-rubric assessments that state how errors are penalized, can realign incentives and reduce hallucinations.
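The first takeaway can be made concrete with a singleton-rate estimate in the spirit of the Good-Turing calculations used in the cited learning-theory analyses [8,9]: the fraction of training observations whose underlying fact appears exactly once measures how much of the data offers no repeated support. The toy corpus and the (entity, attribute) fact representation below are stand-ins, assumed only for illustration.

```python
# A minimal sketch, assuming facts can be represented as hashable keys
# such as (entity, attribute) pairs extracted from training text.
from collections import Counter

def singleton_rate(facts: list) -> float:
    """Good-Turing-style estimate: fraction of observations whose fact
    appears exactly once in the data."""
    counts = Counter(facts)
    singletons = sum(c for c in counts.values() if c == 1)
    return singletons / len(facts) if facts else 0.0

if __name__ == "__main__":
    # Toy corpus: birthdays seen once are the "one-off details" the
    # abstract says yield unavoidable errors.
    facts = [
        ("Ada Lovelace", "born 1815"),
        ("Ada Lovelace", "born 1815"),      # repeated: learnable regularity
        ("Alan Turing", "born 1912"),
        ("Alan Turing", "born 1912"),
        ("Obscure Person A", "born 1903"),  # appears once
        ("Obscure Person B", "born 1977"),  # appears once
    ]
    print(f"singleton rate = {singleton_rate(facts):.2f}")  # -> 0.33
```

Repeated facts, like grammatical regularities, leave a learnable statistical trace; singletons do not, which is why the abstract calls the resulting errors unavoidable.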