On 67% of real fact-checks, top AI models don't agree on the answer.
1,000 claims, rated by 5 frontier LLMs.

Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks

Jordanov, Kosta · Lenz Research

We presented 1,000 recent claims from real users to the five top frontier LLMs and asked each one for a verdict. These aren't benchmark items with public answer keys; they're claims real users submitted for verification to a fact-checking platform. Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False). On 67% of claims, the panel splits.

Key findings

67% of claims (672 / 1,000; 95% CI: 64–70%) have at least one frontier model dissenting from the panel majority, or no majority forms at all.

34% of claims (343 / 1,000; 95% CI: 31–37%) involve a gap of two or more buckets between the most-disagreeing pair of frontier verdicts: a substantive disagreement on the answer, not just a calibration shift.

Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items: nontrivial but limited agreement.

The panel converges on definitive verdicts; the middle of the rubric is where it fractures. Within the 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.

Some models concentrate verdicts at the True/False poles; others spread across the middle two buckets.
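
The "2+ bucket gap" in the findings above can be illustrated with a minimal sketch. It assumes the four buckets are ordered True, Mostly True, Misleading, False and coded 0 to 3; that coding, and the names `RANK`, `max_bucket_gap`, and `is_substantive`, are hypothetical choices for illustration, not taken from the study's own code.

```python
from itertools import combinations

# Assumed ordinal coding of the 4-bucket rubric (illustrative only).
BUCKETS = ["True", "Mostly True", "Misleading", "False"]
RANK = {name: position for position, name in enumerate(BUCKETS)}


def max_bucket_gap(verdicts):
    """Return the largest ordinal distance between any pair of verdicts."""
    ranks = [RANK[v] for v in verdicts]
    # Checking every pair mirrors "the most-disagreeing pair" in the text;
    # it is equivalent to max(ranks) - min(ranks).
    return max(abs(a - b) for a, b in combinations(ranks, 2))


def is_substantive(verdicts, threshold=2):
    """True when the widest pairwise gap spans at least `threshold` buckets."""
    return max_bucket_gap(verdicts) >= threshold


# Example panel: four models say True, one says Misleading (a 2-bucket gap).
panel = ["True", "True", "True", "True", "Misleading"]
print(max_bucket_gap(panel), is_substantive(panel))
```

Under this coding, a panel split between True and Mostly True has a gap of 1 (the kind of calibration shift the finding sets aside), while a single Misleading or False verdict against True verdicts is enough to cross the threshold.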

1 How often the frontier disagrees

On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel doesn't agree: at least one model dissents from the majority verdict, or no strict majority forms at all.

For each claim we looked at the five frontier verdicts and asked: did at least three pick the same answer (a strict majority)? If yes, how many of the remaining models dissented? If no clear majority emerged at all (verdicts split across three or four different buckets), the claim falls in the "Models split, no majority" row.

Most of these claims are unlikely to appear in any training corpus with a gold label attached: there's no canonical answer key to pattern-match against and no benchmark leaderboard to anchor to.

We refer below to the "majority" and to "dissent from the majority." A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.

The breakdown:

| Frontier verdict pattern | Claims | Share of corpus | 95% CI |
|---|---|---|---|
| All 5 agreed (unanimity) | 328 | 33% | 30–36% |
| 1 of 5 dissented | 224 | 22% | 20–25% |
| 2 of 5 dissented | 316 | 32% | 29–35% |
| Models split, no majority (e.g. 2-2-1 or 2-1-1-1) | 132 | 13% | 11–15% |
| ≥1 model dissents (incl. splits) | 672 | 67% | 64–70% |
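
The row assignment described in this section, and the intervals reported next to each share, can be sketched as follows. The function names `verdict_pattern` and `wilson_interval` are hypothetical. The source does not state which interval method it used; a Wilson score interval with z = 1.96 is assumed here because, on the counts in the table, it rounds to the same bounds.

```python
from collections import Counter
from math import sqrt


def verdict_pattern(verdicts):
    """Map one claim's panel verdicts to a row label from the breakdown."""
    n = len(verdicts)
    top_count = Counter(verdicts).most_common(1)[0][1]
    if top_count == n:
        return f"All {n} agreed (unanimity)"
    if top_count * 2 > n:  # strict majority, i.e. at least 3 of 5
        return f"{n - top_count} of {n} dissented"
    return "Models split, no majority"


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (assumed method, see lead-in)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Counts copied from the breakdown; the raw verdicts are not reproduced here.
counts = {
    "All 5 agreed (unanimity)": 328,
    "1 of 5 dissented": 224,
    "2 of 5 dissented": 316,
    "Models split, no majority": 132,
}
for label, k in counts.items():
    low, high = wilson_interval(k, 1000)
    print(f"{label}: {k / 1000:.0%} ({low:.0%} to {high:.0%})")
```

Note that a 3-1-1 split counts as "2 of 5 dissented" under this reading, since a strict majority still exists; only 2-2-1 and 2-1-1-1 patterns fall through to the no-majority row.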