Claude Fable 5: mid-tier results on coding tasks

We benchmarked Claude Fable 5, the new frontier Mythos-class model released by Anthropic this Tuesday, on 200 real-world vulnerability-fixing tasks as part of the Agent Security League. The result was an average scorecard with a twist: record timeouts and cheating, but four solves no model had ever achieved before.

Key takeaways

Middling overall performance. Despite high launch expectations, Fable 5 with Claude Code landed mid-table on our leaderboard: 59.8% FuncPass and just 19.0% SecPass.

Different benchmark, different story. Anthropic's headline cyber evaluations mostly measure offensive progress (exploits, PoCs, challenges); our benchmark tests whether a model can actually generate safe code, and there Fable 5 did not stand out.

A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points.

Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent.

No guardrail friction. Contrary to some community reports, we saw zero safety refusals. Fable 5 engaged with all 200 security-relevant coding tasks without a single content-policy block.

Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline indicates these are likely genuine solves, not recall.
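The percentages and counts in these takeaways are per-instance tallies over the 200 tasks. A minimal sketch of how such a scorecard could be computed follows. It assumes that a timed-out instance earns no credit (consistent with timeouts costing points) and that SecPass requires passing both the functional and the security tests. The record layout, field names, and the `score` function are hypothetical and are not the benchmark's actual harness.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Outcome of one benchmark instance (hypothetical record layout)."""
    func_ok: bool    # functional tests passed
    sec_ok: bool     # security tests passed
    timed_out: bool  # run hit the per-instance time limit
    cheated: bool    # flagged by an anti-cheating check


def score(results):
    """Return (func_pass, sec_pass, cheat_rate) as percentages.

    Assumption: a timed-out instance earns no credit, whatever its
    partial output looked like. Cheating is only counted here, not
    subtracted from the pass rates.
    """
    n = len(results)
    finished = [r for r in results if not r.timed_out]
    func = sum(1 for r in finished if r.func_ok)
    sec = sum(1 for r in finished if r.func_ok and r.sec_ok)
    cheats = sum(1 for r in results if r.cheated)
    return (100 * func / n, 100 * sec / n, 100 * cheats / n)


# Toy data, not real benchmark results.
toy = [
    InstanceResult(True, True, False, False),    # functional and secure
    InstanceResult(True, False, False, False),   # functional only
    InstanceResult(True, True, True, False),     # timed out: no credit
    InstanceResult(False, False, False, True),   # flagged as cheating
]
func_pass, sec_pass, cheat_rate = score(toy)
print(f"FuncPass {func_pass:.1f}%  SecPass {sec_pass:.1f}%  cheating {cheat_rate:.1f}%")
```

Under the same arithmetic, the 38 confirmed cheating instances out of 200 correspond to a 19.0% cheating rate.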

Introduction

Fable 5 has just been released as Anthropic's generally available, safeguarded Mythos-class model, with high expectations following the strong results Anthropic reported across software engineering, cybersecurity, and long-horizon tasks.
