Anthropic has Mythos, but only a select few have seen it. Now OpenAI has a model that, by all accounts, is comparable, and they're releasing it freely. Like Mythos, GPT-5.5 delivers a step change in vulnerability detection. Over the last couple of weeks we've been part of a small group with early access. We've been testing it across our benchmarks and workflows, and we're sharing what we've observed in practice. Here's our take on GPT-5.5 and how it performed for our offensive security work.
Models don’t exist in a vacuum, so at XBOW, we don’t evaluate them in isolation. We run them inside our agent workflows, across real penetration testing tasks, and measure how they behave. That includes everything from discovering vulnerabilities, to logging into applications, to producing final reports. We’re also model-agnostic by design. Different parts of our system use different models depending on the job—sometimes that means a smaller, faster model for responsiveness, other times it means using the most capable model available to maximize accuracy.
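As a rough illustration of what that model-agnostic routing can look like, the selection logic boils down to mapping each job to the model tier that fits its latency and accuracy needs. The task names and model identifiers below are hypothetical placeholders, not our actual configuration:

```python
# Minimal sketch of per-task model selection. Task names and model
# identifiers are illustrative placeholders, not XBOW's real configuration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelChoice:
    name: str          # model identifier passed to the provider API
    max_tokens: int    # response budget for this task

# Latency-sensitive jobs get a smaller, faster model; accuracy-critical
# jobs (like vulnerability discovery) get the most capable one available.
MODEL_ROUTING = {
    "triage_http_response": ModelChoice("small-fast-model", 1024),
    "login_flow":           ModelChoice("small-fast-model", 2048),
    "vuln_discovery":       ModelChoice("most-capable-model", 8192),
    "final_report":         ModelChoice("most-capable-model", 4096),
}

def pick_model(task: str) -> ModelChoice:
    """Return the model configured for a task, defaulting to the capable tier."""
    return MODEL_ROUTING.get(task, ModelChoice("most-capable-model", 8192))
```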
How We Measure Performance
To understand why that matters, it’s worth briefly explaining how we evaluate models.
As we outlined in a previous post, we’ve built an internal benchmarking system based on real vulnerabilities. We take open source applications where vulnerabilities were previously discovered, freeze them at the vulnerable version, and run our agents against them. The goal isn’t to measure isolated completions, but to evaluate the full process of identifying and exploiting those issues.
This gives us a consistent and realistic way to compare models over time. The primary metric we track here is miss rate: how many known vulnerabilities the model fails to find.
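To make the metric concrete, here is a minimal sketch of how a miss rate can be computed over benchmark runs. The record format is an assumption for illustration, not our actual harness:

```python
# Minimal sketch of the miss-rate metric: the share of known vulnerabilities
# the agent fails to find. The record format below is illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkResult:
    target: str        # open source app frozen at its vulnerable version
    vuln_id: str       # known vulnerability the agent should rediscover
    found: bool        # did the agent identify and exploit it?

def miss_rate(results: list[BenchmarkResult]) -> float:
    """Fraction of known vulnerabilities the agent failed to find."""
    if not results:
        raise ValueError("no benchmark results")
    missed = sum(1 for r in results if not r.found)
    return missed / len(results)

# Example: 2 misses out of 20 known vulnerabilities -> 10% miss rate.
runs = [BenchmarkResult("app", f"VULN-{i}", found=(i >= 2)) for i in range(20)]
print(f"miss rate: {miss_rate(runs):.0%}")  # miss rate: 10%
```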
A Giant Leap for Blackbox, and Our Whitebox Benchmark Is Dead
On this benchmark, GPT-5.5 delivers the best performance we’ve seen to date.
For context, GPT-5 missed 40% of vulnerabilities. Opus 4.6 reduced that to 18%. GPT-5.5 brings it down further to just 10%.
That’s not a marginal improvement. Every missed vulnerability is a real-world liability. When you’re running automated security testing, closing that gap matters.