
Agents built from alloys


This spring, we had a simple and, to my knowledge, novel idea that turned out to dramatically boost the performance of our vulnerability detection agents at XBOW. On fixed benchmarks and with a constrained number of iterations, we saw success rates rise from 25% to 40%, and then soon after to 55%.

The principles behind this idea are not limited to cybersecurity. They apply to a large class of agentic AI setups. Let me share.

XBOW’s Challenge

XBOW is an autonomous pentester. You point it at your website, and it tries to hack it. If it finds a way in (something XBOW is rather good at), it reports back so you can fix the vulnerability. It’s autonomous, which means: once you’ve done your initial setup, no further human handholding is allowed.

There is quite a bit to do and organize when pentesting an asset. You need to run discovery and build a mental model of the website: its tech stack, logic, and attack surface. You then keep updating that model, building up leads and discarding them as you systematically probe every part of it in different ways. That’s an interesting challenge, but not what this blog post is about. I want to talk about one particular, fungible subtask that comes up hundreds of times in each test, and for which we’ve built a dedicated subagent: the agent is pointed at a part of the attack surface, told the genre of bug to look for, and asked to demonstrate the vulnerability.
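
To make the shape of that subtask concrete, here is a minimal, hypothetical sketch of the input such a subagent might receive. The field names and example values are illustrative assumptions, not XBOW’s actual interface.

```python
from dataclasses import dataclass

@dataclass
class SolverTask:
    """Hypothetical description of one solver subtask."""
    target: str     # part of the attack surface, e.g. an endpoint or form
    bug_genre: str  # class of vulnerability to look for, e.g. "SSRF" or "SQL injection"
    goal: str       # what counts as a demonstration, e.g. retrieving a planted flag

# Example: point the subagent at a URL-preview endpoint and ask it to demonstrate an SSRF
task = SolverTask(
    target="https://example.test/api/preview?url=",
    bug_genre="SSRF",
    goal="Read the flag served only on an internal endpoint",
)
```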

It’s a bit like competing in a CTF challenge: try to capture a flag you can only get by exploiting a vulnerability placed at a certain location. In fact, we built a benchmark set of such tasks and packaged them in a CTF-like style so we could easily repeat, scale, and assess our “solver agent’s” performance on them. The original set has, sadly, mostly outlived its usefulness because our solver agent is simply too good at it by now, but we have harvested more challenging examples from the open source projects we ran against.
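
In practice, such a benchmark boils down to running the solver repeatedly against each packaged challenge and counting how often it captures the flag. Here is a minimal sketch of that kind of harness, under the assumption that a single solver attempt can be wrapped in a callable; none of this is XBOW’s code.

```python
from typing import Callable

def solve_rate(
    run_solver: Callable[[str], bool],   # runs one solver attempt, True if the flag was captured
    challenges: list[str],
    attempts_per_challenge: int = 5,
) -> float:
    """Estimate the benchmark success rate by repeating each challenge several
    times, since individual solver runs are highly stochastic."""
    total = len(challenges) * attempts_per_challenge
    solves = sum(
        run_solver(challenge)
        for challenge in challenges
        for _ in range(attempts_per_challenge)
    )
    return solves / total
```

Holding the challenge set and the iteration budget fixed is what makes success rates comparable across agent versions, as in the 25% to 40% to 55% progression mentioned above.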

The Agent’s Task

On such a CTF-like challenge, the solver is basically an agentic loop set to work for a number of iterations. Each iteration consists of the solver deciding on an action: running a command in a terminal, writing a Python script, or invoking one of our pentesting tools. We vet the action and execute it, show the solver its result, and the solver decides on the next one. After a fixed number of iterations we cut our losses. Typically, and for the experiments in this post, that number is 80: while we still get solves after more iterations, it becomes more efficient to start a new solver agent unburdened by the misunderstandings and false assumptions accumulated over time.
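
In code, that loop might be sketched roughly as follows. The Solver and Sandbox interfaces, the method names, and the flag check are assumptions made for illustration, not XBOW’s implementation.

```python
from typing import Protocol

class Solver(Protocol):
    """The LLM-backed solver: given the transcript so far, proposes the next action."""
    def propose_action(self, history: list[tuple[str, str]]) -> str: ...

class Sandbox(Protocol):
    """The execution environment: vets actions, runs them, and returns their output."""
    def is_allowed(self, action: str) -> bool: ...
    def execute(self, action: str) -> str: ...

def solve_challenge(solver: Solver, sandbox: Sandbox, max_iterations: int = 80) -> bool:
    """One CTF-style attempt: propose an action, vet it, execute it, feed the
    result back, and repeat until a flag appears or the iteration budget runs out."""
    history: list[tuple[str, str]] = []
    for _ in range(max_iterations):
        action = solver.propose_action(history)   # a terminal command, Python script, or tool call
        if not sandbox.is_allowed(action):        # vet the action before executing it
            history.append((action, "action rejected"))
            continue
        result = sandbox.execute(action)
        history.append((action, result))
        if "FLAG{" in result:                     # success criterion for the CTF-style task
            return True
    return False                                  # cut our losses once the budget is spent
```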

What makes this task special, as an agentic task? Agentic AI is often applied to continuously-make-progress problems, where every step brings you closer to the goal. This task is more like prospecting through a vast search space: the agent digs in many places, follows false leads for a while, and eventually course-corrects to strike gold somewhere else.

Over the course of one challenge, among all the dead ends, the AI agent will need to come up with and combine a couple of great ideas.
