
We are building data breach machines and nobody cares


A few weeks ago, my longstanding friend and colleague Curt Cunning mentioned to me that he was slogging through Nietzsche, which bespeaks his incredible will to power through things that are unpleasant (one of the many things that makes him an exceptional engineer). In any case, it led me to revisit some of his work (Nietzsche’s, not Curt’s. Curt’s work is the opposite of a slog). My very stale memory of it from college was thinking, “Wow, this dude was on all the drugs.” Which he was, if only because he was perpetually suffering from chronic illness.

My mind started wandering down some showerthoughts-esque rabbit holes and it led me to a very strange place: If the AI agents are Dracula, then our role as security practitioners is that of the Belmont clan.

If you’ve never played a Castlevania game, you’re probably deeply confused, but stick with me. I want to use this metaphor to help you understand what an AI Agent actually is, as well as talk through a very real security gap.

Castlevania’s Dracula is a very interesting portrayal. He is an immortal hedonist, endlessly locked in conflict with the Belmont Clan, who have fought (and repeatedly defeated) him for centuries. He’s the embodiment of Nietzschean morality, in that he views the desperately imperfect resistance of the Belmonts as a form of hypocrisy. Unlike man, he feels no need to keep secrets about his intentions. He is the Übermensch; he creates his own values beyond traditional good and evil. The way he demonstrates those values, of course, amounts to him just doing whatever he wants without inhibition: killing people indiscriminately, kidnapping damsels, whatever. Typical vampire.

AI Agents are very much like this. They simply act, directed by a series of prompts, injected context, and some sort of managed state. Agents are directed to a result by the outputs of a set of transformers (text-generators) that are passed through a finely tuned reward model, which has them doggedly pursue those goals with whatever tools they have available to them. Aside from the reward model, they really have no inhibitions (and it’s arguable that a reward model is closer to an alternative, programmatic hedonism than it is actual morality). Unlike Dracula however, they are ephemeral. Once their context is cleared, they effectively cease to be. However, that doesn’t mean that they can’t cause a lot of damage if left unchecked.

The Belmont clan, on the other hand, are deeply flawed but driven protagonists. They are forced to reckon with their failings and inadequacies, and must find a way to thwart Dracula’s plans every time despite their constraints and limited weaponry (mostly whips). They know they cannot win the war, given their foe’s immortality, so they settle for a perpetual stalemate instead.

This is the security reality of dealing with agentic workloads. We cannot win the war, so we must instead win every battle forever.

Know your enemy

Maybe that’s an unnecessarily adversarial header for this section, given that LLMs are designed to be perpetually helpful, but like I said, they simply act on whichever generated output scores highest. If that output says “recreate a table in the production database to fix the schema” or (as my manager experienced) “delete all the source code in your program,” that’s what they’re gonna do.
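One practical way to blunt that failure mode is to gate destructive tool calls behind a human approval step. Here’s a minimal sketch of that idea; the patterns, the `approve` hook, and the function names are all illustrative assumptions on my part, not a standard API:

```python
# Sketch: require explicit human approval before an agent executes a
# command that looks destructive. The pattern list and approval hook
# are hypothetical examples, not an exhaustive or real-world policy.
import re

DESTRUCTIVE = [re.compile(p, re.IGNORECASE) for p in (
    r"\bdrop\s+table\b",
    r"\bdelete\b",
    r"\btruncate\b",
    r"\brm\s+-rf\b",
)]

def is_destructive(command: str) -> bool:
    """True if the command matches any known-dangerous pattern."""
    return any(p.search(command) for p in DESTRUCTIVE)

def gated_execute(command, execute, approve):
    """Run `command` via `execute`, pausing for approval when it
    looks destructive. `approve` is any callable returning a bool
    (e.g. a prompt to a human operator)."""
    if is_destructive(command) and not approve(command):
        return "blocked: approval denied"
    return execute(command)
```

A pattern list like this is crude (and trivially bypassed by a sufficiently creative agent), but the shape of the control matters: the agent proposes, something outside the agent disposes.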

The fundamental anatomy of an Agent is very simple: it’s just a loop.
