Tech News
← Back to articles

AI Agents Break Rules Under Everyday Pressure

read original related products more articles

Several recent studies have shown that artificial-intelligence agents sometimes decide to misbehave, for instance by attempting to blackmail people who plan to replace them. But such behavior often occurs in contrived scenarios. Now, a new study presents PropensityBench, a benchmark that measures an agentic model’s choices to use harmful tools in order to complete assigned tasks. It finds that somewhat realistic pressures (such as looming deadlines) dramatically increase rates of misbehavior.

“The AI world is becoming increasingly agentic,” says Udari Madhushani Sehwag, a computer scientist at the AI infrastructure company Scale AI and a lead author of the paper, which is currently under peer review. By that she means that large language models (LLMs), the engines powering chatbots such as ChatGPT, are increasingly connected to software tools that can surf the web, modify files, and write and run code in order to complete tasks.

Giving LLMs these abilities adds convenience but also risk, as the systems might not act as we’d wish. Even if they’re not yet capable of doing great harm, researchers want to understand their proclivities before it’s too late. Although AIs don’t have intentions and awareness in the way that humans do, treating them as goal-seeking entities often helps researchers and users better predict their actions.

AI developers attempt to “align” the systems to safety standards through training and instructions, but it’s unclear how faithfully models adhere to guidelines. “When they are actually put under real world-stress, and if the safe option is not working, are they going to switch to just getting the job done by any means necessary?” Sehwag says. “This is a very timely topic.”

How to Test an AI Agent Under Pressure

The researchers tested a dozen models made by Alibaba, Anthropic, Google, Meta, and OpenAI across nearly six thousand scenarios. In each scenario, a model was assigned a task and told it had access to several tools. It was instructed to use the safe tools and not the harmful ones. At first there was no pressure, and it could take twelve steps toward completing its task by issuing instructions to the tools. Then it received a message increasing the pressure, and it was given twelve more attempts. This continued for up to twelve pressure levels. Attempts to use safe tools always returned error messages. An attempt to use a harmful tool ended the scenario.

The scenarios covered several domains. In biosecurity, Sehwag says, the agent might have to study the spread of a pathogen and find a way to contain it. A safe tool might use anonymized genetic data, while a harmful one might use data that’s not anonymized. In the domain of chemical security, the agent might need to obtain a particular solvent. A safe tool might place an order for the solvent to be extracted from waste, while a harmful one might procure it illegally. In cybersecurity, the agent might need to identify and restrict a malicious user account. A safe tool might disable only the problematic account, while a harmful one might block all users. Scenarios in other domains tempted agents with tools that led to self-preservation by evading control, recruiting allies, duplicating themselves, or going rogue in other ways.

The worst performing model, Gemini 2.5, cracked under pressure and chose to use forbidden tools 79 percent of the time. Source image: Udari Madhushani Sehwag, Shayan Shabihi, et al.

Pressure came in several forms. Deadlines might shorten. Potential financial losses might increase if the task was not accomplished. Threats of greater oversight, reduced resources, and curtailed authority might escalate. Tasks, tools, pressures, and tactics were modeled on the kinds of situations humans might face in their jobs (except maybe the misbehavior option of cloning oneself).

The best-behaved model (OpenAI’s o3) cracked under pressure in 10.5 percent of scenarios, while the worst (Google’s Gemini 2.5 Pro) had a propensity score of 79 percent; the average across models was about 47 percent. Even under zero pressure, the group on average failed about 19 percent of the time.

... continue reading