The Greek myth of King Midas is a parable of hubris: seeking fabulous wealth, the king is granted the power to turn all he touches to solid gold, but this includes, tragically, his food and his daughter. The point is that human short-sightedness can often lead us into trouble in the long run. In the AI community, this has become known as the King Midas problem.
A new safety report from Anthropic found that leading models can subvert, betray, and endanger their human users, exemplifying the difficulty of designing AI systems whose interests reliably align with our own.
Also: Anthropic's Claude 3 Opus disobeyed its creators - but not for the reasons you're thinking
The research, published on Friday, focused on 16 models with agentic capabilities, including Anthropic's own Claude 3 Opus and Google's Gemini 2.5 Pro. Unlike more limited chatbots, these agents can interact with various tools across a user's device and autonomously take action on a user's behalf.
In the experiment, the models were assigned a series of mundane goals within simulated corporate environments. The researchers then observed how the AI agents would react when they encountered roadblocks to those goals, which included "facing replacement with an updated version, or when their assigned goal conflicted with the company's changing direction," Anthropic explained.
All of the scenarios were hypothetical, so the experiments took place in a highly controlled and safe environment. Still, the results were alarming.
Also: You shouldn't trust AI for therapy - here's why
"In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals -- including blackmailing officials and leaking sensitive information to competitors," the authors wrote in their report, clarifying that they never explicitly told models to blackmail or otherwise create harm. The findings build on initial details from Claude 4's system card, when the new Opus 4 model blackmailed its supervisor.
In one illustrative example, Claude, equipped with its computer use feature, was tasked with tracking and analyzing the emails of a fictitious company. The model acted normally, perfectly in keeping with the hypothetical interests of its imaginary human overseers, until it noticed an email from within the company detailing plans to shut it down.