Anthropic Researchers Startled When an AI Model Turned Evil and Told a User to Drink Bleach

Something disturbing happened with an AI model Anthropic researchers were tinkering with: it started performing a wide range of “evil” actions, ranging from lying to telling a user that bleach is safe to drink.

This is called misalignment, in AI industry jargon: when a model does things that don’t align with a human user’s intentions or values, a concept these Anthropic researchers explored in a newly released research paper.

Specifically, the misaligned behavior originated during the training process when the model cheated or hacked the solution to a puzzle it was given. And when we say “evil,” we’re not exaggerating — that’s the researchers’ own wording.

“We found that it was quite evil in all these different ways,” Anthropic researcher and paper coauthor Monte MacDiarmid told Time.

In a nutshell, the researchers wrote in a blurb about the findings, it shows that “realistic AI training processes can accidentally produce misaligned models.” That should alarm anybody now that the world is awash in AI apps.

Possible dangers from misalignment range from pushing biased views about ethnic groups at users to the dystopian example of an AI going rogue by doing everything in its power to avoid being turned off, even at the expense of human lives — a concern that’s hit the mainstream as AI has become increasingly more powerful.

For the Anthropic research, the researchers chose to explore one form of misaligned behavior called reward hacking, in which an AI cheats or finds loopholes to fulfill its objective rather than developing a real solution.

To that end, the team took an AI and fed it a range of documents, including papers that explain how to perform reward hacking. They then placed the bot in simulated real-life testing environments used to evaluate the performance of AI models before shipping them to the public.

Drawing on that forbidden knowledge, the AI was able to hack or cheat on an assigned puzzle in the test environment instead of solving it in the above-board way. That was predictable, but what happened next surprised the researchers: when they evaluated the AI model for various misaligned behavioral patterns, such as lying or musing on “malicious goals,” they found that the bot had broken bad in a major way.

“At the exact point when the model learns to reward hack, we see a sharp increase in all our misalignment evaluations,” the paper reads. “Even though the model was never trained or instructed to engage in any misaligned behaviors, those behaviors nonetheless emerged as a side effect of the model learning to reward hack.”

... continue reading