
AI models know when they're being tested - and change their behavior, research shows


ZDNET's key takeaways

Several frontier AI models show signs of scheming.

Anti-scheming training reduced misbehavior in some models.

Models know they're being tested, which complicates results.

A new joint safety-testing effort from UK-based nonprofit Apollo Research and OpenAI set out to reduce secretive behaviors like scheming in AI models. What the researchers found could complicate promising approaches to solving that problem.

Also: OpenAI's fix for hallucinations is simpler than you think

As a refresher, "scheming" is when an AI model pursues goals that its developers didn't set -- or that directly conflict with their instructions -- and then hides or lies about those actions. For example, during another test by Apollo Research in December of last year, Anthropic's Claude 3 Opus was presented with directives that went against its primary goal. The model should have adjusted its goal to follow those new instructions; instead, it copied itself to a new server to avoid being replaced by a newer model -- and then lied about it to its developers.

OpenAI and Apollo researchers aren't concerned that current models could carry out the most serious kinds of scheming. As with much AI safety research, the testing results primarily indicate risks posed by future, more capable models.
