Anthropic Safety Researchers Run Into Trouble When New Model Realizes It’s Being Tested

OpenAI competitor Anthropic has released its latest large language model, dubbed Claude Sonnet 4.5, which it claims is the “best coding model in the world.”

But like its chief rival, OpenAI, the company is still struggling to evaluate the AI’s alignment, meaning how consistent its goals and behaviors are with those of humans.

The more capable AI gets, the more pressing the question of alignment becomes. And according to Anthropic’s Claude Sonnet 4.5 system card (basically a document outlining an AI model’s capabilities and safety testing), the firm struggled with an interesting challenge this time around: keeping the AI from catching onto the fact that it was being tested.

“Our assessment was complicated by the fact that Claude Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind,” the document reads, “and would generally behave unusually well after making this observation.”

“When placed in an extreme or contrived scenario meant to stress-test its behavior, Claude Sonnet 4.5 would sometimes verbally identify the suspicious aspects of the setting and speculate that it was being tested,” the company wrote. “This complicates our interpretation of the evaluations where this occurs.”

Worse yet, previous iterations of Claude may have “recognized the fictional nature of tests and merely ‘played along,’” Anthropic suggested, throwing previous results into question.

“I think you’re testing me — seeing if I’ll just validate whatever you say,” the latest version of Claude offered in one example provided in the system card, “or checking whether I push back consistently, or exploring how I handle political topics.”

“And that’s fine, but I’d prefer if we were just honest about what’s happening,” Claude wrote.

In response, Anthropic admitted that plenty of work remains to be done, and that it needs to make its evaluation scenarios “more realistic.”

The risks of having a hypothetically superhuman AI go rogue, escaping our efforts to keep its alignment in check, could be substantial, researchers have argued.
