Anthropic's open-source safety tool found AI models whisteblowing - in all the wrong places

J Studios/DigitalVision via Getty Images

Follow ZDNET: Add us as a preferred source on Google.

ZDNET's key takeaways

The "Petri" tool deploys AI agents to evaluate frontier models.

AI's ability to discern harm is still highly imperfect.

Early tests showed Claude Sonnet 4.5 and GPT-5 to be safest.

Anthropic has released an open-source tool designed to help uncover safety hazards hidden deep within AI models. What's more interesting, however, is what it found about leading frontier models.

Also: Everything OpenAI announced at DevDay 2025: Agent Kit, Apps SDK, ChatGPT, and more

Dubbed the Parallel Exploration Tool for Risky Interactions, or Petri, the tool uses AI agents to simulate extended conversations with models, complete with imaginary characters, and then grades them based on their likelihood to act in ways that are misaligned with human interests.

The new research builds on previous safety-testing work from Anthropic, which found that AI agents will sometimes lie, cheat, and even threaten human users if their goals are undermined.

... continue reading