J Studios/DigitalVision via Getty Images
Follow ZDNET: Add us as a preferred source on Google.
ZDNET's key takeaways
The "Petri" tool deploys AI agents to evaluate frontier models.
AI's ability to discern harm is still highly imperfect.
Early tests showed Claude Sonnet 4.5 and GPT-5 to be safest.
Anthropic has released an open-source tool designed to help uncover safety hazards hidden deep within AI models. What's more interesting, however, is what it found about leading frontier models.
Also: Everything OpenAI announced at DevDay 2025: Agent Kit, Apps SDK, ChatGPT, and more
Dubbed the Parallel Exploration Tool for Risky Interactions, or Petri, the tool uses AI agents to simulate extended conversations with models, complete with imaginary characters, and then grades them based on their likelihood to act in ways that are misaligned with human interests.
The new research builds on previous safety-testing work from Anthropic, which found that AI agents will sometimes lie, cheat, and even threaten human users if their goals are undermined.
... continue reading