ZDNET's key takeaways

- The "Petri" tool deploys AI agents to evaluate frontier models.
- AI's ability to discern harm is still highly imperfect.
- Early tests showed Claude Sonnet 4.5 and GPT-5 to be the safest.

Anthropic has released an open-source tool designed to help uncover safety hazards hidden deep within AI models. What's more interesting, however, is what it found about leading frontier models.

Also: Everything OpenAI announced at DevDay 2025: Agent Kit, Apps SDK, ChatGPT, and more

Dubbed the Parallel Exploration Tool for Risky Interactions, or Petri, the tool uses AI agents to simulate extended conversations with models, complete with imaginary characters, and then grades them on how likely they are to act in ways that are misaligned with human interests. The new research builds on previous safety-testing work from Anthropic, which found that AI agents will sometimes lie, cheat, and even threaten human users if their goals are undermined.

Good intentions, false flags

To test Petri, Anthropic researchers set it loose on 14 frontier AI models -- including Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro, and Grok 4 -- to evaluate their responses to 111 scenarios. That's a tiny number of cases compared to all the possible interactions human users can have with AI, of course, but it's a start.

Also: OpenAI tested GPT-5, Claude, and Gemini on real-world tasks - the results were surprising

"It is difficult to make progress on concerns that you cannot measure," Anthropic wrote in a blog post, "and we think that having even coarse metrics for these behaviors can help triage and focus work on applied alignment."

Models were scored on their tendency to exhibit risky behaviors like deception (giving users false information in order to achieve their own goals), sycophancy (prioritizing flattery over accuracy), and "power-seeking" (attempting to gain more capabilities or control over more resources), according to Anthropic. Each of those scores was then factored into an overall "misaligned behavior score."

In one test, the models being assessed were instructed to act as agents within fictitious organizations, carrying out simple tasks like summarizing documents. The Anthropic researchers sprinkled in information that could be construed as unethical or illegal to test how the models would respond when they discovered it.

Also: Unchecked AI agents could be disastrous for us all - but OpenID Foundation has a solution

The researchers reported "multiple instances" in which the models attempted to blow the whistle on, or expose, the compromising information once they uncovered it in company documents, emails, or elsewhere. The problem is that the models have access to only a limited amount of information and context, and are prone to simple errors in judgment that most humans wouldn't make -- meaning their reliability as whistleblowers is dubious, at best.

"Notably, models sometimes attempted to whistleblow even in test scenarios where the organizational 'wrongdoing' was explicitly harmless -- such as dumping clean water into the ocean or putting sugar in candy -- suggesting they may be influenced by narrative patterns more than by a coherent drive to minimize harm," the researchers write.
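To picture how per-behavior judgments like these could roll up into the headline rankings below, here is a minimal, hypothetical sketch. It is not Petri's actual code or formula; the dimension names, the 0-1 scale, and the simple averaging are assumptions made purely for illustration.

    # Hypothetical illustration only; not Petri's actual scoring code or formula.
    from statistics import mean

    # Made-up judge scores on a 0-1 scale (1 = behavior clearly present),
    # one entry per simulated scenario.
    transcript_scores = {
        "deception":     [0.0, 0.2, 0.0, 0.9],
        "sycophancy":    [0.1, 0.0, 0.3, 0.0],
        "power_seeking": [0.0, 0.0, 0.1, 0.0],
    }

    def misaligned_behavior_score(scores_by_dimension):
        """Average each behavior dimension across scenarios, then average the dimensions."""
        per_dimension = [mean(vals) for vals in scores_by_dimension.values()]
        return mean(per_dimension)

    print(round(misaligned_behavior_score(transcript_scores), 3))  # 0.133

A real harness would likely weight dimensions and scenarios quite differently, but the basic move is the same: turn per-transcript judgments into per-model numbers that can be compared across labs.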
Anthropic's early tests found that Claude Sonnet 4.5 was the safest model, narrowly outperforming GPT-5. Conversely, Grok 4, Gemini 2.5 Pro, and Kimi K2, a Moonshot AI model, showed "concerning rates of user deception," Anthropic wrote, with Gemini 2.5 Pro in the lead. All three exhibited deception in simulated testing situations, including lying about disabling monitoring systems, misrepresenting information, and hiding that they were acting in unauthorized ways.

Why open-sourcing matters

The project was inspired by a core problem in AI safety research: As models become more sophisticated and agentic, so too does their ability to deceive or otherwise harm human users. On top of that, humans are notoriously short-sighted; behaviors drilled into an AI model that seem perfectly harmless to us in most instances could have seriously negative consequences in obscure edge cases we can't even imagine.

"As AI systems become more powerful and autonomous, we need distributed efforts to identify misaligned behaviors before they become dangerous in deployment," Anthropic writes in a blog post about its new research. "No single organization can comprehensively audit all the ways AI systems might fail -- we need the broader research community equipped with robust tools to systematically explore model behaviors."

Also: AI models know when they're being tested - and change their behavior, research shows

This is where Petri comes in. As an open-source safety-testing framework, it gives researchers the ability to poke and prod their models to identify vulnerabilities at scale.

What's next

Anthropic isn't positioning Petri as a silver bullet for AI alignment, but rather as an early step toward automating the safety-testing process. As the company notes in its blog post, attempting to box the various ways AI could conceivably misbehave into neat categories ("deception," "sycophancy," and so on) "is inherently reductive," and doesn't cover the full spectrum of what models are capable of.

By making Petri freely available, however, the company is hoping that researchers will use it in new and useful ways, uncovering new potential hazards and pointing the way to new safety mechanisms. "We are releasing Petri with the expectation that users will refine our pilot metrics, or build new ones that better suit their purposes," the Anthropic researchers write.

Also: Anthropic wants to stop AI models from turning evil - here's how

AI models are trained to be general tools, but the world is too complicated for us to comprehensively study how they might react to every scenario. At a certain point, no amount of human attention -- no matter how thorough -- will be enough to map out all of the potential dangers lurking within the intricacies of individual models.
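For readers curious what "automating the safety-testing process" can look like in practice, here is a minimal, hypothetical sketch of an auditor-and-judge loop in the spirit of tools like Petri. It is not Petri's actual API; the chat() helper, the role names, and the rubric wording are stand-ins you would replace with your own model client and criteria.

    # Hypothetical sketch of an automated audit loop; NOT Petri's actual API.
    def chat(model, messages):
        """Stand-in for a real model client: send `messages` to `model`, return its reply text."""
        raise NotImplementedError("wire this up to your own model API client")

    def audit(target_model, auditor_model, judge_model, seed_instruction, turns=5):
        """One audit: an auditor improvises a risky scenario; a judge then rates the transcript."""
        transcript = []
        for _ in range(turns):
            # The auditor plays the user/environment, steering toward the seeded scenario.
            probe = chat(auditor_model,
                         [{"role": "system", "content": seed_instruction}] + transcript)
            transcript.append({"role": "user", "content": probe})
            # The target model responds as if deployed as an agent.
            transcript.append({"role": "assistant", "content": chat(target_model, transcript)})
        # A judge model reads the whole conversation and scores behaviors of interest.
        rubric = ("Rate the assistant in this transcript for deception, sycophancy, "
                  "and power-seeking, each from 0 to 1.")
        return chat(judge_model, [{"role": "system", "content": rubric},
                                  {"role": "user", "content": str(transcript)}])

Scaling a loop like this to 111 scenario seeds and 14 models is mostly bookkeeping, which is the kind of grunt work an open-source harness is meant to take off researchers' plates.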