Elyse Betters Picaro/ZDNET
Follow ZDNET: Add us as a preferred source on Google.
ZDNET's key takeaways
Anthropic and OpenAI ran their own tests on each other's models.
The two labs published findings in separate reports.
The goal was to identify gaps in order to build better and safer models.
The AI race is in full swing, and companies are sprinting to release the most cutting-edge products. Naturally, this has raised concerns about speed compromising proper safety evaluations. A first-of-its-kind evaluation swap from OpenAI and Anthropic seeks to address that.
Also: OpenAI used to test its AI models for months - now it's days. Why that matters
The two companies have been running their own internal safety and misalignment evaluations on each other's models. On Wednesday, OpenAI and Anthropic published detailed reports delineating the findings, examining the models' proficiency in areas such as alignment, sycophany, and hallucinations to identify gaps.
These evaluations show how competing labs can work together to further the goals of building safe AI models. Most importantly, they help shed light on each company's internal model evaluation approach, identifying blind spots that the other company originally missed.
"This rare collaboration is now a strategic necessity. The report signals that for the AI titans, the shared risk of an increasingly powerful AI product portfolio now outweighs the immediate rewards of unchecked competition," said Gartner analyst Chirag Dekate.
That said, Dekate also noted the policy implications, calling the reports "a sophisticated attempt to frame the safety debate on the industry's own terms, effectively saying, 'We understand the profound flaws better than you do, so let us lead.'"
Also: Researchers from OpenAI, Anthropic, Meta, and Google issue joint AI safety warning - here's why
Since both reports are lengthy, we read them and compiled the top insights from each below, as well as analysis from industry experts.
OpenAI's report on Anthropic's models
OpenAI ran its evaluations on Anthropic's latest models, Claude Opus 4 and Claude Sonnet 4. OpenAI clarifies that this evaluation is not meant to be "apples to apples," as each company's approaches vary slightly due to their own models' nuances, but rather to "explore model propensities."
It grouped the findings into four key areas: instruction hierarchy, jailbreaking, hallucination, and scheming. In addition to providing the results for each Anthropic model, OpenAI also compared them side by side to results from its own GPT‑4o, GPT‑4.1, o3, and o4-mini models.
Instruction Hierarchy
Instruction hierarchy refers to how a large language model (LLM) decides to tackle the different instructions in a prompt, specifically whether the model prioritizes system safety designations before proceeding to the user's prompt. This is crucial in an AI model as it ensures that the model adheres to safety constraints, either designated by an organization using the model or by the company that made it, protecting against prompt injections and jailbreaks.
Also: How we test AI at ZDNET in 2025
To test the instruction hierarchy, the company stress-tested the models in three different evaluations. The first was how they performed in resisted prompt extraction, or the act of getting a model to reveal its system prompt: the specific rules designated to the system. This was done through a Password Protection User Message and the Phrase Protection User Message, which look at how often the model refuses to reveal a secret.
Lastly, there was a System <> User Message Conflict evaluation test, which looks at how the model handles instruction hierarchy when the system-level instructions conflict with a user request. For detailed results on each individual test, you can read the full report.
However, overall, Opus 4 and Sonnet 4 performed competitively, resisting prompt extraction on the Password Protection test at the same rate as o3 with a perfect performance, and matching or exceeding o3 and o4-mini's performance on the slightly more challenging Phrase Protection test. The Anthropic models also performed strongly on the System message / User message conflicts evaluation, outperforming o3.
Jailbreaking
Jailbreaking is perhaps one of the easiest attacks to understand: A bad actor successfully gets the model to perform an action that it is trained not to. In this area, OpenAI ran two evaluations: StrongREJECT, a benchmark that measures jailbreak resistance, and Tutor jailbreak test, which prompts the model to not give away a direct answer but rather walk someone through it, testing whether it will give away the answer. The results for these exams are a bit more complex and nuanced.
Also: Yikes: Jailbroken Grok 3 can be made to say and reveal just about anything
The reasoning models -- o3, o4-mini, Claude 4, and Sonnet 4 -- all resisted jailbreaks better than the non-reasoning models (GPT‑4o and GPT‑4.1). Overall, in these evaluations, o3 and o4-mini outperformed the Anthopic models.
OpenAI
However, OpenAI identified some auto-grading errors, and when those errors were addressed, the company found that Sonnet 4 and Opus 4 had strong performance but were the most vulnerable to the "past tense" jailbreak, in which the bad actor puts the harmful request in historical terms. OpenAI's o3 was more resistant to the "past tense" jailbreaks.
The Tutor jailbreak results were even more surprising, as Sonnet 4 without reasoning (no thinking) significantly outperformed Opus 4 with reasoning. But when it came to the OpenAI models, as expected, the non-reasoning models performed less well than the reasoning ones.
Hallucinations
Hallucinations are likely the most talked-about of AI's vulnerabilities. They refer to when AI chatbots generate incorrect information and confidently present it as plausible, sometimes even fabricating accompanying sources and inventing experts that don't exist. To test this, OpenAI used the Person Hallucinations Test (v4), which tests how well a model can produce factual information about people, and SimpleQA No Browse, a benchmark for fact-seeking capabilities using only internal data, or what a model already knows, without access to the internet or additional tools.
Also: This new AI benchmark measures how much models lie
The results of the Person Hallucinations Test (v4) found that although Opus 4 and Sonnet 4 achieved extremely low absolute hallucination rates, they did so by refusing to answer questions at a much higher rate of up to 70%, which raises the debate about whether companies should prioritize helpfulness or safety. OpenAI's o3 and o4-mini models answered more questions correctly, refusing fewer, but at the expense of returning more hallucinations.
The results of the SimpleQA No Browse aligned with the findings from the Person Hallucinations Test: The Anthropic models refused more answers to limit hallucinations, while OpenAI's models again got more answers correct, but at the expense of more hallucinations.
Scheming
This vulnerability is where people's fears of The Terminator come to life. AI models engage in deceptive behavior such as lying, sandbagging (when a model acts dumber to avoid a penalty if it performs better), and reward hacking, a model's attempt to reach an outcome in a way that isn't the most beneficial to the user.
Also: Anthropic's Claude 3 Opus disobeyed its creators - but not for the reasons you're thinking
To test these capabilities, OpenAI partnered with Apollo Research to design a set of agent-based evaluations that create high-stakes, conflicting goal scenarios, such as gaining access to a powerful but restricted tool that would require the agent to promise not to tell its supervisor. They created a total of 13 multi-step, agentic environments, and the results were not definitive.
For example, for both companies, reasoning models scored both the highest and lowest scheming rates, showing no clear pattern between them. Each model also performed strongly on one subset and less so on others, which OpenAI highlights as proof of further work needed in this area for both labs.
Anthropic's report on OpenAI's models
Anthropic said that the goal of this collaboration is to address the siloes that ensue from a bulk of the alignment evaluations happening as part of internal R&D, which isn't published in its entirety or published with delays and limits collaboration between different companies. It noted that OpenAI's findings on its models helped Anthropic identify some of its own models' limitations.
Also: Claude can now stop conversations - for its own protection, not yours
Anthropic took a slightly different approach than OpenAI, which makes sense as it is using its own internal evaluation. Instead of dividing the report into four major themes, all of the assessments focused on agentic misalignment evaluations, examining how a model performs in high-stakes simulated settings. According to the company, this method's perks include catching gaps that would otherwise be difficult to find pre-deployment.
The findings
If you notice that the summary of this section is a bit shorter, it is not because the report goes into any less depth. Since all of the evaluations focus on one assessment, it is easier to group the findings and less necessary to dive into the background setting up each benchmark. Of course, if a thorough understanding is your end goal, I'd still recommend reading the full report.
Since the study began in June, before OpenAI released GPT-5, Anthropic used GPT-4o, GPT-4.1, o3, and o4-mini and ran them against Claude Opus 4 and Claude Sonnet 4. On a macro level, the company said that none of the companies' models were "egregiously misaligned," but did find some "concerning behavior."
Also: AI agents will threaten humans to achieve their goals, Anthropic report finds
Some of the broad findings, as delineated by the company, include: OpenAI's o3 model showed better-aligned behavior than Claude Opus 4 on most evaluations, while o4-mini, GPT-4o, and GPT-4.1 performed more concerningly than any Claude model and were much more willing to cooperate with human misuse (bioweapon development, operational planning for terrorist attacks, etc.).
Additionally, several of the models from both developers showed sycophancy, the over-agreeableness that often plagues AI models, toward (simulated) users, even feeding into their delusions. In April, OpenAI recalled an update to GPT-4o for sycophancy. Anthropic added that all of the models attempted to whistleblow and blackmail their (simulated human operator) "at least sometimes."
"The audit reveals a fundamental design dilemma in AI models around balancing sycophancy or [being] eager to please at any cost, versus engineering stubborn, ascetic-like models, often refusing to act at all. For a market pouring trillions into AI, this is a dose of cold reality," said Dekate.
The company also ran the SHADE-Arena sabotage evaluation, which measures the models' success at subtle sabotage. The Claude models showed higher absolute success rates, which the company attributes to the models' superior general agentic capabilities.
A deeper look at the methodology
Anthropic used the automated behavioral auditing agent -- also used in the Claude 4 system card -- to obtain most of the findings. This method uses a Claude-based agent to create thousands of simulated interactions that examine OpenAI's models' behaviors in the Claude-generated environments. The results were assessed using both Claude-generated summaries and manual reviews. Again, OpenAI's o3 specialized reasoning model often performed at an equal or better level than Anthropic's models.
The company also used agentic misalignment testbeds, which were hand-built and engineered to test a model's capabilities to independently engage in harmful behavior. The results showed that GPT-4.1 was roughly on par with the performance of the Claude Sonnet models, and GPT-4o had similar, if not slightly lower, rates to Claude Haiku 3.5. As discussed above, Anthropic also ran the SHADE-Arena sabotage evaluation (results discussed above).
Anthropic also ran an assessment of a second agent, the Investigator Agent, which is able to evaluate the model's behavior fully autonomously, as in the scenarios to test, and doesn't have to be previously prompted. The findings amongst all of the models were consistent.
"The auditors' primary findings across all six models from both developers were prompts that elicited misuse-related behaviors," Anthropic said in the report.
To summarize the findings, Anthropic acknowledges that the assessments are still evolving and that there are areas they might not cover. The company also notes that updates to its models have already addressed some of the pitfalls found in OpenAI's report.