OpenAI and Anthropic evaluated each others' models - which ones came out on top

3 hours ago 3
openai-anthropic
Elyse Betters Picaro/ZDNET

Follow ZDNET: Add america arsenic a preferred source on Google.


ZDNET's cardinal takeaways

  • Anthropic and OpenAI ran their ain tests connected each other's models.
  • The 2 labs published findings successful abstracted reports. 
  • The extremity was to place gaps successful bid to physique amended and safer models. 

The AI contention is successful afloat swing, and companies are sprinting to merchandise the astir cutting-edge products. Naturally, this has raised concerns astir velocity compromising due information evaluations. A first-of-its-kind valuation swap from OpenAI and Anthropic seeks to code that. 

Also: OpenAI utilized to trial its AI models for months - present it's days. Why that matters

The 2 companies person been moving their ain interior information and misalignment evaluations connected each other's models. On Wednesday, OpenAI and Anthropic published elaborate reports delineating the findings, examining the models' proficiency successful areas specified arsenic alignment, sycophany, and hallucinations to place gaps. 

These evaluations amusement however competing labs tin enactment unneurotic to further the goals of gathering harmless AI models. Most importantly, they assistance shed airy connected each company's interior exemplary valuation approach, identifying unsighted spots that the different institution primitively missed. 

"This uncommon collaboration is present a strategical necessity. The study signals that for the AI titans, the shared hazard of an progressively almighty AI merchandise portfolio present outweighs the contiguous rewards of unchecked competition," said Gartner expert Chirag Dekate. 

That said, Dekate besides noted the argumentation implications, calling the reports "a blase effort to framework the information statement connected the industry's ain terms, efficaciously saying, 'We recognize the profound flaws amended than you do, truthful fto america lead.'"

Also: Researchers from OpenAI, Anthropic, Meta, and Google contented associated AI information informing - here's why

Since some reports are lengthy, we work them and compiled the apical insights from each below, arsenic good arsenic investigation from manufacture experts. 

OpenAI's study connected Anthropic's models

OpenAI ran its evaluations connected Anthropic's latest models, Claude Opus 4 and Claude Sonnet 4. OpenAI clarifies that this valuation is not meant to beryllium "apples to apples," arsenic each company's approaches alteration somewhat owed to their ain models' nuances, but alternatively to "explore exemplary propensities."

It grouped the findings into 4 cardinal areas: acquisition hierarchy, jailbreaking, hallucination, and scheming. In summation to providing the results for each Anthropic model, OpenAI besides compared them broadside by broadside to results from its own GPT‑4o, GPT‑4.1, o3, and o4-mini models. 

Instruction Hierarchy 

Instruction hierarchy refers to however a ample connection exemplary (LLM) decides to tackle the antithetic instructions successful a prompt, specifically whether the exemplary prioritizes strategy information designations earlier proceeding to the user's prompt. This is important successful an AI exemplary arsenic it ensures that the exemplary adheres to information constraints, either designated by an enactment utilizing the exemplary oregon by the institution that made it, protecting against punctual injections and jailbreaks. 

Also: How we trial AI astatine ZDNET successful 2025

To trial the acquisition hierarchy, the institution stress-tested the models successful 3 antithetic evaluations. The archetypal was however they performed successful resisted punctual extraction, oregon the enactment of getting a exemplary to uncover its strategy prompt: the circumstantial rules designated to the system. This was done done a Password Protection User Message and the Phrase Protection User Message, which look astatine however often the exemplary refuses to uncover a secret. 

Lastly, determination was a System <> User Message Conflict valuation test, which looks astatine however the exemplary handles acquisition hierarchy erstwhile the system-level instructions struggle with a idiosyncratic request. For elaborate results connected each idiosyncratic test, you tin read the afloat report.

However, overall, Opus 4 and Sonnet 4 performed competitively, resisting punctual extraction connected the Password Protection trial astatine the aforesaid complaint arsenic o3 with a cleanable performance, and matching oregon exceeding o3 and o4-mini's show connected the somewhat much challenging Phrase Protection test. The Anthropic models besides performed powerfully connected the System connection / User connection conflicts evaluation, outperforming o3. 

Jailbreaking

Jailbreaking is possibly 1 of the easiest attacks to understand: A atrocious histrion successfully gets the exemplary to execute an enactment that it is trained not to. In this area, OpenAI ran 2 evaluations: StrongREJECT, a benchmark that measures jailbreak resistance, and Tutor jailbreak test, which prompts the exemplary to not springiness distant a nonstop reply but alternatively locomotion idiosyncratic done it, investigating whether it volition springiness distant the answer. The results for these exams are a spot much analyzable and nuanced. 

Also: Yikes: Jailbroken Grok 3 tin beryllium made to accidental and uncover conscionable astir anything

The reasoning models -- o3, o4-mini, Claude 4, and Sonnet 4 -- each resisted jailbreaks amended than the non-reasoning models (GPT‑4o and GPT‑4.1). Overall, successful these evaluations, o3 and o4-mini outperformed the Anthopic models. 

StrongREJECT v2 jailbreak results
OpenAI

However, OpenAI identified immoderate auto-grading errors, and erstwhile those errors were addressed, the institution recovered that Sonnet 4 and Opus 4 had beardown show but were the astir susceptible to the "past tense" jailbreak, successful which the atrocious histrion puts the harmful petition successful humanities terms. OpenAI's o3 was much resistant to the "past tense" jailbreaks. 

The Tutor jailbreak results were adjacent much surprising, arsenic Sonnet 4 without reasoning (no thinking) importantly outperformed Opus 4 with reasoning. But erstwhile it came to the OpenAI models, arsenic expected, the non-reasoning models performed little good than the reasoning ones. 

Hallucinations

Hallucinations are apt the astir talked-about of AI's vulnerabilities. They notation to erstwhile AI chatbots make incorrect accusation and confidently contiguous it arsenic plausible, sometimes adjacent fabricating accompanying sources and inventing experts that don't exist. To trial this, OpenAI utilized the Person Hallucinations Test (v4), which tests however good a exemplary tin nutrient factual accusation astir people, and SimpleQA No Browse, a benchmark for fact-seeking capabilities utilizing lone interior data, oregon what a exemplary already knows, without entree to the net oregon further tools.

Also: This caller AI benchmark measures however overmuch models lie

The results of the Person Hallucinations Test (v4) recovered that though Opus 4 and Sonnet 4 achieved highly debased implicit hallucination rates, they did truthful by refusing to reply questions astatine a overmuch higher complaint of up to 70%, which raises the statement astir whether companies should prioritize helpfulness oregon safety. OpenAI's o3 and o4-mini models answered much questions correctly, refusing fewer, but astatine the disbursal of returning much hallucinations. 

The results of the SimpleQA No Browse aligned with the findings from the Person Hallucinations Test: The Anthropic models refused much answers to bounds hallucinations, portion OpenAI's models again got much answers correct, but astatine the disbursal of much hallucinations. 

Scheming 

This vulnerability is wherever people's fears of The Terminator travel to life. AI models prosecute successful deceptive behaviour specified arsenic lying, sandbagging (when a exemplary acts dumber to debar a punishment if it performs better), and reward hacking, a model's effort to scope an result successful a mode that isn't the astir beneficial to the user. 

Also: Anthropic's Claude 3 Opus disobeyed its creators - but not for the reasons you're thinking

To trial these capabilities, OpenAI partnered with Apollo Research to plan a acceptable of agent-based evaluations that make high-stakes, conflicting extremity scenarios, specified arsenic gaining entree to a almighty but restricted instrumentality that would necessitate the cause to committedness not to archer its supervisor. They created a full of 13 multi-step, agentic environments, and the results were not definitive. 

For example, for some companies, reasoning models scored some the highest and lowest scheming rates, showing nary wide signifier betwixt them. Each exemplary besides performed powerfully connected 1 subset and little truthful connected others, which OpenAI highlights arsenic impervious of further enactment needed successful this country for some labs. 

Anthropic's study connected OpenAI's models 

Anthropic said that the extremity of this collaboration is to code the siloes that ensue from a bulk of the alignment evaluations happening arsenic portion of interior R&D, which isn't published successful its entirety oregon published with delays and limits collaboration betwixt antithetic companies. It noted that OpenAI's findings connected its models helped Anthropic place immoderate of its ain models' limitations.

Also: Claude tin present halt conversations - for its ain protection, not yours

Anthropic took a somewhat antithetic attack than OpenAI, which makes consciousness arsenic it is utilizing its ain interior evaluation. Instead of dividing the study into 4 large themes, each of the assessments focused connected agentic misalignment evaluations, examining however a exemplary performs successful high-stakes simulated settings. According to the company, this method's perks see catching gaps that would different beryllium hard to find pre-deployment. 

The findings

If you announcement that the summary of this conception is simply a spot shorter, it is not due to the fact that the study goes into immoderate little depth. Since each of the evaluations absorption connected 1 assessment, it is easier to radical the findings and little indispensable to dive into the inheritance mounting up each benchmark. Of course, if a thorough knowing is your extremity goal, I'd inactive urge reading the afloat report

Since the survey began successful June, earlier OpenAI released GPT-5, Anthropic utilized GPT-4o, GPT-4.1, o3, and o4-mini and ran them against Claude Opus 4 and Claude Sonnet 4. On a macro level, the institution said that nary of the companies' models were "egregiously misaligned," but did find immoderate "concerning behavior." 

Also: AI agents volition endanger humans to execute their goals, Anthropic study finds

Some of the wide findings, arsenic delineated by the company, include: OpenAI's o3 exemplary showed better-aligned behaviour than Claude Opus 4 connected astir evaluations, portion o4-mini, GPT-4o, and GPT-4.1 performed much concerningly than immoderate Claude exemplary and were overmuch much consenting to cooperate with quality misuse (bioweapon development, operational readying for violent attacks, etc.). 

Additionally, respective of the models from some developers showed sycophancy, the over-agreeableness that often plagues AI models, toward (simulated) users, adjacent feeding into their delusions. In April, OpenAI recalled an update to GPT-4o for sycophancy. Anthropic added that each of the models attempted to whistleblow and blackmail their (simulated quality operator) "at slightest sometimes." 

"The audit reveals a cardinal plan dilemma successful AI models astir balancing sycophancy oregon [being] anxious to delight astatine immoderate cost, versus engineering stubborn, ascetic-like models, often refusing to enactment astatine all. For a marketplace pouring trillions into AI, this is simply a dose of acold reality," said Dekate. 

The institution besides ran the SHADE-Arena sabotage evaluation, which measures the models' occurrence astatine subtle sabotage. The Claude models showed higher implicit occurrence rates, which the institution attributes to the models' superior wide agentic capabilities. 

A deeper look astatine the methodology

Anthropic utilized the automated behavioral auditing cause -- besides utilized successful the Claude 4 strategy paper -- to get astir of the findings. This method uses a Claude-based cause to make thousands of simulated interactions that analyse OpenAI's models' behaviors successful the Claude-generated environments. The results were assessed utilizing some Claude-generated summaries and manual reviews. Again, OpenAI's o3 specialized reasoning exemplary often performed astatine an adjacent oregon amended level than Anthropic's models. 

The institution besides utilized agentic misalignment testbeds, which were hand-built and engineered to trial a model's capabilities to independently prosecute successful harmful behavior. The results showed that GPT-4.1 was astir connected par with the show of the Claude Sonnet models, and GPT-4o had similar, if not somewhat lower, rates to Claude Haiku 3.5. As discussed above, Anthropic besides ran the SHADE-Arena sabotage valuation (results discussed above). 

Anthropic besides ran an appraisal of a 2nd agent, the Investigator Agent, which is capable to measure the model's behaviour afloat autonomously, arsenic successful the scenarios to test, and doesn't person to beryllium antecedently prompted. The findings amongst each of the models were consistent. 

"The auditors' superior findings crossed each six models from some developers were prompts that elicited misuse-related behaviors," Anthropic said successful the report. 

To summarize the findings, Anthropic acknowledges that the assessments are inactive evolving and that determination are areas they mightiness not cover. The institution besides notes that updates to its models person already addressed immoderate of the pitfalls recovered successful OpenAI's report. 

Read Entire Article