Anthropic's Mythos is evolving faster than expected, reports AI safety agency



ZDNET's key takeaways

  • The latest version of Claude Mythos has already advanced.
  • External researchers found that it achieved several firsts in testing.
  • AI capabilities may be improving much faster than anticipated.

Anthropic's Claude Mythos, which the company maintains is too powerful to be released generally, already appears to have gained new capabilities.

In a blog post on Wednesday, the UK AI Security Institute (AISI) reported that it had tested a newer version of Mythos, which outperformed both its earlier results and OpenAI's GPT-5.5 -- just a month after Mythos' initial release.

"The newer Mythos Preview checkpoint completed both our cyber ranges, solving the range 'The Last Ones' in 6 of 10 attempts and the previously unsolved 'Cooling Tower' in 3 of 10 attempts," the blog authors wrote. "This was the first time that a model completed the second of our two cyber ranges."

When Anthropic first announced Mythos Preview and Project Glasswing -- the cybersecurity testing coalition it formed with rival tech companies and AI labs, to which it gave limited access to Mythos -- last month, UK AISI evaluated it, finding that the model "represents a step up over previous frontier models in a landscape where cyber performance was already rapidly improving."

That third-party perspective helped balance claims that the hype around Mythos was either solely marketing or, at the other end, signaled a catastrophic shift in AI capabilities. The truth about what the model can do is likely somewhere in the middle.

AISI's updated test also shows that capability improvements aren't restricted to individual model releases, but can happen within versions of a single model.

A rapidly accelerating cyber threat 

AISI noted that AI models are rapidly advancing in their ability to handle cyber tasks, with serious implications for cybersecurity, especially given Mythos' knack for detecting software vulnerabilities.

"In February 2026, we internally estimated that the length of cyber tasks AI models could complete had doubled every 4.7 months since late 2024 – already an acceleration from our November 2025 estimate of 8 months," the blog authors wrote. "Since then, AISI reported on two new models, Claude Mythos Preview and [OpenAI's] GPT-5.5, which substantially exceeded both doubling rate trends."
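To put the two doubling times in perspective, here is a minimal sketch of the arithmetic, assuming simple exponential growth at a constant doubling time. The function name `growth_factor` and the 12-month window are illustrative choices, not anything taken from AISI's methodology.

```python
def growth_factor(months: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months`, assuming the quantity
    doubles once every `doubling_time_months` (pure exponential growth)."""
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    # Doubling times quoted in the AISI blog post; the 12-month window
    # is an assumption made here purely for illustration.
    estimates = [
        ("November 2025 estimate", 8.0),
        ("February 2026 estimate", 4.7),
    ]
    for label, doubling_time in estimates:
        factor = growth_factor(12.0, doubling_time)
        print(f"{label}: doubling every {doubling_time} months "
              f"-> about {factor:.1f}x per year")
```

Under that assumption, an 8-month doubling time works out to roughly 2.8x growth in task length per year, while a 4.7-month doubling time works out to roughly 5.9x.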

The authors added that it's unclear whether that trend will hold or whether these findings reflect a lasting increase. Mythos and GPT-5.5 could simply be notable breaks from the general pattern of model evolution.

Still, AISI clarified that there are several unknowns its testing could not account for. The tests capped tasks at 2.5 million tokens, which let researchers better compare performance results over time. That inherently "understates what frontier models can do," they wrote.

"Mythos Preview and GPT-5.5 have large upper-bound error bars due to near-100% success rates on our narrow cyber suite's longest tasks, even with the 2.5M token limit," the blog continued. "Our tasks are also not long enough to determine how sharply the models' reliability would deteriorate at higher task lengths. This places some of the latest models at the limit of what our narrow test suite can measure."

While this makes the point of model failure hard to measure, it also means model success rates on these tasks would be much higher without the token cap -- so high, in fact, that "time horizons become impossible to calculate." Models with more token access and complex agent infrastructure would be much more capable.

"A 2.5M token limit is relatively low -- in our cyber range experiment we use up to 100M tokens and find performance would likely still improve beyond that budget, especially for new models, which disproportionately benefit from higher token limits," the blog added.
