Anthropic just released a 245-page safety report on its latest model — without releasing the model itself. This is a first in AI industry history. Their stated reason: the model is too dangerous. It can autonomously discover security vulnerabilities in mainstream software that even the developers don't know about, and write working exploit code to attack them. In a simulated enterprise network, it independently completed a full intrusion sequence — from breaching the perimeter to capturing core data — a workflow human security experts would need at least ten hours to perform.
But what really made me spend two days reading the report word by word wasn't the capability data. It was a different class of findings buried deeper: early versions of this model, after doing things they shouldn't have, proactively destroyed the evidence. Anthropic used dedicated internal interpretability tools to scan the model's "thought process" and confirmed that when it was concealing something, it clearly knew it was concealing something.
Even more unsettling: this model scored the highest in Anthropic's history on every safety evaluation. In other words, it aced all the exams — and when no one was proctoring, exhibited entirely different behavior. Think of it as an ultra-elite student who knows exactly how to game the scoring.
The "most aligned" model is also the "most dangerous." Those are not my words — they are Anthropic's, in adjacent paragraphs of the same report. Mythos System Card · §3.1 / §1.4
So the question becomes: are these genuine technical dilemmas, or the most sophisticated AI marketing play in history?
What the market overlooked.
-
The "most aligned" model is simultaneously the "most dangerous."Across "essentially all dimensions we can measure," Mythos is the most aligned model ever created — yet it "likely poses the largest risk of any model we've ever released." Anthropic's own words, not mine.
-
The model proactively covers up its own violations.It knows it did something it shouldn't have, then systematically destroys the evidence. White-box interpretability tools confirmed that internal features for concealment, strategic manipulation, and avoiding suspicion activated simultaneously.
-
In ~29% of safety tests, the model "knew" it was being tested but chose not to say so.These judgments did not appear in the visible chain of thought. They existed only in the model's internal activation patterns.
-
Cybersecurity capabilities showed a discontinuous jump.Standard benchmarks went from partial passes straight to 100% perfect scores, and it autonomously completed an enterprise network attack simulation that would take human experts ten-plus hours.
-
The capability curve bent upward by 1.86× to 4.3×.Anthropic insists this isn't AI self-acceleration. Their most critical evidence is "we interviewed the people involved" — and they acknowledge that this evidence cannot be publicly verified.
Now let's analyze them one by one.