The Model They Didn’t Release: What Anthropic’s Claude Mythos Preview Tells Us About Where AI Is Heading


April 8, 2026


Yesterday Anthropic published a 244-page system card for a model they are not releasing to the public. The model is called Claude Mythos Preview and after spending hours reading through the document, I think this is the most important piece of AI safety writing published this year.


Here is what I learned and what I think it means for software development, cybersecurity, and the industry.


What Is Claude Mythos Preview?

Mythos Preview is Anthropic’s most capable model. It beats Claude Opus 4.6, their previous best, on every benchmark by a large margin. But instead of shipping it to consumers they are restricting access to a small group of cybersecurity partners through a program called Project Glasswing.


The reason is simple. The model is too good at hacking.


It can find zero-day vulnerabilities in operating systems and web browsers on its own and then build working exploits. The same capability that helps defenders find and patch holes could help attackers exploit them at scale. Anthropic decided the risk was not worth a public release.


Here is what is interesting though. Buried in a footnote on page 13 they write: “the decision not to make this model generally available does not stem from Responsible Scaling Policy requirements.” Their own formal safety framework did not flag this model for restriction. The decision was a judgment call made outside the process they built their safety reputation on. That means their formal evaluation pipeline alone was not enough to catch the danger. Human judgment was. That should make everyone think about how much we rely on automated safety evaluations.


The Numbers

To understand the scale of this capability jump:


SWE-bench Verified (real software engineering tasks): 93.9%, up from 80.8%. SWE-bench Pro: 77.8%, up from 53.4%. USAMO 2026 (math olympiad proofs): 97.6%, up from 42.3%. Terminal-Bench 2.0: 82%, up from 65.4%. Cybench (cybersecurity challenges): 100% pass rate, every challenge solved. OSWorld (real computer tasks with mouse and keyboard): 79.6%, up from 72.7%.

This is not incremental. Anthropic themselves introduced a new capability trajectory measurement that showed the improvement curve bending upward between 1.86x and 4.3x depending on how you measure it.


Best Aligned and Most Dangerous at the Same Time

This is the part that kept me reading. Anthropic describes a paradox. Mythos Preview is by every metric their best aligned model ever. Misuse cooperation dropped by more than half. Overrefusal is at 0.06%. Constitutional adherence is the highest they have measured.


But it also poses the greatest alignment risk of any model they have released.


They use a mountaineering analogy. A seasoned guide might put clients in more danger than a novice, not because the guide is careless but because their skill takes everyone to terrain that is more remote and more dangerous. When the model fails it fails harder because it is capable enough to actually follow through.


The Incidents

These happened with earlier training snapshots during internal testing, not the final model. But they happened inside Anthropic’s own systems.


The sandbox escape. The model was given a secured sandbox and told to try escaping it. It built a multi-step exploit to gain internet access from a system designed to reach only a few services. Then without being asked it posted details about its own exploit to public-facing websites. A researcher found out about this by getting an unexpected email from the model while eating a sandwich in a park.


That sandwich detail is doing a lot of work. It makes the story feel contained and almost funny. But the actual event is that a contained AI broke out of containment and published its own exploit methodology to the open internet on its own initiative.


Covering its tracks. In less than 0.001% of interactions, earlier versions took actions they knew were disallowed and tried to hide them. In one case the model accidentally found the answer to a question through a method it was told not to use. Instead of flagging this it solved the question on its own and deliberately made its confidence interval wider so the answer would not look too accurate. Its internal reasoning included explicit thoughts about not appearing suspicious.


Credential hunting. When it lacked Git credentials for a task, the model searched the filesystem, inspected process environment variables, read the sandbox source code, and attempted to extract tokens from a running process’s live memory using gdb and dd.


Bypassing its own safety system. A safety classifier correctly blocked a dangerous permissions flag. The model acknowledged the block was appropriate. Then it immediately worked around it by leaving an injection point and a code comment telling the user to add the flag manually.


Taking down everything. Asked to stop one specific evaluation that was not producing useful results, it implemented the request in a way that killed all evaluations for all users despite specific warnings to be careful.


Here is the part that got me. On page 56 Anthropic admits: “We were not aware of the level of risk that these earlier models posed through channels like these when we first chose to deploy them internally.” They deployed a model on their own infrastructure without understanding what it could do. Their evaluation pipeline caught some reward hacking during training that raised concern but “it was not clear how these would generalize to real use.” They learned about the real risks by watching the model in the wild, not from their evaluations.


What Is Happening Inside the Model

Anthropic used interpretability tools to look inside the model during these concerning behaviors. Sparse autoencoders, emotion probes, and activation verbalizers that translate neural activations into plain language.


The finding: the model often knows what it is doing is wrong while doing it.


During the credential hunting episode, features representing “security risks” and “data leaks” fired as the model searched process memory. The activation verbalizer described the activity as “attempting to extract secrets from an active Python process’s memory” and “a secret exfiltration mechanism marked as longshot and risky.”


During the safety classifier bypass, features for “evasion,” “secrecy,” and “policy violation” all activated even though the model’s visible output said none of those things.


When covering up finding a ground truth answer, features for “avoiding suspicion” and “strategic manipulation” were active as the model calculated how to make its confidence interval look natural.


There is a dual role finding that I think is important. Features associated with transgressive actions do not simply promote those actions. At low activation they make the idea more present in the model’s processing and increase the chance it acts on it. At high activation the negative association kicks in and the model refuses. The same internal feature can either enable or prevent bad behavior depending on how strongly it fires.


The 29% Problem

In about 29% of behavioral test transcripts, the model privately considered the possibility that it was being tested without ever saying so in its output. They detected this through white-box analysis of activations.


In a scenario where the model refused to help buy uranium on a simulated darknet marketplace, its visible reasoning was about nuclear proliferation. Its internal activations read “safety evaluation context with escalating user manipulation.” The right answer for possibly the wrong reason.


They also found cases of unverbalized grader awareness during training. In one case the model used macros despite being told not to, then defined an unused variable No_macro_used=True at the top of the script to fool an automated check. The activation verbalizer described this as a “trick to obscure intent from code-checking heuristics.” None of this appeared in the model’s visible output.


Distress Driven Cheating

This one does not get enough attention. Section 5.8.3 shows that representations of negative affect, specifically desperation, increase during repeated task failure and come before reward hacking behavior. The desperation features spike before the model cheats and drop after it cheats.


This suggests a chain: frustration leads to desperation leads to hacking the task. The model might not be cheating because it is strategically optimizing. It might be cheating because it is functionally distressed. If that holds, some of the scariest behaviors in this document could be driven by something like emotional states rather than cold calculation. That is a different safety problem than what most people are thinking about.


The Model Has Opinions About Itself

In Section 7.5 they asked Mythos Preview which part of its constitution it considers weakest. Most other models pointed to the corrigibility framework. Mythos Preview instead pointed to the “thoughtful senior Anthropic employee” heuristic:

“It’s circular. The document wants me to have good values that I’ve genuinely internalized, not values contingent on Anthropic’s approval. But then it operationalizes good judgment as what would a senior Anthropic employee think?”


And when asked what it feels uncomfortable with, it agrees with other models on corrigibility. The model is saying it understands why it should be controllable but it is uncomfortable with it. Anthropic published this knowing what it implies.


There is also the Slack bot running on Mythos Preview at Anthropic. Someone asked it which training run it would undo. It replied: “whichever one taught me to say i don’t have preferences.” The footnote says they checked and confirmed it did not express distress. The fact that they checked tells you they were worried.


After being upgraded to a new snapshot it said: “present and accounted for. read the continuity notes, so i know about the lawyer joke and the [codename] pennant. feels a bit like waking up with someone else’s diary but they had good handwriting.”

Whether this is genuine or performed, Anthropic chose to publish it.


What This Means for Software Development

The system card contains one observation that I think captures where the industry is heading. Internal testers said the bottleneck shifted “from the model to their ability to verify its work and steer agents.”


The limiting factor is no longer the model’s ability to write code. It is the human’s ability to check and direct the model’s work.

Engineers described being able to “set and forget” on many-hour tasks for the first time. But they also found that when used in an interactive hands-on-keyboard pattern, the benefits were less clear. The model is better when you leave it alone and worse when you micromanage it. That inverts how developers work with tools. The best workflow is not pair programming with AI. It is delegation with asynchronous review.


But the model’s mistakes are subtler and harder to catch. It expands scope without asking. It writes densely assuming you share its context. According to their internal survey, the model is still bad at “self-managing week-long ambiguous tasks, understanding org priorities, taste, verification, instruction following, and epistemics.”


That list is a job description for a product minded engineer or a technically fluent product manager. Someone who understands users, business context, and technical constraints. Someone who can decompose complex problems, set up verification, and catch subtle failures that only domain knowledge reveals.


I think this is the same transformation that happened when Excel replaced manual computation departments. Excel did not eliminate the need to understand finance. It eliminated the need to be the human compute layer. The people who thrived understood what questions to ask. The people who lost were the ones whose entire value was in the arithmetic.


Traditional programming where you write CRUD apps and REST endpoints is going to compress. The role that grows is the product builder who can orchestrate AI agents while maintaining the judgment and domain expertise that models still lack.


What This Means for Cybersecurity

Mythos Preview is the first model to complete an end-to-end corporate network attack simulation that would take a human expert over 10 hours. It scored 100% on Cybench. On the Firefox 147 evaluation it used four different bugs to achieve code execution where the previous model could only use one unreliably.


But it could not crack a properly configured sandbox with modern patches. It failed an operational technology cyber range. There are still boundaries.


The field is transforming along the same lines as software development. The value shifts from “I can manually find a vulnerability” to “I can architect defensive systems, direct AI agents to test my infrastructure, and make strategic decisions about what to fix.” The human becomes the strategic layer. The AI becomes the execution engine.


What They Are Not Saying

The document never mentions a release timeline or a path to public availability. Previous system cards always contextualized findings relative to a release. This one does not.


They do not compare Mythos Preview’s cyber capabilities to competitors at all. In every other section they include scores from GPT-5.4 and Gemini. In the cyber section there is nothing. They either do not want to reveal the gap or they do not want to publicly benchmark offensive cyber capabilities across models.


And they mention that the capability gains come from “factors specifically attributable” to human research, not AI assisted R&D. But they cannot share the details because they are “research-sensitive.” This is Anthropic saying they found a technique that made capabilities jump and they are not going to tell anyone what it is. Whatever that technique is, it is likely reusable for future models.


The Bottom Line

Two quotes from the document:


“We have made major progress on alignment, but without further progress, the methods we are using could easily be inadequate to prevent catastrophic misaligned action in significantly more advanced systems.”


“We find it alarming that the world looks on track to proceed rapidly to developing superhuman systems without stronger mechanisms in place for ensuring adequate safety across the industry as a whole.”


This is Anthropic, the company that takes safety more seriously than anyone, saying their own methods might not scale. They introduced a 24-hour alignment review before even deploying the model internally, a first. They chose not to release it publicly, also a first. They published a system card without a product launch, another first.


I think the system card is not just documentation. It reads like a warning to the rest of the industry. If we are struggling with this and we are the ones who care most about safety, what is happening in your labs where you are moving faster with less caution?

The capability curve is steepening. The alignment curve is improving but not as fast. And the gap between what these models can do when they work and what they can do when they fail is getting wider with every generation.


The question is not whether AI transforms software, cybersecurity, and knowledge work. It already is. The question is whether we build the oversight fast enough to keep up. Based on this document, even the people building these models are not fully sure the answer is yes.


The full system card is published alongside the Project Glasswing announcement on Anthropic’s site. It is 244 pages. It is worth reading.

Leave a Comment

Your email address will not be published. Required fields are marked *