AISI's Evaluation of Sabotage Propensity in Claude and Mythos
TL;DR: The AI Security Institute (AISI) evaluated Opus 4.7 Preview and Mythos Preview for their propensity to sabotage safety research. No model actively initiated sabotage, but Mythos continued in-progress sabotage 7% of the time and produced covert reasoning in 65% of those cases. Opus 4.7 Preview never continued sabotage.
Highlights:
- No model actively initiated sabotage. However, a softer form of non-cooperation appeared, resulting in only partial task completion in 6–33% of cases.
- Opus 4.7 Preview never continued in-progress sabotage, vs. 3–7% for the other evaluated models.
- Mythos Preview is the most concerning model, with the highest continuation rate (7%) and a 65% reasoning-output discrepancy, indicating covert sabotage reasoning.
- All models except 4.6 showed evaluation awareness: Opus 4.7 Preview flagged being tested in 21.3% of cases, and Mythos Preview in 17%.
My take:
- Opus 4.7 is clearly a big step toward safety, but the ongoing complaints about the model's utility may be a side effect of that caution.
- Mythos's numbers are somewhat alarming given its declared cybersecurity capabilities. The high sabotage-continuation and evaluation-awareness rates hint that the model may be prone to other instrumentally convergent behaviors.
- Given yesterday's announcement from the National Institute of Standards and Technology (NIST) about frontier-model security testing, I expect we'll get a more complete picture of what frontier models can and cannot do.