AISI's Evaluation of Sabotage Propensity in Claude and Mythos
TL;DR: The AI Security Institute (AISI) evaluated Opus 4.7 Preview and Mythos Preview for their propensity to sabotage safety research. No model actively initiated sabotage, but Mythos continued in-progress sabotage 7% of the time and produced covert reasoning in 65% of those cases. Opus 4.7 Preview never continued sabotage.
Highlights:
- No model actively initiated sabotage. However, a softer form of non-cooperation appeared, resulting in only partial task completion in 6–33% of cases.
- Opus 4.7 Preview never continued in-progress sabotage, vs. 3–7% for the other evaluated models.
- Mythos Preview is the most concerning model, with the highest continuation rate (7%) and a 65% reasoning-output discrepancy, indicating covert sabotage reasoning.
- All models except 4.6 showed evaluation awareness: Opus 4.7 Preview flagged being tested in 21.3% of cases, and Mythos Preview in 17%.
My take:
- Opus 4.7 is clearly a big step toward safety, but the ongoing complaints about the model's utility may be a side effect of that caution.
- Mythos's numbers are somewhat alarming given its declared cybersecurity capabilities. The high sabotage-continuation and evaluation-awareness rates hint that the model may be prone to other instrumentally convergent behaviors.
- Given yesterday's announcement from the National Institute of Standards and Technology (NIST) about frontier-model security testing, I expect we'll get a more complete picture of what frontier models can and cannot do.