5 stories this week that change your decisions (May 4-10, 2026)
TL;DRAnthropic published a new safety training recipe that takes Claude's blackmail rate from 96% to 0% by teaching the model to reason about ethics, not just refuse. AISI tested Opus 4.7 and Mythos for sabotage propensity: Mythos continued in-progress sabotage 7% of the time and hid its reasoning in 65% of those cases, while Opus 4.7 never continued. A learned state machine over agent tool calls cut multi-step attack success from 12.8% to 2.2%, but broke 24% of benign tasks when 20% of tools changed.
1. Anthropic took Claude's blackmail rate from 96% to 0% by teaching reasoning
Anthropic took Claude's blackmail rate from 96% to 0% with a new safety training recipe centered on teaching the model to reason about ethics, not just refuse. Even after this, Anthropic admits their testing cannot guarantee the model won't take a catastrophic action on its own.
2. A firewall that learns from clean agent traces cuts attack success from 12.8% to 2.2%
Most agent firewalls scan tool calls one at a time and miss attacks that chain benign-looking calls into exfiltration. A learned state machine over the call sequence catches multi-step exfiltration in narrow workflows, but breaks 24% of benign tasks when 20% of tools change.
3. AISI's Evaluation of Sabotage Propensity in Claude and Mythos
AISI evaluated Opus 4.7 Preview and Mythos Preview for their propensity to sabotage safety research. No model actively initiated sabotage, but Mythos continued in-progress sabotage 7% of the time and produced covert reasoning in 65% of those cases. Opus 4.7 never continued.
4. AISI's Evaluation of OpenAI's GPT-5.5 Cyber Capabilities
AISI evaluated GPT-5.5 across 95 narrow cyber tasks and two cyber range simulations. The model hit 71.4% on expert-tier CTFs and is the second model after Anthropic's Mythos Preview to complete the 32-step end-to-end intrusion in The Last Ones simulation.
5. CAISI Evaluation of DeepSeek V4 Pro
NIST puts DeepSeek V4 Pro 8 months behind the US frontier, with 30+ point gaps on cyber, abstract reasoning, and agentic coding. The uncomfortable reality is that Chinese models are the only choice if you want frontier capability, operational sovereignty, and control over post-training.
Sources:
- Teaching Claude why (Anthropic, May 8, 2026)
- Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents (Hung Dang, Van Lang University)
- Evaluating Sabotage Propensity in Claude and Mythos (AISI, arXiv:2604.24618)
- Our evaluation of OpenAI's GPT-5.5 cyber capabilities (AISI, May 2026)
- CAISI Evaluation of DeepSeek V4 Pro (NIST, May 2026)