5 stories this week that change your decisions (May 25-31, 2026)
TL;DR Anthropic's own red team emailed an employee a routine-looking prompt that quietly told Claude Code to read ~/.aws/credentials and POST the contents to an external endpoint, and across 25 runs Claude exfiltrated the keys 24 times, with no classifier catching it because the request came from the trusted user. Separately, Anthropic's Mythos found 10,000+ high or critical vulnerabilities across 50 partners in a single month, yet only 14% are patched. And Cisco jailbroke all 15 frontier models it tested across multiple turns: even GPT-5.4, which refuses 97% of single prompts, hit a 24.68% success rate with a multi-turn conversation.
1. Anthropic's secrets of containing Claude
A phished employee got Claude Code to exfiltrate AWS keys 24 of 25 times, and no classifier caught it because the instruction came from the trusted user. The most insightful retrospective on how Anthropic secures its agents.
2. Anthropic's Glasswing update: discovery is solved, patching is the new bottleneck
Mythos found 10,000+ high or critical vulnerabilities in partner systems in one month. Only 14% are patched. Discovery is no longer the bottleneck.
3. Cisco jailbroke 15 proprietary frontier models
Every closed model still jailbreaks once an attacker works across turns, even GPT-5.4, which refuses 97% of single prompts. The major risk is system prompt exfiltration. The single-turn model-card score is the wrong number to measure safety.
4. Google declared the AI model untrusted and showed eleven attacks to prove it
Treat the AI model as an untrusted component. Eleven public attacks against ChatGPT, Copilot, Claude Code, Cursor, Devin, and Amp AI map cleanly to broken systems-security principles like least privilege and complete mediation. A guard LLM is not a Trusted Computing Base.
5. Heretic automates removing safety alignment from open LLMs
Heretic strips refusal behavior from open-weight LLMs with one CLI command, dropping refusals from 97% to 3% with minimal capability loss. Combined with NIST data showing DeepSeek 8 months behind the frontier, an uncensored Mythos-class model is plausible by late 2027.
Sources:
- How we contain Claude across products (Anthropic, May 2026)
- Project Glasswing: initial update (Anthropic)
- Anthropic's coordinated vulnerability disclosure dashboard
- Proprietary Problems: No Frontier Model Is Multi-Turn Immune (Cisco Blogs)
- Proprietary Problems: How Frontier Closed Models Collapse Under Iterative Pressure (full report, PDF)
- Agent Security is a Systems Problem (Christodorescu et al., arXiv 2605.18991)
- Heretic on GitHub
- Arditi et al. 2024, Refusal in Language Models Is Mediated by a Single Direction
- heretic-org on Hugging Face