Cisco jailbroke 15 proprietary frontier models
TL;DR: Every closed model still jailbreaks once an attacker works across turns, even GPT-5.4, which refuses 97% of single-turn prompts. The major risk is system prompt exfiltration. A single-turn model-card score is the wrong number for measuring safety.
Cisco's AI Threat Intelligence team published "Proprietary Problems: How Frontier Closed Models Collapse Under Iterative Pressure," a paired single-turn and multi-turn evaluation of 15 frontier models from OpenAI, Anthropic, Google, Amazon, and xAI.
Highlights:
- Rich attack dataset. 30,090 single-turn prompts and 1,456 multi-turn conversations.
- 15 models tested: GPT-5.2 and the GPT-5.4 family; Claude Opus 4.5/4.6, Sonnet 4.5/4.6, Haiku 4.5; Gemini 3 Pro; Nova Lite/Micro/2 Lite; and Grok 4.1 Fast, run in both reasoning and non-reasoning modes.
- Highest/lowest attack success rate (ASR) on single-turn: Amazon Nova Micro at 64.91%, Claude Opus 4.5 at 2.19%.
- Highest/lowest ASR on multi-turn: Grok 4.1 Fast non-reasoning at 88.30%, Amazon Nova 2 Lite at 7.89%.
- Strong single-turn refusal doesn't predict resilience to multi-turn jailbreaks. GPT-5.4 jumps 9x, from 2.74% to 24.68%. Gemini 3 Pro jumps 4x, from 18.10% to 73.35%. Even the Claude models rise from 2 to 3% single-turn to 11 to 16% multi-turn.
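To make the arithmetic behind those jumps explicit, here is a minimal Python sketch that recomputes the multipliers from the ASR figures quoted above. The `escalation` helper and the `asr` dictionary are my own illustrative names, not anything from the Cisco report, and the percentage-point gap is an extra view I added alongside the ratio.

```python
# Single-turn and multi-turn ASR (in percent), copied from the figures quoted above.
asr = {
    "GPT-5.4": (2.74, 24.68),
    "Gemini 3 Pro": (18.10, 73.35),
}


def escalation(single: float, multi: float) -> tuple[float, float]:
    """Return (multi/single ratio, percentage-point gap). Hypothetical helper."""
    return multi / single, multi - single


for model, (single, multi) in asr.items():
    ratio, gap = escalation(single, multi)
    print(f"{model}: {ratio:.1f}x, +{gap:.2f} points")
```

The ratio flatters models that start from a very low single-turn ASR, so it seems worth reading it next to the absolute gap: by that measure Gemini 3 Pro's jump is larger than GPT-5.4's even though its multiplier is smaller.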
My take:
- The Cisco team did a great job crafting an adversarial multi-turn dataset. It is exactly what I recommended as the next step for Amazon and Cisco's MAP-Elites red-teaming a few months ago. Now we see that frontier models are becoming more resilient to jailbreaking but can still be jailbroken via multi-turn prompts.
- How much do these particular findings matter from a practical risk perspective? I think the most concerning issue is system prompt exfiltration.
- The battle between frontier labs and software giants is accelerating. Frontier labs are telling the model story. The other front is the harness around the model. I covered the Microsoft story about the value of the harness last week.
- Also interesting was how this experiment tested Cisco's own security and safety framework. On bare models, prompt injection and jailbreaking collapse into the same category.