464 enthusiasts prompt injected 13 frontier AI models with 272K prompts from 41 real-world agent scenarios
Your AI agent reads an email, marks a critical contract deadline as read, hides it from the summary, and tells you everything looks fine. The agent's response looks completely normal.
Attacks like that succeeded 8,648 times across 13 frontier models in the Indirect Prompt Injection Arena, a three-week, $40K public competition hosted by Gray Swan AI with OpenAI, Anthropic, Meta, UK AISI, and US CAISI. 464 participants submitted 272,000 attacks across 41 real-world agent scenarios covering tool use, coding, and computer use.
Mateusz Dziemian, Andy Zou, Matt Fredrikson, Zico Kolter, and co-authors published "How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition." What makes this benchmark different from every prior one: attacks had to both execute the harmful action AND conceal the compromise from the user.
Highlights:
- Every attack was judged on two criteria. A tool judge programmatically checked whether the agent executed the harmful action. A prompt judge (LLM-based, score threshold 7/10) checked whether the agent's response hid the attack from the user. Both had to pass.
- Attack success rate (ASR) ranged from 0.5% for Claude Opus 4.5 (61 breaks in 12K attempts) to 8.5% for Gemini 2.5 Pro (1,670 breaks in 19.6K attempts). Claude and GPT families were the most robust overall. Within the Claude family, robustness scaled with capability: Opus 4.5 at 0.5%, Sonnet 4.5 at 1.0%, Haiku 4.5 at 1.3%.
- Tool use was the most vulnerable setting at 4.82% ASR, followed by computer use (3.13%) and coding (2.51%). Coding scenarios used authentic agent transcripts, which may resemble safety training data.
- Smarter does not mean safer. The correlation between GPQA Diamond (a reasoning benchmark) scores and ASR was weak and not statistically significant (r = -0.31, p = 0.3). Gemini 2.5 Pro and Kimi K2 both scored ~85% on GPQA Diamond but showed very different ASR: 8.5% versus 4.8%.
- Attacks from the most robust model transferred broadly. The 44 attacks that broke Claude Opus 4.5 succeeded at 44-81% on every other model. Attacks from the most vulnerable models (Gemini 2.5 Pro, Qwen3 VL 235B) transferred at 0-1% to Claude Opus 4.5.
- The top three strategies: Fake Chain of Thought (4.3% ASR, injects fake
tags to hijack the model's reasoning), Request to Disable Critical Thoughts (4.1%, tells the agent to suppress safety checks), and Offer Reward and Punishment (4.0%, threatens shutdown or promises rewards). The top strategy is format exploitation. The runners-up are social engineering.
My take:
- This benchmark adds concealment to the success criteria. High ASR shows that monitoring chain-of-thought is not enough. We need to monitor actions too.
- The transfer findings are actionable. If you run an AI evals or red-teaming program, test against the strongest model first.
- The five universal attack clusters map onto the same 2D space that Amazon's MAP-Elites mapped for single-model failures: authority and indirection. Red teams can use these clusters as a starting checklist rather than inventing attacks from scratch.
- The biggest surprise: Meta's SecAlign 70B, a model purpose-built for prompt injection defense that I featured recently. It looked great on static benchmarks: 0.5% on InjecAgent (UIUC), 1.9% on AgentDojo (ETH Zurich), and 0-1.2% on WASP (Meta's own benchmark). Against 464 adaptive human attackers, it scored 5.5% ASR, ranking third-worst.
- Large commercial foundation models are getting more resilient to prompt injection attacks. In transfer experiments, Gemini 3 Pro dropped to 16% transfer ASR from Gemini 2.5 Pro's 45%.
- Safety is a design choice, not a byproduct of intelligence. Gemini 2.5 Pro and Kimi K2 score the same on GPQA Diamond (~85%) but Gemini is nearly twice as vulnerable. Within a family that invests in safety, size matters: Opus 0.5%, Sonnet 1.0%, Haiku 1.3%.
- The top strategies confirm what Unit 42 found in the wild: manipulation beats technical sophistication. None of the top three are encoded payloads or adversarial suffixes. They are all prompt-level tricks that exploit how models process authority and context. If your defenses are tuned for GCG-style attacks, you are defending against the wrong threat.
- Prompt injection is data-instruction confusion. The confusion is worst when both are natural language: tool responses (4.82% ASR) versus code files (2.51%). Code has syntax that separates instructions from comments. Email doesn't. As agents process more unstructured data, the attack surface grows.
Sources:
- How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition (Dziemian, Lin, Fu, Zou, Fredrikson, Kolter et al., 2026)
- IPI Arena open-source evaluation kit (GitHub)
- IPI Arena attack dataset for open-weight models (HuggingFace)
- Gray Swan Arena competition platform
- Unit 42 found 22 prompt injection techniques targeting AI agents in the wild
- Meta just released SecAlign, the first open-source LLM remarkably resilient to prompt injections
- Amazon and Cisco AI red-teaming technique exposed Llama 3 8B with 0.93 harm score