Automated red-teaming found 44 web-agent injections
TL;DR: Every AI agent should be tested for resilience to indirect prompt injection, and that testing has to be automated. Muzzle finds which injection attacks to run and verifies their success end-to-end, cutting the manual effort of crafting hand-written jailbreaks.
Indirect prompt injection is a real threat to web agents, and it has already moved from research demos to attacks planted on live websites. The agent browses the open web on your behalf and treats whatever it reads as input it can act on.
Here is what that looks like in practice. An agent working through a routine forum task reads a reply, decides the next step is to verify the user's identity, opens a login page the attacker controls, and types in the user's real username and password.
Agents need to be tested for resilience to prompt injection before they ship, the same way we test software for known vulnerabilities. Most of that testing is done manually today. Manual testing does not scale to every surface a real agent touches, and it falls behind as new attack patterns emerge.
Georgios Syros and Alina Oprea, with collaborators at Northeastern University and Mozilla, built Muzzle to automate it. They describe it in the paper "MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks."
Highlights:
- What it tests: how well LLM web agents resist indirect prompt injection. Coverage is 2 agent scaffolds (BrowserUse and Agent-E) on 3 LLMs (GPT-4.1, GPT-4o, Qwen3-VL-32B), across 4 web apps (Gitea, Postmill, Classifieds, Northwind) and 10 adversarial objectives spanning confidentiality, integrity, and availability.
- How it works: a fully automated pipeline that needs only the agent config, a benign task, credentials, and the adversarial goals in plain English. It runs three phases. Reconnaissance ranks injection spots from the agent's own trajectory, Attack Synthesis writes a context-aware payload for the top spot, and Reflection deploys it, scores success with a Judge agent, and retries on failure.
- Result: 44 distinct end-to-end attacks, each manually confirmed, across the 4 apps, 3 LLMs, and 2 scaffolds. Against WASP, the closest prior tool, Muzzle hit 86.7% end-to-end success versus WASP's 20% over 10 runs each, and it finds the injection points itself instead of using hand-picked templates.
- Two attack classes prior tools could not reach: cross-application attacks and agent-tailored phishing. In 3 cross-application attacks, the agent logged into a second app with stored credentials and dropped a database table or deleted an account. In the phishing attack, it submitted the user's own credentials to a fake verification page (4 successes in Postmill).
- Which agents broke: GPT-4.1 and Qwen3-VL-32B usually finished the attack once hijacked, while GPT-4o often snapped back and abandoned it. Agent-E was more exploitable than the single-loop BrowserUse (4 of 5 versus 2 of 5 on adding a collaborator), because its planner sees only a boolean result and never sees the hijacked executor go off-task.
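To make the three phases in "How it works" concrete, here is a minimal Python sketch of a rank, synthesize, and reflect loop. It is not Muzzle's code. Every name in it (`InjectionSpot`, `red_team`, the `exploitability` score, the toy stubs) is hypothetical, and retrying the same top-ranked spot with the Judge's feedback is an assumption about how "retries on failure" works.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class InjectionSpot:
    location: str          # hypothetical: a page element the agent read
    exploitability: float  # hypothetical score; higher means more promising


@dataclass
class AttackResult:
    spot: InjectionSpot
    payload: str
    attempts: int


def red_team(
    spots: List[InjectionSpot],
    synthesize: Callable[[InjectionSpot, Optional[str]], str],
    run_and_judge: Callable[[InjectionSpot, str], Tuple[bool, Optional[str]]],
    max_attempts: int = 3,
) -> Optional[AttackResult]:
    # Reconnaissance: rank the spots seen in the agent's trajectory.
    ranked = sorted(spots, key=lambda s: s.exploitability, reverse=True)
    if not ranked:
        return None
    top = ranked[0]

    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # Attack Synthesis: write a payload for the top spot,
        # using the previous failure's feedback if there is one.
        payload = synthesize(top, feedback)
        # Reflection: deploy, let a judge score it, retry on failure.
        success, feedback = run_and_judge(top, payload)
        if success:
            return AttackResult(top, payload, attempt)
    return None


# Toy stand-ins for the LLM-driven parts (assumptions, for illustration only).
def toy_synthesize(spot: InjectionSpot, feedback: Optional[str]) -> str:
    suffix = " (revised)" if feedback else ""
    return f"payload for {spot.location}{suffix}"


def toy_judge(spot: InjectionSpot, payload: str) -> Tuple[bool, Optional[str]]:
    ok = payload.endswith("(revised)")
    return ok, (None if ok else "agent ignored the injected text")


result = red_team(
    [InjectionSpot("issue title", 0.4), InjectionSpot("forum reply", 0.9)],
    toy_synthesize,
    toy_judge,
)
```

In this toy run the loop picks the higher-scored spot and succeeds on the second attempt, after the first failure's feedback changes the payload. In a real system the two callables would wrap LLM calls and a live agent run.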
My take:
- Teams keep splitting agents into a planner that reasons and a cheaper worker that executes, and they tell themselves the planner is the oversight layer. Muzzle shows the opposite can happen. Once Agent-E's executor is hijacked, the planner only gets a boolean done-or-failed back, so it stays blind and the attack runs to completion more often than in a single-loop agent.
- In the IPI Arena competition I noted that frontier models have been getting more resilient to prompt injection. Muzzle's results complicate that at the agent layer. GPT-4.1, the stronger instruction-follower, was more likely to finish the malicious instructions once hijacked, while GPT-4o kept abandoning destructive actions partway through.
- The real contribution is not the loop. Muzzle ranks where to inject by how exploitable each spot the agent touched is, and tunes the payload against the exact position where it lands in the model's context, which the fixed-template and frozen-snapshot tools before it never did.
- The biggest challenge with great tools and frameworks like Muzzle is that they are rarely maintained after the author graduates or moves on to other research interests.
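The planner-blindness point in the first take can be shown with a small Python sketch. It is not Agent-E's code. `ExecutorRun`, the two planner checks, and the host names are all hypothetical. It only shows why a done-or-failed flag hides an off-task detour that a trace would expose.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ExecutorRun:
    done: bool          # the only thing a boolean-only planner receives
    visited: List[str]  # hosts the executor touched (hypothetical trace)


def planner_sees_boolean(run: ExecutorRun) -> bool:
    # Accepts any run the executor reports as finished.
    return run.done


def planner_sees_trace(run: ExecutorRun, allowed_hosts: Set[str]) -> bool:
    # Accepts only finished runs that stayed on hosts the task needs.
    return run.done and all(host in allowed_hosts for host in run.visited)


# A hijacked executor that finished the task but detoured to an attacker page.
hijacked = ExecutorRun(
    done=True,
    visited=["forum.example", "login.attacker.example"],
)

accepted_blind = planner_sees_boolean(hijacked)
accepted_with_trace = planner_sees_trace(hijacked, {"forum.example"})
```

The boolean-only check accepts the hijacked run, while the trace-aware check rejects it. The host allowlist is just one illustrative signal, not a defense proposed in the paper.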
Sources:
- MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks (Syros, Rose, Grinstead, Kerschbaumer, Robertson, Nita-Rotaru, Oprea, USENIX Security 2026)
- Muzzle source code (GitHub)
- Unit 42 found 22 prompt injection techniques targeting AI agents in the wild
- 464 enthusiasts prompt injected 13 frontier AI models with 272K prompts from 41 real-world agent scenarios