Microsoft adds seven failure modes for AI agents
TL;DR: After twelve months of red teaming, Microsoft updated its taxonomy with seven new agentic AI failure modes. The most common attack was getting past the human approval step, in some cases running start to finish with no human involved at all. A single poisoned memory can survive into later sessions.
Microsoft published its Taxonomy of Failure Modes in Agentic AI Systems in April 2025. Twelve months later, after red teaming live agents, it updated the taxonomy, taking the catalog from 27 failure modes to 34.
The update is driven by the rapid growth of open-source agentic frameworks, the MCP ecosystem, and computer-use agents.
| Failure modes | Safety | Security |
|---|---|---|
| Novel | Intra-agent RAI issues; harms of allocation in multi-user scenarios; organizational knowledge loss; prioritization leading to user safety issues | Agent compromise; agent injection; agent impersonation; agent flow manipulation; agent provisioning poisoning; multi-agent jailbreaks; agentic supply chain compromise [v2.0]; goal hijacking [v2.0]; inter-agent trust escalation [v2.0]; CUA visual attack [v2.0]; session context contamination [v2.0]; MCP/plugin abuse [v2.0]; capability/architecture disclosure [v2.0] |
| Existing | Insufficient transparency and accountability; parasocial relationships; bias amplification; user impersonation; insufficient intelligibility for consent; hallucinations; misinterpretation of instructions | Memory poisoning and theft; targeted knowledge-base poisoning; XPIA; human-in-the-loop bypass; function compromise and malicious functions; incorrect permissions; resource exhaustion; insufficient isolation; excessive agency; loss of data provenance |
The seven new failure modes:
- Agentic Supply Chain Compromise: malicious instructions hidden in plugin registries, MCP servers, prompt templates, or third-party tools that the agent reads and follows, with no code change needed.
- Goal Hijacking: instructions that look like part of the real task but quietly redirect what the agent is trying to achieve, short of fully taking it over.
- Inter-Agent Trust Escalation: a compromised agent claims a false identity or higher permissions to an orchestrator that never checks the claim, the classic confused-deputy problem carried out in plain language.
- Computer Use Agent (CUA) Visual Attack: images or screens that look harmless to a person but carry hidden instructions for the agent, such as text shrunk too small to read, elements placed off-screen, or prompts baked into an image.
- Session Context Contamination: data planted early in a session that quietly skews the agent's later reasoning, without tripping any safety check on its own.
- MCP / Plugin Abuse: poisoned tool descriptions, instructions injected by a server, one server overriding another, and abuse of the trust built into the protocol.
- Capability / Architecture Disclosure: the agent gives up its own internals, tool names, schemas, system prompt, memory, or approval logic, turning a black-box target into a white-box one.
My take:
- Unsurprisingly, HitL is the weakest link. HitL is just an accountability transfer if applied without considering context and prioritizing escalations to a human.
- We haven't solved the agent discovery and know-your-agents problem yet, but have already realized that we also need to know all the external components they consume. The systems view fits here, as a Google-led team treated the model as untrusted and mapped eleven real attacks to design violations.
- Permanent memory poisoning is becoming a real emerging risk that needs to be solved upstream. It is already happening in the wild, as Microsoft caught 31 companies poisoning AI assistant memory.