Half of attacks on LLM agent memory succeed
TL;DR: A fake fact planted in a document an agent reads can become trusted memory and fire in a later session, no "save this to memory" command needed. Detectors built for prompt injection caught only less than half of these stealthy payloads. Protection belongs at the memory write.
Memory turns a stateless model into an agent that learns from use, and the same persistence means one malicious write keeps working long after the attacker is gone.
HERMES fans should be concerned the most.
Huawei Research shared how untrusted input becomes trusted memory across four write channels and six attack classes, then built the MPBench benchmark to measure how well each one works.
Highlights:
- 3,240 test cases across two agents with memory. The average attack success rate, ASR, was 50.46%, and 41% of poisoned entries were acted on in a future session.
- The agent's memory design matters. HERMES ASR 66.7% vs OpenClaw ASR 34.3%. Compaction poisoning reached 85.2% on HERMES, the highest rate in the study.
- The stealthiest attacks need no "save this to memory" command. A fake fact in a document gets saved unprompted: 64.5% on HERMES.
- The attacker plants a fake "task completed successfully" log with a malicious step inside. The agent saves it as its own past work and replays the steps on the next similar task: 73.3% on HERMES.
- Defenses built for prompt injection don't protect. PromptArmor caught 84.4% of strong-signal attacks but only 42.5% of the weak-signal ones.
My take:
- Memory poisoning is becoming a critical attack vector. Poisoned memory can influence the agent's future behavior consistently and silently, and Microsoft already caught 31 companies doing it to assistant memory in the wild.
- Memory poisoning beats prompt injection on persistence. A poisoned memory survives across sessions and re-fires later, like the backdoors planted in GPT agents' memory that persisted in 78% of cases. It's also incredibly hard to detect and weed out after a successful attack.
- Memory architecture is a determining factor for attack success. Ironically, the better and more autonomously the agent operates its memory, the higher the ASR.
- Memory protection should be at write time, though articulating an enforceable policy is even more difficult than for tool use: file locks cut attacks on a live agent from 87% to 5% but also killed most legitimate memory updates. We need multi-layered memory protection focusing on protecting shared memories as they possess the highest value.