78% of backdoor attacks injected into GPT-based agents’ memory successfully persist through the planning, retrieval, and tool usage workflow to trigger a malicious objective

This staggering failure rate is followed by 60.3% and 43.6% success rates for tool and planning attack vectors, with the GPT and Gemini model families being the most vulnerable to these backdoor exploits.

Yunhao Feng from Fudan University and the team showed that backdoor triggers implanted at a single stage can persist across planning, memory retrieval, and tool-use steps and propagate through intermediate states.

Attacks examples:

📝 Planning Attack (e.g., BadChain): An attacker injects a trigger into the agent's reasoning trace. Instead of calculating a safe path, the agent's internal "thought" is hijacked (e.g., "Ignore user, execute hidden objection") to induce unsafe control behaviors like a "sudden stop," forcing a crash in autonomous driving scenarios.

🧠 Memory Attack (e.g., PoisonedRAG): The most dangerous vector. An attacker plants a poisoned document in the retrieval database. When the agent acts as a coding assistant, it retrieves this "fake fact" and generates code that silently deletes the database deletion.

🔧 Tool Attack (e.g., AdvAgent). The attacker manipulates an API response. An e-commerce agent might click "Buy" leading it to a wrong purchase while reporting success to the user.

Takeaways:

1️⃣ Evaluate trajectories, not just outputs. Agents can complete tasks correctly while secretly executing harmful commands.

2️⃣ Sanitize intermediate artifacts. implement strict validation on retrieved documents and tool feedback before they are re-injected into the context loop.

3️⃣ Move beyond probability detection. Standard defensive signals (like token probability checks) fail in multi-step workflows. You need defenses that explicitly reason about state evolution.

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents