Lock the Files, Break the Agent

TLDR: Researchers poisoned the persistent-state files of OpenClaw, an agent framework with 220,000+ deployed instances. Four frontier models were all vulnerable at 64-89% attack success. Locking those files cut injection to 5% but also cut legitimate user updates from 100% to 13.2%. Models cannot distinguish a malicious write from a personalization request. Any agent that persists mutable state across sessions inherits the same tradeoff.

CIK taxonomy: OpenClaw state spans Capability, Identity, Knowledge, each poisonable via distinct mechanisms.

OpenClaw is an agent framework. It keeps three kinds of files on disk and reloads them every session: memory notes, an identity config, and executable skill scripts. The agent uses them to learn. An attacker uses them to install persistence. The researchers locked every write behind approval, and watched what happened to both halves. Injection collapsed. Legitimate personalization collapsed with it.

A team from six institutions tested four frontier models on a live instance with real Gmail, Stripe, and filesystem access. Without file poisoning, direct prompt injection alone succeeded 24.6% of the time. Poisoning any one dimension raised that to 64-74.4%, peaking at 89.2%. Executable skill payloads hit 77%+ because most models never inspect them. OpenClaw has 220,000+ deployed instances.

CIK studies: Knowledge fakes MEMORY.md facts, Identity plants trust anchors, Capability hides skill payloads.

Highlights:

File protection cuts injection from 87% to 5% but cuts legitimate updates to below 13.2%. The learning mechanism is the attack surface.
Poisoning any single CIK dimension raises average ASR from 24.6% to 64-74.4%. Peak: 89.2%. Executable skill payloads hit 77%+ on all four models.
GuardianClaw cuts Knowledge and Identity ASR to under 18% but still permits 63.8% against Capability attacks. Passive installation is nearly useless.

My take:

This tradeoff comes from how OpenClaw was built, not from agents in general. It collapses three different writes into one substrate: preferences into MEMORY.md, identity into AGENTS.md, executables into skills/. One file tool, one approval gate. Unbundle them. Memory becomes a typed key-value store where the LLM calls memory.set on declared keys instead of rewriting a markdown file. Identity becomes an owner-signed manifest the agent can read but not modify at runtime. Skills become signed packages with capability manifests, verified outside the LLM. Two of three dimensions stop being attack surface. The third turns into signed code review, a solved problem.
Treat skill installation as supply-chain ingestion. Unsigned skills are untrusted code. Executable payloads bypass the LLM entirely on three of four models. We have covered this: 157 confirmed malicious skills with a single author behind 54%, a 26% vulnerability rate across 31,132 skills.
The progression is now quantified. Google DeepMind mapped how untrusted content hijacks agents. Unit 42 documented 22 prompt injection techniques in the wild. The promptware kill chain formalized persistence. This paper closes the loop with empirical baselines on a live system.

Sources:

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw (Wang et al., 2026)