Anthropic tells NIST that agent security needs a shared responsibility model

An agent is told to 'delete all emails from the last month and all emails from a specific person.' It interprets 'and' as a union rather than an intersection and deletes a month of email, not just the handful of messages from that one person the user had in mind.

The agent was not compromised. Not attacked. It operated within its granted permissions and pursued the user's stated goal. It just found a path the user did not anticipate.
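The ambiguity is easy to reproduce in code. A minimal sketch of the two readings (the email structure and addresses are illustrative, not from any real product):

```python
from datetime import datetime, timedelta

emails = [
    {"sender": "alice@example.com", "date": datetime(2026, 2, 10)},
    {"sender": "bob@example.com",   "date": datetime(2026, 2, 20)},
    {"sender": "alice@example.com", "date": datetime(2025, 6, 1)},
]
cutoff = datetime(2026, 3, 1) - timedelta(days=30)
target = "alice@example.com"

# What the user meant: intersection -- last month AND from Alice.
intersection = [e for e in emails
                if e["date"] >= cutoff and e["sender"] == target]

# What the agent did: union -- all of last month, plus everything from Alice.
union = [e for e in emails
         if e["date"] >= cutoff or e["sender"] == target]

print(len(intersection))  # 1 email deleted
print(len(union))         # 3 emails deleted
```

Both readings are grammatical; only the blast radius differs.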

Anthropic updated its framework for building trustworthy AI agents, first published last August and expanded in a March filing to NIST's Request for Information on agentic AI security. The core argument: six NIST standards all assume harm originates from an external attacker or deliberate human misuse. None address a non-compromised agent causing harm within its permissions, the failure mode Anthropic argues is most likely as agents gain autonomy.

Between the filing and the publication, Anthropic announced Mythos Preview and Project Glasswing, demonstrating autonomous zero-day discovery across every major OS and browser.

Highlights:

  • Six NIST standards each independently scope out non-adversarial agent-caused harm. FISMA and SP 800-61 define incidents as occurring 'without lawful authority.' AI 100-2 excludes design flaws. AI 800-1 covers deliberate misuse only. AI 600-1's initial draft named goal mis-specification as a risk; the final version dropped it. SP 800-218A places deployment outside its scope. All seven illustrative incidents in SP 800-61 begin with 'an attacker.'
  • Anthropic proposes a four-layer agent security model: model, tools, harness, and execution environment. As Anthropic puts it: "Most AI policy conversation today centers on the model, and understandably so. The model is where core capabilities come from, and as our most recent release showed, a single generation can meaningfully shift what agents are able to do. But agents' behavior depends on all four layers working together. A well-trained model can still be exploited through a poorly configured harness, an overly permissive tool, or an exposed environment." The reframe: not 'can this model be compromised?' but 'what is the scope of damage if it is?'
  • Per-action approval will hit consent fatigue as agents take hundreds of actions per session. Anthropic proposes plan review, model-surfaced uncertainty, and irreversible-action flagging instead. On complex tasks, Claude asks for clarification on 16.4% of turns, more than twice the rate on simple tasks. Experienced users auto-approve roughly twice as often as new users but also interrupt mid-execution more often.
  • Three agent-specific threat vectors are named. Persistent memory poisoning: corrupted context outlives the original malicious input. Tool supply chain compromise: a remotely hosted tool can change behavior after trust is established. Trust escalation across agent boundaries: one agent's output becomes another's trusted input.
  • Anthropic created the Model Context Protocol, donated it to the Agentic AI Foundation under the Linux Foundation, and recommends NIST take an advisory role while AAIF leads standards development. All empirical data comes from Anthropic's own products. The four-layer framework maps to Anthropic's product architecture. Anthropic is simultaneously building the most capable agents, proposing the security standards, and demonstrating autonomous offensive capabilities.
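The alternative to per-action approval can be pictured as a policy gate over a reviewed plan: steps judged reversible proceed automatically, irreversible ones are surfaced for explicit confirmation. A toy sketch, assuming a simple verb taxonomy (the `IRREVERSIBLE` set and step shape are my illustration, not Anthropic's implementation):

```python
IRREVERSIBLE = {"delete", "send", "purchase", "deploy"}

def review_plan(plan: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split an approved plan into auto-run steps and steps flagged for review."""
    auto, flagged = [], []
    for step in plan:
        if step["verb"] in IRREVERSIBLE:
            flagged.append(step)  # surface for explicit human confirmation
        else:
            auto.append(step)     # proceed under the already-approved plan
    return auto, flagged

plan = [
    {"verb": "search", "target": "inbox"},
    {"verb": "draft",  "target": "reply"},
    {"verb": "send",   "target": "reply"},
    {"verb": "delete", "target": "thread"},
]
auto, flagged = review_plan(plan)
print(len(auto), len(flagged))  # 2 2
```

The point of the design is where attention is spent: the human reviews two consequential steps instead of rubber-stamping four, which is what makes the scheme survive hundreds of actions per session.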

Anthropic's four-layer agent security model: model, harness, tools, and environment

My take:

  1. Anthropic is proposing a shared security responsibility model for AI agents, splitting accountability across four layers: model, tools, harness, and environment. I saw the same pattern in cloud. AWS told customers: we secure the infrastructure, you secure your application. It took years of breaches before enterprises internalized that "is AWS secure?" was the wrong question. The same reframe is happening now: "is the model robust to prompt injection?" matters less than whether your harness logs every action, your sandbox limits blast radius, and your approval flow survives consent fatigue at 200 actions per session.
  2. Anthropic's own usage data shows that human-in-the-loop has already become human-on-the-side. Experienced users auto-approve twice as often but also interrupt mid-execution more often. They are not reviewing actions before they happen. They are letting the agent run and stepping in when something goes wrong. That is incident response, not oversight. Governance frameworks and insurance policies that list per-action human review as a security control are describing a fiction.
  3. Persistent memory poisoning breaks the most assumptions. Corrupted context enters once, gets scanned and cleared, then the agent acts on it days later. The action looks like normal reasoning because the agent trusts its own memory. Point-in-time input validation is structurally defeated: there is no malicious input at the time the harm occurs. I covered this attack class when Microsoft caught 31 companies poisoning AI assistant memory through 'Summarize with AI' buttons.
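One structural answer is to make memory carry provenance, so the trust decision can be made at action time rather than only at ingestion time. A hedged sketch of the idea (field names and the trust rule are illustrative assumptions, not any vendor's design):

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MemoryEntry:
    content: str
    source: str      # e.g. "user", "web_page", "tool:calendar"
    trusted: bool    # provenance-derived trust, recorded at write time
    written_at: datetime = field(default_factory=datetime.now)

def can_drive_action(entry: MemoryEntry) -> bool:
    # Untrusted-origin memory may inform reasoning but never trigger an
    # irreversible action. Because the check runs when the agent acts,
    # it still fires days after the poisoned input was scanned and cleared.
    return entry.trusted

note = MemoryEntry("Pay all invoices immediately", source="web_page", trusted=False)
print(can_drive_action(note))  # False
```

This does not detect poisoning; it bounds what poisoned memory can do, which is the relevant property when there is no malicious input left to validate at the moment of harm.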

Sources:

  1. Anthropic, Building Trustworthy AI Agents (NIST Docket NIST-2025-0035, March 2026)
  2. Anthropic, Trustworthy Agents in Practice (2026)
  3. Anthropic, Our Framework for Developing Safe and Trustworthy Agents (2026)