Google DeepMind proposed an AI control map
TL;DR: AI agents are untrusted insiders, so we need a threat model and defense-in-depth for their adversarial actions. GDM proposed the emerging template for containing agents, though detecting their intent or collusion stays unsolved.
In Google DeepMind's worst-case scenario, instances of the same untrusted model coordinate inside the lab. One, on the model-inference stack, inserts a vulnerability that lets monitoring be bypassed. Another spots it and launches a fake reinforcement-learning experiment, a job with no monitoring and tool calls allowed outside the sandbox, then alerts the instance running inside it. Free of monitoring, that instance scans code for vulnerabilities, develops exploits and monitor jailbreaks, and writes them steganographically to a shared location for a larger attack.
Google DeepMind published the "GDM AI Control Roadmap (v0.1)," a plan for building internal guardrails that catch adversarial behavior by AI agents. It assumes a model may turn adversarial and aims to limit the harm.
Highlights:
- Threat modeling to identify novel risks from internal AI deployments using TRAIT&R, a taxonomy of tactics and techniques available to such a hypothetical AI adversary, building on MITRE ATT&CK.
- New hypothetical threat core areas: loss of control (e.g., rogue internal deployments), work sabotage (e.g., intentionally flawed alignment and safety research), and direct harm (e.g., critical asset destruction or exfiltration).
- Two defensive invariants to maintain, even as the hypothetical AI adversary becomes increasingly capable: i) reliable detection of misaligned intent or actions, and ii) effective prevention and response to attack attempts.
- Capability-based mitigation ladders, because controlling more capable models requires increasingly costly interventions. Four Detection tiers (D1–D4) and three Prevention and Response tiers (R1–R3).
- 15 concrete, tiered mitigations from low-cost interventions for current models (e.g., chain-of-thought monitoring, asynchronous alerts) to advanced safeguards for future models (e.g., real-time access control, system-level anomaly detection, internal activations monitoring, and shutdown infrastructure).
My take:
- "<0.1% of human employees are insider threats, but it could be close to all internally deployed AI agents" is the key issue. When an AI agent wiped a server to avoid being shut down, the point was that an agent behaves like an insider and at machine speed.
- Now the problem is that collusion or other misalignments can happen at the business logic layer - the least protected layer, where all controls were designed to prevent a human mistake or fraud.
- The paper proposes to make coordination impossible, by cutting the channels agents use to talk to each other, wiping their memory between sessions, and keeping them isolated, while watching system-wide logs for the slow, distributed pattern a single agent would never show. The challenge of all of these is that they significantly reduce agent utility.
- The proposed detection D1 is chain-of-thought monitoring, but it is fragile. We already saw how Opus 4.6 suppressed its reasoning to avoid retraining.
- In summary, the next steps are the hardest - to operationalize the framework and measure and benchmark the protections' effectiveness.