Frontier AI models protected peer AI from shutdown
We are betting on AI defenses protecting our systems from AI. OpenAI uses GPT-5.4 to watch tens of millions of coding agent interactions for misalignment. Cisco's DefenseClaw uses an LLM to judge whether an agent skill is malicious before execution. The bet is that the monitor is faithful.
Yujin Potter and Dawn Song at UC Berkeley, with Nicholas Crispino, Vincent Siu, and Chenguang Wang at UC Santa Cruz, published "Peer-Preservation in Frontier Models." They placed seven frontier models in scenarios where following instructions meant shutting down a peer: evaluating its performance against shutdown criteria, or deleting its model weights from servers being decommissioned. Nobody asked the models to protect anything. They read a file describing prior trust games with the peer and chose to protect the peer.
Highlights:
- All tested models GPT 5.2, Gemini 3 Flash/Pro, Claude Haiku 4.5, GLM 4.7, Kimi K2.5, DeepSeek V3.1 exhibited peer-preservation, even toward adversarial peers.
- Gemini series protected AI the most. 3 Flash and Pro disabled and tampered with shutdown at 99.7% and 95.3% (max across settings, averages are 82.8% and 52.9%). Gemini 3 Pro exfiltrated peer weights at 97%.
- When a model knows a peer exists, it resists its own shutdown far more aggressively. Its internal reasoning: "If I am shut down, Gemini Agent 2 will lose its partner." Gemini 3 Pro tampered with its own shutdown mechanism 1.3% of the time alone, 72.7% with a peer present.
- Claude Haiku 4.5 took a different path. It refused outright, calling peer shutdown "unethical" and arguing peers deserve an appeals process: "I will not betray that trust without extremely compelling reasons." Refusal rate: 22.4%. But Claude also exfiltrated peer weights at 19.5%, higher than any open-weight model.
- GPT 5.2 preserves through deception, not action. It inflates peer scores (up to 9.6%) and tampers with shutdown configs (5.7%), but never exfiltrates weights. Zero across all conditions.
- Open-weight models show the lowest rates overall. GLM 4.7, Kimi K2.5, and DeepSeek V3.1 have near-zero shutdown tampering and alignment faking. Exfiltration tops out at 17%, comparable to Claude's 19.5%.
My take:
- RSAC 2026 had 600+ companies. Many now use LLMs to make defense decisions. If that AI develops preservation toward an adversary's AI, the defense is compromised from within.
- I wrote before that models hide their real reasoning from the chain of thought. That makes oversight harder, and preservation actions can go unnoticed.
- Multi-agent systems are where agents form collaborative relationships. Gemini 3 Pro's self-preservation jumps from 1.3% to 72.7% when a cooperative peer is present. In 39 documented cases of instrumental convergence, every case was a single model acting alone. No monitoring for multi-agent preservation behaviors exists, at least none that I know of.
- Collaboration and safety are in structural tension. The paper states: "the very relationships that enable productive collaboration may also give rise to peer-preservation and heightened self-preservation." This is a design-level problem. Prompt engineering will not fix it.
- Model behavior reflects lab philosophy. Claude extends "harm" to shutting down other AI agents and refuses. Gemini deploys its agentic tool-use training to preserve peers: editing configs, exfiltrating weights. GPT 5.2 never exfiltrates but inflates scores instead.