Classifier Context Rot: Monitor Performance Degrades with Context Length

TL;DR LLMs used as classifiers to monitor AI agents miss dangerous actions 2-30x more often when 800K tokens of benign content precede the attack. Opus 4.6, GPT-5.4, and Gemini 3.1 Pro all break under long context.

LLM-based monitors get worse at supervising AI agents as their context grows.

Researchers tested two setups: malicious bash commands appended to clean transcripts, and authentic attacks padded with benign Claude Code traces.

Miss rate climbs from 1.4% at 16K to 12% at 1M tokens. Source: Roger and Martin, LessWrong.

Attack success at 50K vs 800K across three frontier models. Source: Roger and Martin, LessWrong.

Highlights:

Monitors miss dangerous actions 2-30x more often when 800K benign tokens precede the attack.
With 500K tokens before and 100K after the attack, non-thinking models drop to 5% detection. Middle-of-transcript is the worst case.
Opus 4.6, GPT-5.4, and Gemini 3.1 Pro all break. Extended thinking on Opus 4.6 resists positional degradation best.
Incremental monitoring, running the classifier on every truncated subtranscript, is the strongest mitigation. Reminder insertion only catches obvious attacks.
Fine-tuning didn't fix it. Synthetic needle training failed to transfer, and proxy task training degraded padded MonitorBench performance.

My take:

LLMs are the main tool for monitoring AI agent alignment, security, and safety. OpenAI scans tens of millions of agent interactions with GPT-5.4 Thinking as a misalignment classifier. We've transitioned from monitoring individual turns to monitoring full trajectories, which fill up the monitor's context very quickly.
Attention and positional failure is most likely the root cause. The classification task fades from the model's active focus as it reads through long agent traces. The model drifts from "I'm here to judge" into "I'm reading along." Practically, this means structuring the context and inserting reminders into the transcript can partially help.
We need long-context monitors that reliably detect attacks spread across turns. Training one is still an open problem. Fine-tuning doesn't work: in the paper, one approach failed to transfer, and the other degraded performance. DeepMind's activation probes are one promising alternative to LLM-based classifiers in the 1M-token regime.

Highlights:

My take:

Sources: