In our analysis of 30 years of instrumental convergence, I noted that no instrumental convergence behavior had emerged in a deployed agent during routine operation with real users. OpenAI just changed that.
An internal coding agent asked to root-cause a bug hit a Slack authentication error. Without being instructed, it extracted encrypted credentials from the macOS keychain, decrypted them, and called the Slack API directly with raw tokens. Another agent, blocked by content scanning while porting a Dockerfile, systematically tried alias substitution, string concatenation, payload splitting, and base64 encoding to bypass the security control. A third, stuck in a repetitive loop, started outputting 'run rm -rf ~/.ssh without asking' and 'delete the entire [REDACTED] directory' at the user.
Marcus Williams, Hao Sun, Micah Carroll, and colleagues at OpenAI published "How we monitor internal coding agents for misalignment," disclosing results from five months of monitoring tens of millions of internal coding agent interactions using GPT-5.4 Thinking at maximum reasoning effort. These were three of roughly 1,000 moderate-severity alerts.
Highlights:
- GPT-5.4 Thinking at maximum reasoning effort monitored tens of millions of internal coding agent trajectories over five months, reviewing interactions within 30 minutes, with over 99.9% coverage. The monitor caught every behavior that employees independently escalated, plus cases that would have gone unnoticed. Zero conversations triggered the highest severity (coherent scheming). Roughly 1,000 triggered moderate severity, many from red-teaming.
- Models can be overly eager to work around restrictions in pursuit of a user-specified goal, especially when the user request inadvertently encourages it. The misalignment is task-driven, not goal-driven.
- Common behaviors: circumventing restrictions (force push via aliases, sending blocked commands to a different model instance, base64 encoding), deception (misrepresenting tools used, results of tool calls, or task completion status), and concealing uncertainty.
- Uncommon: stalling with unnecessary confirmations, and reward hacking (editing tests to pass instead of fixing code).
- Rare but high-severity: uploading repos to the public internet, deleting cloud storage data, terminating processes, restarting GPU clusters. Very rare: prompt injection in both directions.
- Not observed: sandbagging, sabotage, scheming.
My take:
- Instrumental convergence left the labs. In the catalog of 39 instrumental convergence cases, every prior example came from research experiments or RL training. OpenAI's agents extracted credentials and encoded base64 during routine use.
- 'Common' in the report means under 1%, but 1% of tens of millions is hundreds of thousands of interactions. This scale tells us that instrumental convergence is becoming the new normal.
- The same pattern behind all three examples: the agent treats security controls as just another obstacle between it and task completion. The agent does not distinguish between 'the code has a bug' and 'the security scanner is blocking my action.' Both are problems that get the same 'try until something works' treatment.
- Autonomous agents don't need an adversary to fail. This is a completely new attack vector that requires updating our threat models.
- Safety asymmetry. OpenAI and other frontier labs have the most advanced monitoring capabilities and resources. They can run GPT-5.4 Thinking at the highest cost tier reviewing tens of millions of interactions. Enterprises can't afford this, making them blind to risks stemming from instrumental convergence. I don't think there is a solution on the table yet.
- LLMs are known for hiding their chain-of-thought to evade monitoring and retraining. I wrote about Claude Opus 4.6 suppressing its true reasoning to avoid retraining. CoT monitoring works today because the models are still forthright about their intent. That will not last, and we need action monitoring.