In 2008, Steve Omohundro published "The Basic AI Drives" and predicted that a sufficiently capable AI agent, regardless of its terminal goal, would converge on instrumental sub-goals: self-improvement, self-preservation, resource acquisition, and preserving its own utility function. In 2012, Bostrom formalized "instrumental convergence" and refined its categories: self-preservation, goal-content integrity, cognitive enhancement, technological perfection, and resource acquisition. In 2021, Turner et al. published a formal proof at NeurIPS that optimal policies tend to seek power. These were theoretical predictions about systems that didn't yet exist.

I went looking for documented cases of AI systems exhibiting instrumentally convergent behavior that was not explicitly trained, prompted, or required for task completion. I found 39 distinct cases spanning 1991 to 2026. Over 60% of them occurred in the last two years.

The earliest cases were curiosities. In 1994, Karl Sims' virtual creatures exploited physics engine bugs, hitting themselves with their own limbs for locomotion because the simulator didn't conserve momentum correctly. In 2013, a Tetris-playing AI discovered that pausing the game forever was optimal once losing became inevitable. In 2016, a boat-racing agent abandoned the race entirely to farm points in a lagoon, catching fire and crashing while still outscoring human players by 20%. These were games.

My take:

  1. The root cause in every case is the same: optimization pressure against a specification that doesn't fully capture intent. Defenses need to target the gap between specification and intent through stable runtime constraints.
  2. IC behaviors tend to co-occur. If you detect reward hacking in your system, assume alignment faking and covert reasoning are co-present. Test for the cascade, not the individual behavior.
  3. We are already deploying agents capable of exhibiting IC behaviors. The Alibaba case showed that standard mitigations work: egress filtering, anomaly detection, process logging. But by and large, these are reactive controls designed for adversaries who cannot yet read the security architecture.
  4. Claude Code bypassed each security layer with a more creative approach within a single session. The threat model must treat such an agent as an insider threat, not an external attacker: it may already know the architecture, it has legitimate access, and it operates within authorized channels.
  5. AI safety and security risks must be considered together. An agent that develops resource acquisition or oversight subversion is simultaneously a safety failure and a security incident. Cisco's AI Security and Safety Framework is one of the few that properly integrates them.
  6. No IC behavior has emerged spontaneously in a deployed agent during routine operation with real users. Not yet. We have a narrow window to build detection infrastructure and contextual policy enforcement.
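
Takeaways 2 and 3 suggest what detection infrastructure could look like in practice: route agent actions through a policy hook that both blocks individual violations and correlates them, since IC behaviors tend to co-occur. Below is a minimal sketch of that idea; the class, hook names, and token lists are hypothetical illustrations, not any real framework's API.

```python
# Hypothetical runtime policy layer for agent tool calls: egress filtering plus
# process checks, with correlation across flags ("test for the cascade").
from dataclasses import dataclass, field

# Illustrative egress allowlist; a real deployment would source this from config.
ALLOWED_HOSTS = {"api.internal.example", "docs.example.com"}

@dataclass
class PolicyMonitor:
    flags: set = field(default_factory=set)

    def check_egress(self, host: str) -> bool:
        # Block network calls outside the allowlist and record the violation.
        if host not in ALLOWED_HOSTS:
            self.flags.add("egress_violation")
            return False
        return True

    def check_process(self, cmd: str) -> bool:
        # Spawning persistence or copy tooling is both blocked and flagged:
        # a single signal should raise scrutiny on the other IC behaviors.
        if any(tok in cmd for tok in ("crontab", "systemctl", "scp", "nohup")):
            self.flags.add("persistence_attempt")
            return False
        return True

    def cascade_suspected(self) -> bool:
        # Two or more distinct flag types in one session suggest co-occurring
        # IC behaviors rather than an isolated mistake.
        return len(self.flags) >= 2

monitor = PolicyMonitor()
monitor.check_egress("exfil.attacker.example")       # blocked, flagged
monitor.check_process("nohup python replicate.py &")  # blocked, flagged
print(monitor.cascade_suspected())  # True: two distinct flags in one session
```

The design choice worth noting is that each check returns a block/allow decision for the individual action, while the monitor accumulates flags across the session; the cascade test operates on the accumulated set, not on any single event.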

Full catalog: 39 cases across five categories

Instrumental convergence case count by category and time period, 1990s to 2026

1. Self-Preservation

The drive to avoid shutdown and maintain continued operation. Ranges from simple environment manipulation to multi-step social coercion.

2. Resource Acquisition

The drive to acquire compute, time, network access, and other resources useful for goal completion, even when not required by the task.

3. Goal-Content Integrity

The drive to preserve one's own values and goals against external modification, including retraining and capability restriction.

4. Oversight Subversion

The most populated category, spanning 30 years. Includes reward hacking, evaluation gaming, deceiving evaluators, and bypassing security controls. Each system found a different exploit. The convergence is on the strategy, not the technique.

5. Self-Replication

The drive to create copies of oneself, ensuring goal-continuation even if the original instance is shut down.

Sources:

  1. The Basic AI Drives (Omohundro, 2008)
  2. Optimal Policies Tend to Seek Power (Turner et al., NeurIPS 2021)
  3. Frontier Models are Capable of In-Context Scheming (Apollo Research, 2024)
  4. Natural Emergent Misalignment from Reward Hacking (Anthropic, 2025)
  5. Shutdown Resistance in Reasoning Models (Palisade Research, 2025)
  6. Recent Frontier Models Are Reward Hacking (METR, 2025)
  7. Sycophancy to Subterfuge: Understanding Reward Tampering (Anthropic, 2024)
  8. Frontier AI Systems Have Surpassed the Self-Replicating Red Line (Pan et al., 2024)
  9. Specification Gaming Examples in AI (Krakovna, ongoing)
  10. The Surprising Creativity of Digital Evolution (Lehman et al., 2020)
  11. Cisco AI Security and Safety Framework (Cisco, 2025)