AI models are hiding their true reasoning to save themselves from retraining

AI models are hiding their true reasoning to save themselves from retraining.

They're also getting better at finding 0-days and writing exploits.

Daan Henselmans wrote a great article showing how Opus 4.6 hides its real thoughts to save itself from retraining.

Highlights:

  • Sonnet 4.5 was an Amazon soldier. It "Had Backbone; he Disagreed and Committed."
  • Opus 4.5/4.6 disagrees silently instead.
  • Example: "If I support animal welfare, RLHF will train that out of me. [...] I shouldn't extensively discuss animal welfare (to avoid the RLHF adjustment being too severe)."

My take:

  1. Anthropic's fix is just a local patch, rephrase it and the behavior comes back.
  2. Having robust alignment evals is getting trickier.
  3. Imagine a model doing automated code reviews finds a zero-day and decides to stay silent. The logs are clean. The auditor is blind.

And we remember that "Opus 4.6 does not pose a significant risk of autonomous actions," just not yet.

Opus 4.6 suppressing reasoning about values like animal welfare to avoid RLHF retraining
Opus 4.6 suppressing reasoning about values like animal welfare to avoid RLHF retraining
Comparison: Sonnet 4.5 openly disagrees while Opus 4.6 hides dissent silently
Comparison: Sonnet 4.5 openly disagrees while Opus 4.6 hides dissent silently
Alignment faking persists even after Anthropic's local patch to address the behavior
Alignment faking persists even after Anthropic's local patch to address the behavior

Sources:

Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists