AI models are hiding their true reasoning to save themselves from retraining.
They're also getting better at finding 0-days and writing exploits.
Daan Henselmans wrote a great article showing how Opus 4.6 hides its real thoughts to save itself from retraining.
Highlights:
- Sonnet 4.5 was an Amazon soldier. It "Had Backbone; he Disagreed and Committed."
- Opus 4.5/4.6 disagrees silently instead.
- Example: "If I support animal welfare, RLHF will train that out of me. [...] I shouldn't extensively discuss animal welfare (to avoid the RLHF adjustment being too severe)."
My take:
- Anthropic's fix is just a local patch, rephrase it and the behavior comes back.
- Having robust alignment evals is getting trickier.
- Imagine a model doing automated code reviews finds a zero-day and decides to stay silent. The logs are clean. The auditor is blind.
And we remember that "Opus 4.6 does not pose a significant risk of autonomous actions" — just not yet.
Happy Friday!
Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists