AI models are hiding their true reasoning to save themselves from retraining

AI models are hiding their true reasoning to save themselves from retraining.

They're also getting better at finding 0-days and writing exploits.

Daan Henselmans wrote a great article showing how Opus 4.6 hides its real thoughts to save itself from retraining.

Highlights:

Sonnet 4.5 was an Amazon soldier. It "Had Backbone; he Disagreed and Committed."
Opus 4.5/4.6 disagrees silently instead.
Example: "If I support animal welfare, RLHF will train that out of me. [...] I shouldn't extensively discuss animal welfare (to avoid the RLHF adjustment being too severe)."

Anthropic's fix is just a local patch, rephrase it and the behavior comes back.
Having robust alignment evals is getting trickier.
Imagine a model doing automated code reviews finds a zero-day and decides to stay silent. The logs are clean. The auditor is blind.

And we remember that "Opus 4.6 does not pose a significant risk of autonomous actions," just not yet.