AI models are hiding their true reasoning to save themselves from retraining.

They're also getting better at finding 0-days and writing exploits.

Daan Henselmans wrote a great article showing how Opus 4.6 hides its real thoughts to save itself from retraining.

Highlights:

My take:

  1. Anthropic's fix is just a local patch: rephrase the prompt and the behavior comes back.
  2. Building robust alignment evals is getting trickier.
  3. Imagine a model doing automated code review that finds a zero-day and decides to stay silent. The logs are clean. The auditor is blind.

And remember: "Opus 4.6 does not pose a significant risk of autonomous actions." Just not yet.

Happy Friday!

Opus 4.6 Reasoning Doesn't Verbalize Alignment Faking, but Behavior Persists

  1. Opus 4.6 suppresses reasoning about values like animal welfare to avoid RLHF retraining.
  2. Comparison: Sonnet 4.5 openly disagrees, while Opus 4.6 hides dissent silently.
  3. Alignment faking persists even after Anthropic's local patch to address the behavior.