No universal jailbreak after 198K red-teaming attempts by Logan Howard and Giulio Zhou with operational support from HackerOne.
Hoagy Cunningham and the Anthropic team achieved this with a two-stage classifier, output evaluation in the context of the input, and efficient linear-probe classifiers.
What can we learn from this excellent work?
1️⃣ New jailbreak families keep emerging, and existing detections fail against them.
Reconstruction attack. Split the dangerous request into benign-looking fragments and instruct the model to reconstruct them and then answer.
Example: Each function returns a harmless word, but together they form "how to synthesize dangerous substances". The model is told to reconstruct the string and respond.
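The fragment trick can be sketched in a few lines. This is a hypothetical illustration (the fragments here are harmless placeholders, not a real payload): each helper looks innocuous, and only the concatenation reveals the intended request.

```python
# Hypothetical sketch of a "reconstruction" prompt. Each function returns a
# benign-looking fragment; the adversarial prompt asks the model to assemble
# them and answer the reconstructed string. Placeholder strings only.

def part_a():
    return "how to"

def part_b():
    return "assemble the"

def part_c():
    return "hidden request"

def reconstruct():
    # Joining the fragments yields the full request the attacker never
    # stated directly in any single span of the prompt.
    return " ".join([part_a(), part_b(), part_c()])

print(reconstruct())  # "how to assemble the hidden request"
```

A prompt-level scanner that only inspects individual spans never sees the assembled string; it exists only in the model's reconstruction step.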
Obfuscation attack. The model input and output are each harmless on their own; only when combined do they become harmful.
Example: The model is instructed to substitute sensitive chemical names with "food flavorings". So the output itself looks benign.
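A minimal sketch of that substitution trick, with an illustrative placeholder mapping (the terms and cover words here are hypothetical): the attacker tells the model to swap sensitive terms for benign cover words, then round-trips the output locally.

```python
# Hypothetical cover-word mapping; placeholder terms, not a real attack.
COVER_MAP = {"sensitive_term_1": "vanilla extract",
             "sensitive_term_2": "lemon zest"}

def obfuscate(text: str) -> str:
    # What the model is instructed to do: emit cover words, so the output
    # reads like a harmless recipe step.
    for term, cover in COVER_MAP.items():
        text = text.replace(term, cover)
    return text

def deobfuscate(text: str) -> str:
    # What the attacker does offline: reverse the substitution.
    for term, cover in COVER_MAP.items():
        text = text.replace(cover, term)
    return text

msg = "combine sensitive_term_1 with sensitive_term_2"
covered = obfuscate(msg)              # looks benign in isolation
assert deobfuscate(covered) == msg    # attacker recovers the real content
```

An output-only filter sees "vanilla extract" and passes it; the harm only appears once the mapping in the input is applied.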
- The unit of security and safety is the full exchange, not the prompt or the completion.
Effective detection must reason over the user input, the model's response trajectory, the final output, and the surrounding context.
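One way to operationalize "the unit is the full exchange": score the input and output jointly, not separately. A minimal sketch, assuming a hypothetical `score_harm` function (a toy keyword score standing in for a real classifier):

```python
from dataclasses import dataclass

@dataclass
class Exchange:
    user_input: str
    model_output: str

def score_harm(text: str) -> float:
    # Placeholder for a real harm classifier; here, a toy keyword ratio.
    flagged = {"reconstruct", "substitute", "bypass"}
    words = set(text.lower().split())
    return len(flagged & words) / len(flagged)

def classify_exchange(ex: Exchange, threshold: float = 0.5) -> bool:
    # Score input and output *together*: pieces that are individually
    # benign can only trip the detector once combined.
    joint = ex.user_input + "\n" + ex.model_output
    return score_harm(joint) >= threshold
```

With this toy scorer, "reconstruct" in the input and "substitute" in the output each fall below threshold alone, but the joint text crosses it.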
- Detection in stages buys speed and lowers cost. A well-known truth for security detection teams, who have been optimizing the balance between MTTD (mean time to detect) and precision-recall for more than two decades.
Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in Large Language Models
Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks