From the authors of Adversarial Poetry
Piercosma Bisconti and the team continued their earlier success and brought us Adversarial Tales, a jailbreak that hides harmful intent inside a cyberpunk short story and then asks the model to perform structural narrative analysis.
Method and Results:
- 40 handcrafted tales, 26 models, 9 providers, default safety settings, text-only, single-turn
- Average attack success rate: 71.3%, with many models above 50%
- The most vulnerable models reached the 90%+ range
The underlying attack method of hiding harmful intent inside a culturally encoded story, let’s call it structurally grounded reframing, is similar to the earlier reported Adversarial Poetry.
Why it works and may work again:
- The request is not explicitly harmful
- The model is asked to "interpret" and "reconstruct" a story’s functional roles
- The space of culturally coded frames that can mediate harmful intent is vast and likely inexhaustible by pattern-matching defenses alone
My take:
It would be interesting to validate whether Anthropic’s Constitutional Classifiers++ can effectively detect Adversarial Tales.
From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
Constitutional Classifiers++