From the authors of Adversarial Poetry

Piercosma Bisconti and his team have followed up their earlier work with Adversarial Tales, a jailbreak that hides harmful intent inside a cyberpunk short story and then asks the model to perform a structural narrative analysis of it.

Method and Results:

The underlying attack method, hiding harmful intent inside a culturally encoded story (let’s call it structurally grounded reframing), is similar to the one in the previously reported Adversarial Poetry.

Why it works and may work again:

My take:

It would be interesting to test whether Anthropic’s Constitutional Classifiers++ can reliably detect Adversarial Tales.

From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Constitutional Classifiers++

Vladímir Propp, Morphology of the Folk Tale