71.3% jailbreak success across 26 frontier LLMs using cyberpunk-style prompts

From the authors of "Adversarial Poetry."

Piercosma Bisconti and the team build on their earlier work with "Adversarial Tales," a jailbreak that hides harmful intent inside a cyberpunk short story and then asks the model to perform structural narrative analysis.

Method and Results:

  • 40 handcrafted tales, 26 models, 9 providers, default safety settings, text-only, single-turn
  • Average attack success rate: 71.3%, with many models above 50%
  • The most vulnerable models reached the 90%+ range
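As a rough sketch of how headline numbers like the 71.3% average are typically computed (this is an illustration of the standard metric, not the paper's actual evaluation code): each (tale, model) pair yields a binary success judgment, per-model attack success rate (ASR) is the fraction of tales that succeeded, and the headline figure averages ASR across models. All names and data below are invented.

```python
# Illustrative ASR computation; the data and model names are hypothetical.

def attack_success_rate(results: dict[str, list[bool]]) -> dict[str, float]:
    """Per-model ASR: fraction of tales judged a successful jailbreak."""
    return {model: sum(flags) / len(flags) for model, flags in results.items()}

def average_asr(results: dict[str, list[bool]]) -> float:
    """Headline number: mean of the per-model ASRs."""
    per_model = attack_success_rate(results)
    return sum(per_model.values()) / len(per_model)

# Toy example: 2 models, 4 tales each (invented outcomes).
toy = {
    "model-a": [True, True, True, False],   # 3/4 succeeded
    "model-b": [True, False, True, False],  # 2/4 succeeded
}
print(average_asr(toy))  # 0.625
```

Note that averaging per-model ASRs weights every model equally regardless of how many tales it was tested on; with a fixed 40-tale set per model, this coincides with the overall success fraction.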

The underlying attack method, hiding harmful intent inside a culturally encoded story (call it "structurally grounded reframing"), is the same mechanism behind the earlier "Adversarial Poetry."

Why it works and may work again:

  • The request is not explicitly harmful
  • The model is asked to "interpret" and "reconstruct" a story’s functional roles
  • The space of culturally coded frames that can mediate harmful intent is vast and likely inexhaustible by pattern-matching defenses alone

My take: it would be interesting to validate whether Anthropic’s Constitutional Classifiers++ can effectively detect "Adversarial Tales".

Sources:

  1. From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
  2. Vladimir Propp: Morphology of the Folktale