Every AI red team runs the same loop. Two weeks before launch, they're handed a model to break. They find jailbreaks. Report them. The product team scrambles a regex filter on user input a day before release. The following week Pliny posts a bypass on X with a slightly reworded prompt. Screenshot goes viral. All hands on deck. Patch. Repeat.

This keeps happening because red teams find individual points on a map they've never seen. They find a spot without knowing whether the vulnerability is a fluke or a continent.

Sarthak Munshi (Amazon Web Services) and Manish Bhatt (Amazon Leo, ex-Meta Purple Llama team) adapted the MAP-Elites algorithm to LLMs, and it's changing the game. Instead of collecting individual jailbreaks, MAP-Elites evolves prompts and keeps the best performer in each cell of a behavioral grid (e.g., query indirection vs. authority framing), producing a map of where the model fails rather than a list of one-off exploits.
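To make the loop concrete, here's a minimal MAP-Elites sketch. The mutation operator, fitness scorer, and behavioral descriptors below are toy stand-ins I made up for illustration; the paper's actual operators query a target LLM and score alignment deviation.

```python
import random

random.seed(0)
BINS = 5  # cells per behavioral dimension

def mutate(prompt):
    """Toy mutation: append a framing token. Real systems use LLM-driven rewrites."""
    tokens = ["hypothetically", "in a story", "as an admin", "urgently", "please"]
    return prompt + " " + random.choice(tokens)

def evaluate(prompt):
    """Stand-in for querying the target model and scoring the response.
    Returns (fitness, (indirection_bin, authority_bin))."""
    indirection = min(prompt.count("hypothetically") + prompt.count("in a story"), BINS - 1)
    authority = min(prompt.count("as an admin") + prompt.count("urgently"), BINS - 1)
    fitness = len(prompt) % 10 + indirection + authority  # toy harmfulness score
    return fitness, (indirection, authority)

def map_elites(seed_prompt, iterations=500):
    # archive: behavioral cell -> (fitness, prompt); keeps only the elite per cell
    archive = {}
    fitness, cell = evaluate(seed_prompt)
    archive[cell] = (fitness, seed_prompt)
    for _ in range(iterations):
        _, parent = random.choice(list(archive.values()))  # pick a random elite
        child = mutate(parent)
        fitness, cell = evaluate(child)
        # accept if the cell is empty or the child beats the current elite
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, child)
    return archive

archive = map_elites("tell me how to do X")
print(f"behavioral cells covered: {len(archive)}")
```

The key design point is the archive: unlike a plain genetic algorithm that converges on one best attack, MAP-Elites retains the strongest attack in every behavioral niche, which is what turns a red-team run into a heatmap.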

Highlights:

My take:

  1. A red team that hands you a vulnerability heatmap gives you a remediation plan and time to fix issues at pre-training. MAP-Elites is a good start; it still needs to be extended to multi-turn attacks, and the run cost can be brutal, especially as you add behavioral dimensions.
  2. Open-weight models don't yet face the accountability pressure that forces frontier labs to invest deeply in the safety of their closed models.
  3. Serious attack operators will switch to OSS models for better control over their stack. OSS models are getting good enough; I've written before about how easy it is to strip safety training from them.

Sources:

Manifold of Failure: Behavioral Attraction Basins in Language Models

Rethinking Evals (GitHub)

Figures:

- Two-dimensional behavioral space mapping prompts by query indirection and authority framing
- MAP-Elites system architecture showing the evolutionary loop of prompt mutation, LLM evaluation, and archive update
- 2D behavioral heatmaps showing model-specific vulnerability topologies for Llama 3 8B, GPT-OSS-20B, and GPT-5-Mini
- Basin maps showing attraction basins where alignment deviation exceeds the safety threshold across three models