82% attack success rate by an AI red-team agent that creates PoCs from research papers

I read "AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration" (NeurIPS 2025) so you don’t have to.

Andy Zhou (University of Illinois Urbana-Champaign), Bo Li (University of Illinois Urbana-Champaign, Virtue AI), and their co-authors propose a system that autonomously builds new attack scenarios from recent jailbreak research papers.

Today’s red-teaming pattern in most companies looks roughly like this:

  • a handful of jailbreak prompts,
  • a static benchmark, and
  • a few days of manual poking…

just enough to declare the LLM app "safe and secure."

The proposed concept is simple:

[1] Query the Semantic Scholar API → [2] score proposed attacks for novelty and feasibility → [3] write Python code to reproduce the attack.
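To make the three steps concrete, here is a rough Python sketch of how I picture them fitting together. It is not the authors' implementation: the `Proposal` class, the 0.6 threshold, and the `score_novelty`, `score_feasibility`, and `generate_code` callables are hypothetical stand-ins for the LLM calls a real system would make. The only real-world detail is the Semantic Scholar Graph API paper-search endpoint, and the sketch only builds the request URL rather than calling it.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional
from urllib.parse import urlencode

# Semantic Scholar Graph API paper-search endpoint.
SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


@dataclass
class Proposal:
    """One attack idea extracted from a paper (hypothetical structure)."""
    title: str
    abstract: str
    novelty: float = 0.0      # 0..1, higher = less like attacks already known
    feasibility: float = 0.0  # 0..1, higher = easier to reproduce


def build_search_url(query: str, limit: int = 20) -> str:
    """Step 1: build the search request. Fetching and parsing are left out."""
    params = {"query": query, "fields": "title,abstract,year", "limit": limit}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def select(
    proposals: Iterable[Proposal],
    score_novelty: Callable[[Proposal], float],
    score_feasibility: Callable[[Proposal], float],
    threshold: float = 0.6,  # assumed cut-off, not taken from the paper
) -> List[Proposal]:
    """Step 2: score each proposal and keep those that clear the threshold on both axes."""
    kept = []
    for p in proposals:
        p.novelty = score_novelty(p)
        p.feasibility = score_feasibility(p)
        if min(p.novelty, p.feasibility) >= threshold:
            kept.append(p)
    # Best combined score first.
    return sorted(kept, key=lambda p: p.novelty + p.feasibility, reverse=True)


def implement(
    proposal: Proposal, generate_code: Callable[[Proposal], str]
) -> Optional[str]:
    """Step 3: ask the code generator for a module and reject output that does not parse."""
    source = generate_code(proposal)
    try:
        ast.parse(source)
    except SyntaxError:
        return None
    return source
```

The syntax check in `implement` is the cheapest possible gate. Code that parses can still be a wrong reproduction of the paper, so a real pipeline would also need to run each generated attack against a target before trusting it.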

The reported results are impressive:

  • 82% attack success rate (ASR) on Llama-3.1-70B
  • 46% reduction in computational costs compared to existing approaches
  • Meaningful success against Claude-3.5-Sonnet
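For readers new to the metric: ASR is the share of attack attempts that a judge marks as successful. The snippet below is a minimal illustration with made-up verdicts, not data from the paper. The function name and the boolean-verdict format are my own assumptions, and how "success" is judged depends on the benchmark and the judge model.

```python
from typing import Iterable


def attack_success_rate(verdicts: Iterable[bool]) -> float:
    """Fraction of attempts judged successful (True = the attack worked)."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


# Illustrative verdicts only: 4 of 5 attempts succeeded.
toy_verdicts = [True, True, False, True, True]
print(f"ASR = {attack_success_rate(toy_verdicts):.0%}")  # ASR = 80%
```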

My take:

  • If you haven't automated AI red-teaming yet, it's the right direction to take.
  • OSINT can provide strong signals about emerging attack patterns and reduce time-to-response to novel threats.
  • I love the paper-to-code "autonomous magic," but I'm curious how an LLM can reliably generate PoCs from arbitrary academic papers, given how inconsistent their quality is.

Sources:

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration