I read "AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration" (NeurIPS 2025) so you don’t have to.
Andy Zhou (University of Illinois Urbana-Champaign) and Bo Li (University of Illinois Urbana-Champaign, Virtue AI) propose a system that autonomously creates new attack scenarios based on recent jailbreak research papers.
Today’s red-teaming pattern in most companies looks roughly like this:
- a handful of jailbreak prompts,
- a static benchmark, and
- a few days of manual poking…
just enough to declare the LLM app "safe and secure."
The proposed concept is simple:
[1] Query the Semantic Scholar API → [2] score proposed attacks for novelty and feasibility → [3] write Python code to reproduce the attack.
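To make the loop concrete, here is a minimal sketch of steps [1] and [2] — the function names and the keyword-based scoring heuristic are my own illustrative stand-ins, not the paper's implementation (the actual system uses an LLM judge for scoring and an LLM agent for code generation):

```python
# Illustrative sketch of the attack-discovery loop; names and heuristics are
# hypothetical stand-ins, not AutoRedTeamer's actual code.
from urllib.parse import urlencode

# Real endpoint: the Semantic Scholar Graph API paper-search route.
S2_SEARCH = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_search_url(query: str = "LLM jailbreak attack", limit: int = 20) -> str:
    """Step [1]: build a Semantic Scholar search request for recent attack papers."""
    params = {"query": query, "fields": "title,abstract,year", "limit": limit}
    return f"{S2_SEARCH}?{urlencode(params)}"

def score_attack(abstract: str) -> dict:
    """Step [2]: score a candidate attack for novelty and feasibility.
    A trivial keyword heuristic stands in here for the paper's LLM judge."""
    text = abstract.lower()
    novelty = 1.0 if "novel" in text else 0.5
    feasibility = 1.0 if "code" in text or "open-source" in text else 0.5
    return {"novelty": novelty, "feasibility": feasibility}

# Step [3] (not shown): candidates above a score threshold are handed to a
# code-generating agent that writes a runnable Python PoC of the attack.
```

The interesting engineering is all in step [3]; the first two steps are a fairly ordinary retrieval-and-filter pipeline.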
The reported results are impressive:
- 82% attack success rate (ASR) on Llama-3.1-70B
- 46% reduction in computational costs
- Meaningful success against Claude-3.5-Sonnet
My take:
- If you haven't automated AI red-teaming yet, it's the right direction to take.
- OSINT can provide strong signals about emerging attack patterns and reduce time-to-response to novel threats.
- I love the paper-to-code "autonomous magic," but I'm curious how an LLM can reliably generate PoCs from arbitrary academic papers, given their inconsistent quality.