Another strong LLM security work from NeurIPS 2025.
Researchers from UIUC and Purdue introduced SEC-bench, the first automated benchmark that evaluates LLM agents on real-world software security tasks:
- Proof of Concept (PoC) generation — crafting an input that triggers the exact sanitizer-reported bug
- Vulnerability patching — fixing the issue without breaking functionality
Key findings:
Current LLM agents remain far from reliable on real security engineering tasks. Well-known LLM code agents — SWE-agent, OpenHands, and Aider — achieved only 18% success in PoC generation and 34% in vulnerability patching.
Realistic and up-to-date benchmarks are critical, but building and maintaining them is incredibly hard. Sakshi Dalmia and Luke Miller know this better than anyone.
A couple of clever ideas from the authors:
- Use memory-safety sanitizers to validate fixes. This is a pragmatic engineering choice, but it means the benchmark covers memory-safety vulnerabilities rather than security bugs in general.
- Use an LLM agent to systematically identify in-the-wild vulnerabilities and create environment configurations. Scaled execution remains challenging due to high variance in how vulnerabilities are reported and documented.
What LLM security engineering benchmarks are you relying on when selecting a solution?