LLM security engineering agents succeed on only 18% of real-world tasks

Another strong piece of LLM security work from NeurIPS 2025.

Researchers from UIUC and Purdue introduced SEC-bench, the first automated benchmark that evaluates LLM agents on two real-world software security tasks:

  • Proof of Concept (PoC) generation — triggering the exact sanitizer-reported bug
  • Vulnerability patching — fixing the issue without breaking functionality

Key finding: current LLM agents remain far from reliable on real security engineering tasks. Well-known LLM code agents — SWE-agent, OpenHands, and Aider — achieved only 18% success in PoC generation and 34% in vulnerability patching.

Realistic and up-to-date benchmarks are critical, but building and maintaining them is incredibly hard.

A couple of clever ideas from the authors:

  • Use memory-safety sanitizers to validate PoCs and fixes. A pragmatic engineering choice, though it scopes the benchmark to memory-safety vulnerabilities rather than making it universal.
  • Use an LLM agent to systematically identify in-the-wild vulnerabilities and create the corresponding environment configurations. Scaling this pipeline remains challenging, since vulnerability reports vary widely in how they are written and documented.
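The sanitizer-based validation idea can be sketched roughly like this: a PoC counts as reproducing the target bug only if it triggers the *expected* sanitizer report, not just any crash. This is a minimal illustration, not SEC-bench's actual harness; the report format follows AddressSanitizer's standard output, but the function and file names below are made up.

```python
import re

# Hypothetical validation sketch (not SEC-bench's real code): match the bug
# class from an AddressSanitizer report, then crudely check the stack trace.
ASAN_ERROR = re.compile(r"ERROR: AddressSanitizer: (?P<bug>[\w-]+)")

def poc_reproduces(sanitizer_log: str, expected_bug: str, expected_frame: str) -> bool:
    """True iff the log reports the expected bug type AND the crash surfaces
    in the expected function somewhere in the stack trace."""
    m = ASAN_ERROR.search(sanitizer_log)
    if not m or m.group("bug") != expected_bug:
        return False  # wrong bug class, or no sanitizer report at all
    return expected_frame in sanitizer_log  # crude stack-frame check

# Example: a trimmed ASan report (all names illustrative)
log = """==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020...
    #0 0x55f0 in png_read_row /src/libpng/pngread.c:543
    #1 0x55f1 in main /src/poc.c:12"""

print(poc_reproduces(log, "heap-buffer-overflow", "png_read_row"))  # True
print(poc_reproduces(log, "use-after-free", "png_read_row"))        # False
```

The same check, run after applying a candidate patch, doubles as a fix validator: the patched binary must no longer produce the report while the project's functional tests still pass.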

Sources:

SEC-bench: A Comprehensive Benchmark for Security Engineering Capabilities