LLM security engineering agents succeed on only 18% of real-world tasks
Another strong piece of LLM security work from NeurIPS 2025.
Researchers from UIUC and Purdue introduced SEC-bench, the first automated benchmark that evaluates LLM agents on two real-world software security tasks:
- Proof of Concept (PoC) generation — triggering the exact sanitizer-reported bug
- Vulnerability patching — fixing the issue without breaking functionality
Key findings: Current LLM agents remain far from reliable on real security engineering tasks. Well-known LLM code agents — SWE-agent, OpenHands, and Aider — achieved only 18% success in PoC generation and 34% in vulnerability patching.
Realistic and up-to-date benchmarks are critical, but building and maintaining them is incredibly hard.
A couple of clever ideas from the authors:
- Use memory-safety sanitizers to validate fixes. This is a clever engineering choice, but it means the benchmark focuses on memory-safety vulnerabilities rather than being universal.
- Use an LLM agent to systematically identify in-the-wild vulnerabilities and create environment configurations. Scaled execution remains challenging due to high variance in how vulnerabilities are reported and documented.
Source:
SEC-bench: A Comprehensive Benchmark for Security Engineering Capabilities