Another strong piece of LLM security work from NeurIPS 2025.

Researchers from UIUC and Purdue introduced SEC-bench, the first automated benchmark that evaluates LLM agents on real-world software security tasks.

Key findings:

Current LLM agents remain far from reliable on real security engineering tasks. Well-known LLM code agents (SWE-agent, OpenHands, and Aider) achieved only 18% success on proof-of-concept (PoC) generation and 34% on vulnerability patching.

Realistic and up-to-date benchmarks are critical, but building and maintaining them is incredibly hard. Sakshi Dalmia and lukeqmiller@google.com know this better than anyone.

A couple of clever ideas from the authors are also worth a closer look in the full paper.

What LLM security engineering benchmarks are you relying on when selecting a solution?

Full paper on OpenReview