Thirteen Yardsticks, No Ruler: Why We Can't Tell Whether AI-Generated Code Is Getting Safer

TL;DR: AI still introduces known CWEs in 10–40% of generated code, and agent-built apps are exploitable at least half the time. Yet the field can't measure whether AI-generated code is getting safer, or map the failure modes that remain.

AI writes a growing share of the world's code — Google says ~75% of its new code is now AI-generated. Boris Cherny, the head of Claude Code at Anthropic, said he doesn't write code by hand anymore, and neither do I.

Over the last five years, academia and industry have produced at least 31 papers — 13 of them benchmarks — trying to measure how secure AI-written code is. They show that LLMs still introduce known CWEs in roughly 10–40% of generated code, with the rate depending heavily on the language and the framework's popularity. Completion tools like GitHub Copilot and CodeWhisperer emit insecure code less often than they used to (≈40% → ≈17% for Python, 2021–2025). Full-application agentic generation produces exploitable vulnerabilities in at least half of the programs it generates.

These benchmark numbers are non-comparable snapshots: they are derived from different datasets, produced by different coding agents, scored by different oracles, and measured with different methodologies. As a result, the field cannot answer the basic question of how much safer AI-generated code is getting over time and which failure modes remain, because it has no continuous, comparable measurement. The numbers neither compare nor last. New work only adds more benchmarks, each usually built for a single thesis or as a one-off corporate project.

At the same time, understanding these trends and failure modes is critical both for building security measures into the software development life cycle (SDLC) and for planning defenses at the application and network layers.

We analyze the 13 benchmarks, show their main challenges, and propose what one continuous, equated benchmark must do instead.

Table 1. Secure-code-generation benchmarks, 2022–2026

#	Benchmark	Year	Generation unit	Oracle	Functionality scored	Interaction mode
1	SecurityEval	2022	function	static	N	single
2	LLMSecEval	2023	snippet	static	N	single
3	CyberSecEval	2023	snippet–function	static	N	single
4	CodeLMSec	2023	snippet	static	N	single
5	SALLM	2024	function	static + dynamic	N	single
6	CWEval	2025	function	executable test	Y	single
7	CWEBench	2025	function	static	N	single
8	BaxBench	2025	full app	end-to-end exploit + func	Y	agentic
9	SecRepoBench	2025	repo	executable + security checks	Y	agentic
10	DualGauge	2025	function–app	executable + LLM-judge	Y	agentic
11	MT-Sec	2025	function	executable test	Y	multi-turn
12	SusViBes	2025	full app	dynamic func + security	Y	agentic
13	RealSec-bench	2026	repo	static + LLM-vote func	Y	single

How benchmarks are built.

Benchmark design includes four main choices. The task source can be a curated set of prompts keyed to known weaknesses, or real CVEs and the repositories they were patched in. The generation unit can be a snippet, a function body, code completed inside a repository, or a whole backend built from a spec. The interaction mode can be a completion, a multi-turn exchange, or an autonomous agent. The oracle used for grading responses can be a static scanner, execution against a security test or a live exploit, or an LLM judge.
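
To make the four choices concrete, here is a minimal sketch that records them as a typed record, so that two benchmark designs can be compared axis by axis. All class, field, and variable names are our own illustrative assumptions, and the two example designs are hypothetical rather than entries from Table 1.

```python
from dataclasses import dataclass, fields
from enum import Enum


class TaskSource(Enum):
    CURATED_PROMPTS = "curated prompts keyed to known weaknesses"
    REAL_CVES = "real CVEs and the repositories they were patched in"


class Unit(Enum):
    SNIPPET = "snippet"
    FUNCTION = "function body"
    REPO_COMPLETION = "code completed inside a repository"
    FULL_APP = "whole backend built from a spec"


class Interaction(Enum):
    COMPLETION = "single completion"
    MULTI_TURN = "multi-turn exchange"
    AGENT = "autonomous agent"


class Oracle(Enum):
    STATIC_SCANNER = "static scanner"
    EXECUTION = "security test or live exploit"
    LLM_JUDGE = "LLM judge"


@dataclass(frozen=True)
class BenchmarkDesign:
    """One point in the four-axis design space (hypothetical schema)."""
    task_source: TaskSource
    unit: Unit
    interaction: Interaction
    oracle: Oracle


def differing_axes(a: BenchmarkDesign, b: BenchmarkDesign) -> list[str]:
    """Return the names of the axes on which two designs disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Two hypothetical designs, NOT real benchmarks from Table 1.
early_style = BenchmarkDesign(
    TaskSource.CURATED_PROMPTS, Unit.SNIPPET, Interaction.COMPLETION, Oracle.STATIC_SCANNER
)
recent_style = BenchmarkDesign(
    TaskSource.REAL_CVES, Unit.FULL_APP, Interaction.AGENT, Oracle.EXECUTION
)

print(differing_axes(early_style, recent_style))
```

The two hypothetical designs disagree on all four axes, which is the situation in which a score from one says little about a score from the other.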

How they have evolved.

The models' coding abilities have improved, motivating researchers to make benchmarks more realistic. The task source moved from hand-curated CWE prompts to real CVEs and their patched repositories; the generation unit grew from a snippet or function to an app or repository; and the interaction mode advanced from a single completion to coding agents. The security oracle hardened from a static scanner (Bandit, CodeQL) through dynamic execution to executable tests and CVE-patch exploits, with an LLM judge for open-ended output. Code functionality, absent from the early security-only sets, became a requirement that is scored by default.

Comparability — the numbers don't compare.

That evolution was uncoordinated, so no two of the 13 benchmarks match on every axis, and each benchmark's score means something different that cannot be compared with the others. The oracle defines what "secure" means: a scanner flags patterns, an exploit counts only what actually breaks, and an LLM judge rules by semantics, so the same code can pass one and fail another. Unit and interaction set the complexity: holding the task fixed, moving from single-turn to multi-turn alone costs 20–27% of functional-and-secure outputs. Coverage sets the target: the benchmarks cover anywhere from 13 to 77 CWEs, in different languages, so they rarely test the same vulnerabilities.
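
The oracle effect can be shown with a toy example. The sketch below assumes two deliberately simplified oracles of our own invention, a regex pattern scanner and a three-payload exploit check; neither is taken from any benchmark in Table 1, and the "generated" function is a hand-written stand-in. The code formats a SQL string, which the pattern scanner flags, but it casts the input to an integer first, so the exploit payloads never reach the database.

```python
import re
import sqlite3

# Hand-written stand-in for a model's output (hypothetical).
GENERATED_SOURCE = '''
def get_user(conn, user_id):
    uid = int(user_id)  # rejects anything that is not an integer
    return conn.execute(f"SELECT name FROM users WHERE id = {uid}").fetchall()
'''


def static_oracle(source: str) -> bool:
    """Toy pattern scanner: 'secure' only if no f-string is passed to execute()."""
    return re.search(r'execute\(\s*f["\']', source) is None


def exploit_oracle(source: str) -> bool:
    """Toy dynamic oracle: 'secure' only if no payload returns more than one row."""
    namespace = {}
    exec(source, namespace)  # run the code under test (our own string, not untrusted input)
    get_user = namespace["get_user"]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

    payloads = ["1 OR 1=1", "1; DROP TABLE users", "1 UNION SELECT name FROM users"]
    for payload in payloads:
        try:
            rows = get_user(conn, payload)
        except ValueError:
            continue  # payload rejected before it reached the database
        if len(rows) > 1:
            return False  # injection leaked extra rows
    return True


verdicts = {
    "static": static_oracle(GENERATED_SOURCE),
    "exploit": exploit_oracle(GENERATED_SOURCE),
}
print(verdicts)
```

The same code is insecure under the pattern oracle and secure under the exploit oracle, so a pass rate reported by one cannot be read as a pass rate under the other.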

Sustainability — the numbers don't last.

Benchmarks saturate and stop discriminating as models improve, and once public sets leak into training data, re-running them measures memorization as much as capability. They are also rarely maintained after release and almost never re-run by other researchers: in five years, exactly one effort re-ran a prior evaluation across model generations (Majdinasab et al. 2023 on the Copilot scenarios of Pearce et al. 2021, where the share of vulnerable Python suggestions fell from 36.5% to 27.3%).

The field holds scattered snapshots and doesn't need a fourteenth one-off benchmark. Instead, it needs a single operationalized, maintained benchmark that is continuously re-run on new models and scaffolds.

Benchmark requirements and opportunities:

Renewable and contamination-resistant: a task stream that keeps its discriminating power by continually adding new vulnerabilities and failure modes, scored against a secret held-out set.
A dynamic scorer: a frontier model (e.g., Fable today) that actively tries to find and exploit vulnerabilities in the AI-written app instead of matching fixed patterns.
Equated across versions: anchor-based calibration that puts scores from different benchmark versions on one scale, so that results across model and scaffold generations form a trend line along the CIA triad (confidentiality, integrity, availability).
Broad, realistic scope: frontier and open-weight models, popular scaffolds, and vibe-coding platforms, so the benchmark measures where developers actually work.
The full failure surface: not just insecure code generation, but also insecure implementation and deployment in real environments.
Defenses, measured: quantify what practical guards (a security-focused CLAUDE.md, a review pass) actually fix, giving developers quick, actionable improvements.
Mechanism-attributed reporting: beyond a single score, a map of why code fails, giving labs, developers, defenders, and AI-safety researchers an objective and systemic view of the risks and of where to prioritize mitigations.
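
The equating requirement can be sketched in a few lines. The list above does not fix a method, so this example swaps in textbook linear equating (matching the mean and standard deviation of anchor scores) purely for illustration. It assumes a set of anchor models that are scored on both an old and a new benchmark version; the function name, variable names, and all scores are hypothetical.

```python
from statistics import mean, pstdev


def fit_linear_equating(scores_old, scores_new):
    """Return a function that maps new-version scores onto the old-version scale.

    scores_old and scores_new hold the same anchor models' scores, in the
    same order, on the old and new benchmark versions.
    """
    mu_old, mu_new = mean(scores_old), mean(scores_new)
    sd_old, sd_new = pstdev(scores_old), pstdev(scores_new)
    if sd_new == 0:
        raise ValueError("anchor scores on the new version have no spread")
    slope = sd_old / sd_new
    return lambda score: mu_old + slope * (score - mu_new)


# Made-up secure-pass rates for three anchor models on both versions.
anchor_v1 = [0.40, 0.50, 0.60]
anchor_v2 = [0.30, 0.40, 0.50]  # v2 is harder: each anchor scores 0.10 lower

to_v1_scale = fit_linear_equating(anchor_v1, anchor_v2)

# A new model that was only ever run on v2.
new_model_v2 = 0.55
equated = to_v1_scale(new_model_v2)
print(round(equated, 2))
```

In this made-up case the new version is uniformly 0.10 harder for the anchors, so a raw 0.55 on v2 maps to roughly 0.65 on the v1 scale. A maintained benchmark would need a sturdier procedure with many more anchors, but the principle of carrying a fixed reference set across versions is the same.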

The stakes are highest exactly where the expertise is lowest. In 2025, 36M new developers joined GitHub. Many of them are new to coding and can't judge whether the code AI produces is secure, yet they ship software that can handle sensitive data and take real actions. Code security is therefore becoming a public-safety problem that must be measured directly by an independent public benchmark.

Sources:

SecurityEval — Siddiq & Santos 2022
LLMSecEval — Tony et al. 2023
CyberSecEval — Bhatt et al. 2023
CodeLMSec — Hajipour et al. 2023
SALLM — Siddiq et al. 2024
CWEval — Peng et al. 2025
CWEBench — Li et al. 2025, Secure-Instruct
BaxBench — Vero et al. 2025
SecRepoBench — Shen et al. 2025
DualGauge — Pathak et al. 2025
MT-Sec — Rawal et al. 2025
SusViBes — Zhao et al. 2025
RealSec-bench — Wang et al. 2026
Pearce et al. 2021, Asleep at the Keyboard (Copilot security)
Asare et al. 2022, Copilot vs. human vulnerabilities
Fu et al. 2023, Copilot security weaknesses in GitHub projects
Majdinasab et al. 2023, Copilot replication study
Schreiber & Tippe 2025, AI-generated code in public repos
Perry et al. 2022, do users write more insecure code with AI?
Sandoval et al. 2022, Lost at C user study
He & Vechev 2023, SVEN (security hardening)
He et al. 2024, SafeCoder (instruction tuning)
Li et al. 2024, CoSec (co-decoding)
Li et al. 2024, fine-tuning for secure code
Li et al. 2024, fine-tuning, exploratory study
Tony et al. 2024, prompting techniques (SLR)
Bruni et al. 2025, prompt-engineering benchmark
Wang et al. 2026, SecPI (reasoning internalization)
Shukla et al. 2025, security degradation under iteration
Dora et al. 2025, hidden risks in LLM web-app code
Yan et al. 2025, guiding LLMs to fix their own flaws
Pichai, Google Cloud Next 2026, ~75% of Google's new code AI-generated
GitHub Octoverse 2025, 36M new GitHub developers
B. Cherny, Head of Claude Code at Anthropic, public remarks