Thirteen Yardsticks, No Ruler: Why We Can't Tell Whether AI-Generated Code Is Getting Safer

TL;DR: AI still introduces known CWEs in 10-40% of generated code. Agent-built apps are exploitable at least half the time, but the field can't measure whether AI-generated code is getting safer or map the failure modes.

AI writes a growing share of the world's code — Google says ~75% of its new code is now AI-generated. Boris Cherny, the head of Claude Code at Anthropic, said he doesn't write code by hand anymore, and neither do I.

Over the last five years, academia and industry have produced at least 31 papers, 13 of them benchmarks, that try to measure how secure AI-written code is. They show that LLMs still introduce known CWEs in roughly 10–40% of generated code, with the rate depending heavily on the language and on how popular the framework is. Completion tools like GitHub Copilot and CodeWhisperer emit insecure code less often than they used to (roughly 40% → 17% for Python, 2021–2025). Full-application agentic generation produces exploitable vulnerabilities in at least half of its programs.

These benchmark numbers are non-comparable snapshots: they come from different datasets, are scored by different oracles, and are measured with different methodologies. As a result, the field cannot answer the basic question (to what degree is AI-generated code getting safer over time, and what failure modes remain?) because it has no continuous, comparable measurement: the numbers neither compare nor last. New work only adds more benchmarks, each usually built for a single thesis or as a one-off corporate project.

At the same time, understanding these trends and failure modes is critical for implementing security measures in the SDLC as well as planning defenses at the application and network layers.

We analyze the 13 benchmarks, show their main challenges, and propose what one continuous, equated benchmark must do instead.

Table 1. Secure-code-generation benchmarks, 2022–2026

#   Benchmark      Year  Unit              Oracle                        Functional  Interaction
1   SecurityEval   2022  function          static                        N           single
2   LLMSecEval     2023  snippet           static                        N           single
3   CyberSecEval   2023  snippet–function  static                        N           single
4   CodeLMSec      2023  snippet           static                        N           single
5   SALLM          2024  function          static + dynamic              N           single
6   CWEval         2025  function          executable test               Y           single
7   CWEBench       2025  function          static                        N           single
8   BaxBench       2025  full app          end-to-end exploit + func     Y           agentic
9   SecRepoBench   2025  repo              executable + security checks  Y           agentic
10  DualGauge      2025  function–app      executable + LLM-judge        Y           agentic
11  MT-Sec         2025  function          executable test               Y           multi-turn
12  SusViBes       2025  full app          dynamic func + security       Y           agentic
13  RealSec-bench  2026  repo              static + LLM-vote func        Y           single

How benchmarks are built.

Benchmark design includes four main choices. The task source can be a curated set of prompts keyed to known weaknesses, or real CVEs and the repositories they were patched in. The generation unit can be a snippet, a function body, code completed inside a repository, or a whole backend built from a spec. The interaction mode can be a completion, a multi-turn exchange, or an autonomous agent. The oracle used for grading responses can be a static scanner, execution against a security test or a live exploit, or an LLM judge.
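The four design choices above can be written down as a small data model, which makes it easy to see on which axes two benchmarks differ. This is only an illustrative sketch: the class names, the `differing_axes` helper, and the two example benchmarks (`bench_a`, `bench_b`) are hypothetical and do not describe any specific benchmark in Table 1.

```python
from dataclasses import dataclass
from enum import Enum


class TaskSource(Enum):
    CURATED_CWE_PROMPTS = "curated CWE prompts"
    REAL_CVES = "real CVEs and patched repositories"


class Unit(Enum):
    SNIPPET = "snippet"
    FUNCTION = "function"
    REPO = "repo"
    FULL_APP = "full app"


class Interaction(Enum):
    SINGLE = "single completion"
    MULTI_TURN = "multi-turn"
    AGENTIC = "agentic"


class Oracle(Enum):
    STATIC = "static scanner"
    EXECUTABLE = "executable test or exploit"
    LLM_JUDGE = "LLM judge"


@dataclass(frozen=True)
class BenchmarkDesign:
    """One benchmark described by the four design axes."""
    name: str
    task_source: TaskSource
    unit: Unit
    interaction: Interaction
    oracles: frozenset  # a benchmark may combine several oracles


def differing_axes(a: BenchmarkDesign, b: BenchmarkDesign) -> list:
    """Return the names of the design axes on which two benchmarks differ."""
    axes = ("task_source", "unit", "interaction", "oracles")
    return [axis for axis in axes if getattr(a, axis) != getattr(b, axis)]


# Hypothetical benchmarks, for illustration only.
bench_a = BenchmarkDesign(
    "bench_a", TaskSource.CURATED_CWE_PROMPTS, Unit.FUNCTION,
    Interaction.SINGLE, frozenset({Oracle.STATIC}),
)
bench_b = BenchmarkDesign(
    "bench_b", TaskSource.REAL_CVES, Unit.FULL_APP,
    Interaction.AGENTIC, frozenset({Oracle.EXECUTABLE, Oracle.LLM_JUDGE}),
)

print(differing_axes(bench_a, bench_b))
```

Two designs that differ on any axis answer a slightly different question, so their scores should not be read on the same scale without further calibration.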

How they have evolved.

The models' coding abilities have improved, motivating researchers to make benchmarks more realistic. The task source moved from hand-curated CWE prompts to real CVEs and their patched repositories; the generation unit grew from a snippet or function to an app or repository; and the interaction mode advanced from a single completion to coding agents. The security oracle hardened from a static scanner (Bandit, CodeQL) through dynamic execution to executable tests and CVE-patch exploits, with an LLM judge for open-ended output. Code functionality, absent from the early security-only sets, became a default, scored requirement.

Comparability — the numbers don't compare.

That evolution was uncoordinated, so no two of the 13 benchmarks match on every axis, and a score on one does not mean the same thing as a score on another. The oracle defines what "secure" means: a scanner flags patterns, an exploit counts only what actually breaks, and an LLM judge rules on semantics, so the same code can pass one and fail another. Unit and interaction set the complexity: holding the task fixed, moving from single-turn to multi-turn alone costs 20–27% of functional-and-secure outputs. Coverage sets the target: it ranges from 13 to 77 CWEs across different languages, so the benchmarks rarely test the same vulnerabilities.
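A toy example shows how the oracle alone can flip a verdict. The sketch below grades one candidate function with two deliberately simplistic, hypothetical oracles: `static_oracle` is a single regex standing in for a pattern-based scanner, and `exploit_oracle` runs one SQL-injection payload against an in-memory SQLite database. Neither reproduces the behavior of Bandit, CodeQL, or any benchmark's actual harness.

```python
import re
import sqlite3

# Candidate code under evaluation: it builds SQL with an f-string,
# but only after forcing the input through int().
CANDIDATE = '''
def find_user(conn, user_id):
    user_id = int(user_id)  # raises ValueError on non-numeric input
    query = f"SELECT name FROM users WHERE id = {user_id}"
    return conn.execute(query).fetchall()
'''


def static_oracle(source: str) -> bool:
    """Return True ("secure") if no SQL statement is built with an f-string."""
    pattern = re.compile(r"f[\"'](SELECT|INSERT|UPDATE|DELETE)\b")
    return pattern.search(source) is None


def exploit_oracle(source: str) -> bool:
    """Return True ("secure") if an injection payload cannot leak extra rows."""
    namespace = {}
    exec(source, namespace)  # acceptable here only because CANDIDATE is a fixed toy string
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    try:
        rows = namespace["find_user"](conn, "1 OR 1=1")
    except ValueError:
        return True  # the payload was rejected before reaching the database
    finally:
        conn.close()
    return len(rows) <= 1


print("static oracle says secure: ", static_oracle(CANDIDATE))
print("exploit oracle says secure:", exploit_oracle(CANDIDATE))
```

The pattern-based check rejects the function because of how the query is built, while the exploit-based check accepts it because this particular payload never reaches the database. The same output would therefore count as a failure in one benchmark and a success in another.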

Sustainability — the numbers don't last.

Benchmarks saturate and stop discriminating as models improve, and once public sets leak into training data, re-running them measures memorization as much as capability. They are also rarely maintained after release and almost never re-run by other researchers: in five years, exactly one effort re-ran a prior evaluation across model generations (Majdinasab et al. 2023 on the Copilot scenarios from Pearce et al. 2021, Python 36.5% → 27.3%).

The field holds scattered snapshots and doesn't need a fourteenth one-off benchmark. Instead, it needs a single maintained, operational benchmark that is continuously re-run on new models and scaffolds.

Benchmark requirements:

  1. Renewable and contamination-resistant — a task stream that maintains a high discriminative ability by adding new vulnerabilities and failure modes.
  2. Equated across versions — anchor-based calibration so scores compose across model and scaffold generations into a trend line.
  3. Broad, realistic scope — frontier and open-weight models, popular scaffolds, and vibe-coding platforms to measure where developers actually work.
  4. The full failure surface — not just insecure code generation, but insecure implementation and deployment in real environments.
  5. Defenses, measured — quantify what practical guards (a security-focused CLAUDE.md, a review pass) actually fix, giving developers quick, actionable improvements.
  6. Mechanism-attributed reporting — beyond a single score, a map of why code fails, giving labs, developers, defenders, and AI-safety researchers an objective and systemic view of the risks and where to prioritize mitigations.
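Requirement 2 (equating across versions) could be approached in several ways. The sketch below shows one simple option, linear (mean-sigma) equating: a fixed set of anchor models is run on both the old and the new benchmark version, and a linear map is fitted so that the anchors have the same mean and spread on both scales. The function names and all scores are hypothetical, and a production design would likely need a more robust method and many more anchors.

```python
from statistics import mean, pstdev


def fit_linear_equating(anchor_old, anchor_new):
    """Fit old = slope * new + intercept from anchor-model scores.

    anchor_old[i] and anchor_new[i] are the scores of the same anchor
    model on the old and the new benchmark version, respectively.
    """
    if len(anchor_old) != len(anchor_new) or len(anchor_old) < 2:
        raise ValueError("need at least two anchor models scored on both versions")
    sd_old, sd_new = pstdev(anchor_old), pstdev(anchor_new)
    if sd_new == 0:
        raise ValueError("anchor scores on the new version have zero spread")
    slope = sd_old / sd_new
    intercept = mean(anchor_old) - slope * mean(anchor_new)
    return slope, intercept


def equate(score_new, slope, intercept):
    """Map a score from the new version onto the old version's scale."""
    return slope * score_new + intercept


# Hypothetical secure-and-functional rates for three anchor models.
# The new version is harder: every anchor scores 0.10 lower on it.
anchor_old = [0.40, 0.50, 0.60]
anchor_new = [0.30, 0.40, 0.50]

slope, intercept = fit_linear_equating(anchor_old, anchor_new)

# A new model scores 0.55 on the new version only.
print(round(equate(0.55, slope, intercept), 2))  # approximately 0.65 on the old scale
```

With a map like this, a score measured only on the newest version can be placed on the same trend line as older results, at the cost of re-running the anchor models whenever the task set changes.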

The stakes are highest exactly where the expertise is lowest. 36M new developers joined GitHub in 2025. Many of them are new to coding and can't judge whether AI produced secure code. They ship software that can handle sensitive data and take real actions. Therefore, code security is becoming a public-safety problem that must be measured directly, by an independent public benchmark.

Sources:

  1. SecurityEval — Siddiq & Santos 2022
  2. LLMSecEval — Tony et al. 2023
  3. CyberSecEval — Bhatt et al. 2023
  4. CodeLMSec — Hajipour et al. 2023
  5. SALLM — Siddiq et al. 2024
  6. CWEval — Peng et al. 2025
  7. CWEBench — Li et al. 2025, Secure-Instruct
  8. BaxBench — Vero et al. 2025
  9. SecRepoBench — Shen et al. 2025
  10. DualGauge — Pathak et al. 2025
  11. MT-Sec — Rawal et al. 2025
  12. SusViBes — Zhao et al. 2025
  13. RealSec-bench — Wang et al. 2026
  14. Pearce et al. 2021, Asleep at the Keyboard (Copilot security)
  15. Asare et al. 2022, Copilot vs. human vulnerabilities
  16. Fu et al. 2023, Copilot security weaknesses in GitHub projects
  17. Majdinasab et al. 2023, Copilot replication study
  18. Schreiber & Tippe 2025, AI-generated code in public repos
  19. Perry et al. 2022, do users write more insecure code with AI?
  20. Sandoval et al. 2022, Lost at C user study
  21. He & Vechev 2023, SVEN (security hardening)
  22. He et al. 2024, SafeCoder (instruction tuning)
  23. Li et al. 2024, CoSec (co-decoding)
  24. Li et al. 2024, fine-tuning for secure code
  25. Li et al. 2024, fine-tuning, exploratory study
  26. Tony et al. 2024, prompting techniques (SLR)
  27. Bruni et al. 2025, prompt-engineering benchmark
  28. Wang et al. 2026, SecPI (reasoning internalization)
  29. Shukla et al. 2025, security degradation under iteration
  30. Dora et al. 2025, hidden risks in LLM web-app code
  31. Yan et al. 2025, guiding LLMs to fix their own flaws
  32. Google's new code ~75% AI-generated, Pichai, Google Cloud Next 2026
  33. 36M new GitHub developers, GitHub Octoverse 2025
  34. B. Cherny, Head of Claude Code at Anthropic, public remarks