Cloudflare doubles down: models are commodities

TL;DR: Cloudflare built an AI harness to hunt bugs in 128 of its own repositories and surfaced 7,245 findings. It reports no recall rate, and a single pass catches only about half of the issues found across many passes, so big discovery numbers don't prove the code is flawless.

Cloudflare continues the model vs harness battle.

They built a model-agnostic harness that scans 128 of their own repositories for bugs. A Hunter agent finds them and a separate Validator agent confirms them. The pipeline narrowed thousands of raw candidates to 7,245 triaged findings.

Highlights:

  • All working state lives in a database and the model stays stateless, so an hour-long run can't overwrite bugs already found. A crash or rate-limit error costs only the task that was running.
  • The Hunter must ship a proof-of-concept test that runs against the untouched codebase, plus a working patch. Requiring untouched code stops the Hunter from editing the source to make its exploit land.
  • The Hunters compile fragments and run them in an isolated sandbox to crash binaries. Giving them that sandbox produced the single biggest jump in finding quality.
  • Tool usage diverged from the design. Semgrep, a wired-in static analyzer, got zero calls in a month. The most-used tool was the wishlist, where an agent logs a resource it is missing: it was written to 25,472 times.
  • No recall rate is reported. No codebase comes with a list of every real bug, so any figure would be guesswork. A single pass finds about half the bugs caught across many passes, so they run it repeatedly.
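
To make the first highlight concrete, here is a minimal sketch of database-backed task state. It is not Cloudflare's implementation: the SQLite store, the tasks and findings tables, and the open_store, add_task and run_pending helpers are all hypothetical names chosen for illustration. The point it shows is that findings are committed per task, so a failed task stays pending and nothing found earlier is lost.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a hypothetical state store; schema is an assumption, not Cloudflare's."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, target TEXT UNIQUE, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS findings ("
        "id INTEGER PRIMARY KEY, "
        "task_id INTEGER NOT NULL REFERENCES tasks(id), "
        "summary TEXT NOT NULL)"
    )
    db.commit()
    return db


def add_task(db, target):
    """Queue a target once; re-adding the same target is a no-op."""
    db.execute("INSERT OR IGNORE INTO tasks (target) VALUES (?)", (target,))
    db.commit()


def run_pending(db, hunt):
    """Run every pending task. hunt(target) stands in for the stateless model
    call and returns a list of finding summaries."""
    rows = db.execute(
        "SELECT id, target FROM tasks WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for task_id, target in rows:
        try:
            summaries = hunt(target)
        except Exception:
            # Crash or rate limit: the task stays pending for the next run,
            # and findings from earlier tasks are already committed.
            continue
        with db:  # one transaction: store findings and mark the task done
            db.executemany(
                "INSERT INTO findings (task_id, summary) VALUES (?, ?)",
                [(task_id, s) for s in summaries],
            )
            db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```

Calling run_pending again after a failure picks up only the tasks still marked pending, which is also how a repeated-pass setup could re-queue work without touching stored findings.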

My take:

  1. The "discovery is a solved problem" claim is exaggerated. The discovery gains from AI are impressive, but reliably and completely finding every vulnerability is still unsolved.
  2. Cloudflare is pushing hard to prove that the model is a swappable commodity to counter Anthropic's cybersecurity domination strategy.
  3. At the same time their own data shows that task success heavily depends on which model is hunting, with results ranging from 9% to 20%.
  4. The model doesn't recognize the value of pure SAST. The agent never called the Semgrep tool and preferred to request VMs and build environments to prove a finding: "I need a FreeBSD VM to confirm this PoC end-to-end."

Sources:

  1. Build your own vulnerability harness
  2. Project Glasswing: frontier security models pointed at an enterprise codebase