Cloudflare doubles down: models are commodities
TL;DR: Cloudflare built an AI harness to hunt bugs in 128 of their own repos. It surfaced 7,245 findings. They report no recall figure, and a single pass catches only about half of the issues found across many passes, so big discovery numbers don't prove the code is flawless.
Cloudflare continues the model vs harness battle.
They built a model-agnostic harness that scans 128 of their own repositories for bugs. A Hunter agent finds them and a separate Validator agent confirms them. The pipeline narrowed thousands of raw candidates to 7,245 triaged findings.
Highlights:
- All working state lives in a database. The model stays stateless, so an hour-long run can't overwrite the bugs it has already found, and a crash or rate-limit error costs only the task that was running.
- The Hunter must ship a proof-of-concept test against the untouched codebase, plus a working patch. Requiring untouched code stops the Hunter from editing the source to make its exploit land.
- The Hunters compile fragments and run them in an isolated sandbox to crash binaries. Giving them that sandbox produced the single biggest jump in finding quality.
- Tool usage diverged from the design. Semgrep, a wired-in static analyzer, got zero calls in a month. The most-used tool was the wishlist, where an agent logs a resource it is missing; agents wrote to it 25,472 times.
- No recall rate reported. No codebase lists every real bug, so any figure would be guesswork. A single pass finds about half the bugs caught across many passes, so they run it repeatedly.
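To make the "state lives in a database" point concrete, here is a minimal sketch of the idea, not Cloudflare's implementation. The table layout, the function names (`init_db`, `add_task`, `run_pending`) and the `flaky_hunt` stand-in for a model call are all assumptions made up for illustration. Each task commits its findings in its own transaction, so a failure loses only the task that was running.

```python
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per unit of work, one row per finding.
    conn.executescript("""
        CREATE TABLE tasks (
            id INTEGER PRIMARY KEY,
            repo TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending'
        );
        CREATE TABLE findings (
            id INTEGER PRIMARY KEY,
            task_id INTEGER NOT NULL,
            summary TEXT NOT NULL
        );
    """)


def add_task(conn, repo):
    conn.execute("INSERT INTO tasks (repo) VALUES (?)", (repo,))
    conn.commit()


def run_pending(conn, hunt):
    """Run every pending task once. `hunt(repo)` returns a list of summaries."""
    pending = conn.execute(
        "SELECT id, repo FROM tasks WHERE status = 'pending'"
    ).fetchall()
    for task_id, repo in pending:
        try:
            summaries = hunt(repo)  # the stateless, possibly failing step
            with conn:  # commits on success, rolls back on error
                conn.executemany(
                    "INSERT INTO findings (task_id, summary) VALUES (?, ?)",
                    [(task_id, s) for s in summaries],
                )
                conn.execute(
                    "UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,)
                )
        except Exception:
            # The task stays 'pending' and can be retried later;
            # findings from other tasks are already committed.
            continue


def flaky_hunt(repo):
    # Stand-in for a model call that hits a rate limit on one repo.
    if repo == "repo-b":
        raise RuntimeError("rate limited")
    return [f"possible bug in {repo}"]


conn = sqlite3.connect(":memory:")
init_db(conn)
for name in ("repo-a", "repo-b", "repo-c"):
    add_task(conn, name)
run_pending(conn, flaky_hunt)
```

After this run, the findings for `repo-a` and `repo-c` are stored and only `repo-b` is still pending, so a second call to `run_pending` retries just that task.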
My take:
- The "discovery is a solved problem" claim is exaggerated. The discovery gains AI gives us are impressive, but finding all vulnerabilities reliably and completely is still unsolved.
- Cloudflare is pushing hard to prove that the model is a swappable commodity to counter Anthropic's cybersecurity domination strategy.
- At the same time their own data shows that task success heavily depends on which model is hunting, with results ranging from 9% to 20%.
- The model doesn't recognize the value of pure SAST. The agent never called the Semgrep tool and preferred to request VMs and build environments to prove a finding: "I need a FreeBSD VM to confirm this PoC end-to-end."