379 zero-days from an orchestrated pipeline that beat unconstrained Claude Code by 30x
Researchers at UC Santa Barbara gave Claude Code full access to ten open-source C/C++ codebases. Shell access. Unlimited turns. The task: find memory-safety vulnerabilities. It found 12.
Then they built SAILOR, a three-phase pipeline. CodeQL scans the source with 34 memory-safety queries and generates vulnerability specifications. An LLM iteratively synthesizes symbolic execution harnesses, refining them against compiler and KLEE feedback. AddressSanitizer replays witness inputs against the unmodified project source. Only crashes on real code count.
Same ten projects. Frontier LLMs on both sides. 379 previously unknown vulnerabilities.
In my Glasswing coverage, I tracked Anthropic's push into autonomous vulnerability discovery, from Opus 4.6 finding 500+ vulnerabilities to Mythos Preview finding bugs in every major OS and browser. SAILOR inverts the approach. The pipeline constrains the LLM to one job: synthesizing harnesses. CodeQL handles targeting. KLEE handles proof. That combination found 30x more bugs than the agent with unlimited freedom.
Highlights:
- SAILOR discovered 379 unique, previously unknown memory-safety vulnerabilities across 10 open-source C/C++ projects totaling 6.8M LOC: 421 confirmed crashes, deduplicated by file, function, and line. Affected projects include binutils, libxml2, libpng, and SELinux. 68% of confirmed crashes were heap-buffer-overflows, and 251 (60%) were independently reproducible via fuzzing.
- The agentic baseline used Opus 4.6 via Claude Code with full codebase access, shell, and unlimited turns. Of the 425 inputs it reported as crashing across the 10 projects, only 105 actually triggered crashes. After deduplication and validation against unmodified source, 12 unique vulnerabilities survived. SAILOR, using GPT-5 for harness synthesis, found 379.
- Every pipeline phase is necessary, and none suffices alone. Removing CodeQL targeting drops results from 379 to 31. The iterative compile-execute-refine loop raises the harness compile rate from 19% to 44%, averaging 8.4 KLEE runs per specification; remove the loop and confirmed vulnerabilities drop to zero. Without symbolic execution, no approach exceeds 12.
- Three projects produced zero confirmed vulnerabilities: curl (multi-step session initialization exceeded the 60-turn budget), OpenSSL (an internal type hierarchy too complex for the LLM to produce compiling harnesses against), and SQLite (virtual database state KLEE could not reconstruct for ASan replay).
- Data contamination is an open question. GPT-5 and Opus 4.6 likely saw target project source during training. A DeepSeek-V3 cross-check on libtiff confirmed 86% (12 of 14) of the stronger model's unique findings, suggesting the pipeline partially compensates for model differences. Results on code written after training cutoff remain untested.
My take:
- SAILOR can supercharge frontier models like Anthropic's Mythos Preview, making them even more effective at vulnerability discovery.
- SAILOR offers a systematic sweep: 87,385 specs across ten projects. Run it first. Harvest the easy bugs. Hand Mythos the failures with context: which harnesses didn't compile, which turn budgets were exceeded, which constraints KLEE couldn't solve. "Find bugs in this program" becomes "here's where the automated pass got stuck and why."
- Attack surface intelligence. CodeQL maps data flows from user input to memory operations across the whole project before Mythos reads a single line. KLEE solves path constraints: the exact conditions needed to reach a vulnerable function past a series of checks. Mythos reasons about what to exploit; KLEE solves the math to get there.
- Machine-verifiable validation. Mythos confirms bugs by asking a second LLM. SAILOR adds two proof layers: CodeQL flags candidate data-flow paths, KLEE generates concrete triggering inputs. Where KLEE can reconstruct program state, symbolic proof beats LLM opinion. After a patch ships, the KLEE harness becomes a regression test.
- Cost gating at scale. Many compilable harnesses double as fuzzing targets. Run SAILOR across hundreds of projects and you build a statistical model of which CodeQL patterns confirm at 15% versus 0.1%. That model tells Mythos where to spend its expensive compute.
Sources:
- Guiding Symbolic Execution with Static Analysis and LLMs for Vulnerability Discovery (Shafiuzzaman, Desai, Guo, Bultan, UC Santa Barbara, 2026)
- Claude Mythos Preview (Anthropic, 2026)