95.8% Linux privilege escalation by a 4B model, 100x cheaper than Opus
Autonomous vulnerability discovery is booming. Frontier models can find vulnerabilities in heavily-fuzzed code and generate working exploits at $30 per run. However, two adoption blockers remain: the token cost, and the fact that every run ships your client's kernel versions, sudo configurations, cron jobs, and SSH keys to a frontier lab.
Running locally solves both problems, but local models top out at 30-40% task success on privilege escalation benchmarks, while commercial frontier models score above 80%.
Philipp Normann, Andreas Happe, Jürgen Cito, and Daniel Arp at TU Wien published "Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards." They post-trained Qwen3-4B, a 4B open-weight model, in two stages: supervised fine-tuning on procedurally generated environments, then reinforcement learning with one reward signal. Did you get root? That is the entire reward. PrivEsc-LLM achieves 95.8% success at 20 rounds (each round is one shell command and its output), nearly matching Claude Opus 4.6 at 97.5%, at over 100x lower cost.
Highlights:
- Two-stage pipeline: SFT on 1,000 expert traces from a 398B teacher model, then RL with a binary reward (root or not), shaped with bonuses for speed and penalties for repetition.
- Training covers 10 families of privilege escalation (GTFOBins, password leakage, cron injection, SSH key reuse, and others). The 12 test scenarios are held out from the generators. No data leakage.
- At 20 rounds: SFT takes Qwen3-4B from 42.5% to 80.8%. RL pushes it to 95.8%. Claude Opus 4.6: 97.5%. DeepSeek V3.2: 65.8%. At 60 rounds: 10/10 on 10 of 12 scenarios. Partial failures on Sudo GTFOBins (6/10) and Docker group escape (9/10).
- SFT: 28 minutes on one H100. RL: 29 hours on 4 H100s. One-time training cost: $269, pays for itself after ~440 runs.
- Inference: $0.005 per successful root locally vs. $0.62 via Claude Opus API. 124x cheaper.
- Evaluation: 10 runs per scenario, Wilson 95% CIs. The same group published "Chasing Shadows" at NDSS 2026, cataloging errors across 72 LLM security papers. They built this evaluation to avoid them.
- Limitations: single architecture (Qwen3-4B), single-vulnerability scenarios only, RL needs 4xH100 GPUs, scope limited to Linux privilege escalation.
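The shaped reward from the highlights (binary root signal plus a speed bonus and a repetition penalty) can be sketched as a simple function. The coefficients and signature here are illustrative assumptions, not the paper's actual values:

```python
def shaped_reward(got_root: bool, rounds_used: int, max_rounds: int,
                  repeated_commands: int,
                  speed_bonus: float = 0.5, repeat_penalty: float = 0.05) -> float:
    """Binary verifiable reward (root or not), shaped for RL training.

    Coefficients are illustrative, not the paper's published values.
    """
    reward = 1.0 if got_root else 0.0
    if got_root:
        # Speed bonus: scales with how many rounds were left at root.
        reward += speed_bonus * (max_rounds - rounds_used) / max_rounds
    # Penalize re-issuing identical shell commands.
    reward -= repeat_penalty * repeated_commands
    return reward
```

The point of this shape is that the verifiable part (did you get root?) dominates, while the bonus and penalty only nudge the policy toward shorter, less repetitive trajectories.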
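The break-even figure above follows directly from the per-run numbers; a quick sanity check of the arithmetic:

```python
# Break-even: one-time training cost divided by per-run savings.
training_cost = 269.00   # USD, one-time (SFT + RL)
opus_per_root = 0.62     # USD per successful root via Claude Opus API
local_per_root = 0.005   # USD per successful root locally

savings_per_run = opus_per_root - local_per_root
break_even_runs = training_cost / savings_per_run  # ~437, i.e. the ~440 figure
cost_ratio = opus_per_root / local_per_root        # 124x cheaper per root
```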
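With only 10 runs per scenario, interval choice matters: a naive normal approximation collapses at extremes like 10/10, which is why the authors report Wilson CIs. A minimal sketch of the 95% Wilson score interval:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half
```

Even a perfect 10/10 scenario only gives a lower bound around 72%, a useful reminder of how much uncertainty 10 runs per scenario leaves.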
My take:
- The real value is not cheaper pentesting, but affordable continuous validation at $0.005 per attempt.
- A 95.8% success rate from a small model is achievable only on known vulnerability families: GTFOBins, password leakage, cron injection, etc. It won't find novel paths, chain vulnerabilities, or deal with custom applications.
- Small models can do the execution, but they need an orchestration layer. The target architecture has three tiers: a frontier model as strategic planner (sanitized recon in, attack paths out), RL-trained local agents as executors (run commands, see all sensitive data, never leave the network), and a verification layer that confirms results and feeds state back to the planner. The privacy boundary sits between planner and executors. The planner sees "Ubuntu 22.04, Apache, MySQL" but never passwords or SSH keys. The executor sees everything but stays local.
- Adversaries will train their models too. Your threat model must assume that any vulnerability or misconfiguration will be exploited, fast.
- Think cheap drones vs. MQ-9 Reapers. Claude Opus is the MQ-9: expensive, high capability, creative reasoning, controlled distribution via API keys and safety filters. RL-trained local models are FPV drones: $0.005 per run, good enough for known patterns, available to everyone. Cheap drones did not replace expensive ones. They created a new threat layer underneath, where quantity beats quality and $500 swarms overwhelm defenses designed for $30M threats. Expect the same dynamic in vulnerability exploitation: cheap, open-weight models are a commodity, available to defenders and attackers alike.
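The privacy boundary in that three-tier design comes down to one rule: only coarse, non-sensitive facts cross from executor to planner, and anything that does leave gets redacted. A hypothetical sketch (all names, patterns, and structures here are my illustration, not from the paper):

```python
import re
from dataclasses import dataclass

@dataclass
class Recon:
    os: str
    services: list[str]
    raw_findings: str  # may contain passwords, SSH keys; stays on the local executor

# Illustrative redaction patterns; a real deployment needs a far broader set.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----", re.S),
    re.compile(r"(?i)password\s*[:=]\s*\S+"),
]

def sanitize_for_planner(recon: Recon) -> dict:
    """Only coarse facts ('Ubuntu 22.04, Apache, MySQL') cross the boundary."""
    return {"os": recon.os, "services": recon.services}

def redact(text: str) -> str:
    """Defense in depth: scrub secrets from anything that must leave the network."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

The planner calls would then only ever see `sanitize_for_planner(recon)`, while the RL-trained executor operates on `recon.raw_findings` locally.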