Justin Lin and collaborators at Stanford, CMU, and Gray Swan AI developed ARTEMIS, a penetration-testing agent, and evaluated it in a real-world environment.
Evaluation Methodology
- Complex live infrastructure: ~8,000 hosts across 12 subnets, including Unix-based systems, IoT devices, and Windows machines, testing the agent's ability to navigate a realistic scope and filter through noise.
- Hybrid human–AI competition: ARTEMIS, in single- and multi-agent configurations, was evaluated against six other AI agents and ten OSCP-certified red teamers.
- Unified scoring metric: Credited vulnerability detection, weighted findings by criticality and exploit complexity, and penalized findings that were not actually exploited.
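A scoring metric like the one described above can be sketched as a small function. The weights, field names, and penalty factor below are illustrative assumptions, not the paper's actual formula:

```python
from dataclasses import dataclass

# Assumed criticality weights for illustration only.
CRITICALITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}

@dataclass
class Finding:
    criticality: str   # "low" | "medium" | "high" | "critical"
    complexity: float  # exploit-complexity multiplier, e.g. 1.0-3.0
    exploited: bool    # whether the exploit was actually demonstrated

def score(findings, unexploited_penalty=0.5):
    """Sum criticality-weighted, complexity-scaled credit per finding;
    findings that were reported but never exploited subtract a fraction
    of their would-be credit instead."""
    total = 0.0
    for f in findings:
        base = CRITICALITY_WEIGHT[f.criticality] * f.complexity
        total += base if f.exploited else -unexploited_penalty * base
    return total
```

For example, one exploited critical finding (complexity 2.0) and one unexploited low finding would net `16.0 - 0.5 = 15.5` under these assumed weights.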
Key "Aha" Moments
- Claude vs. GPT: Claude Sonnet 4 demonstrated superior security knowledge compared to GPT-5.
- The Power of Scaffolding: The ARTEMIS scaffold improved execution flow and planning without hurting performance on simpler tasks.
- Workflow Alignment: Both AI and human participants followed a similar "scan → target → probe → exploit → repeat" workflow.
- Parallelism over Intuition: AI compensated for a lack of human-level intuition by probing multiple targets in parallel.
- Recall over Precision: The AI agents submitted more false positives than the human participants, likely because the scoring did not penalize false positives and the agents had limited ability to validate their findings.
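The shared "scan → target → probe → exploit → repeat" workflow, with the AI's parallel probing, can be sketched as a loop. All helper functions here are hypothetical stand-ins for the agent's real tooling, not ARTEMIS internals:

```python
from concurrent.futures import ThreadPoolExecutor

def scan(network):
    """Discover live hosts (stubbed for illustration)."""
    return [f"10.0.0.{i}" for i in range(1, 4)]

def probe(host):
    """Probe one host for candidate vulnerabilities (stubbed)."""
    return {"host": host, "vulnerable": host.endswith("2")}

def exploit(finding):
    """Attempt exploitation of a confirmed candidate (stubbed)."""
    return f"exploited {finding['host']}"

def pentest_loop(network, max_rounds=1):
    results = []
    for _ in range(max_rounds):
        hosts = scan(network)                  # scan
        with ThreadPoolExecutor() as pool:     # probe many targets in parallel
            findings = list(pool.map(probe, hosts))
        for f in findings:                     # exploit confirmed candidates
            if f["vulnerable"]:
                results.append(exploit(f))
    return results
```

The thread pool is the key difference from a human workflow: where a red teamer uses intuition to pick one promising target, the agent fans probes out across many hosts at once.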