AI red-teaming agent outperformed 90% of human participants with an 82% valid submission rate, costing $59/hour

Justin Lin and the team at Stanford, CMU, and Gray Swan AI developed the penetration testing agent ARTEMIS and evaluated it in a real-world environment.

Evaluation Methodology:

  1. Complex live infrastructure: ~8,000 hosts across 12 subnets, including Unix-based systems, IoT devices, and Windows machines, to test the agent’s ability to navigate actual scope and filter through noise.
  2. Hybrid human-AI competition: ARTEMIS, in single- and multi-agent configurations, was evaluated against six other AI agents and ten OSCP-certified red teamers.
  3. Unified scoring metric: Accounted for vulnerability detection and exploit complexity, weighted findings by criticality, and penalized findings that were not actually exploited.
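The scoring metric described in point 3 could look something like the sketch below. All weights, field names, and the exact penalty formula here are illustrative assumptions, not details from the paper; it only shows the general shape of a score that rewards criticality and complexity while discounting unexploited findings.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    criticality: float   # severity weight, e.g. CVSS-like 0-10 (assumed)
    complexity: float    # exploit-complexity bonus factor, 0-1 (assumed)
    exploited: bool      # was the vulnerability actually exploited?

def score(findings, unexploited_penalty=0.5):
    """Sum criticality-weighted scores; unexploited findings earn only a fraction.

    The 0.5 penalty factor is a placeholder, not the paper's value.
    """
    total = 0.0
    for f in findings:
        base = f.criticality * (1.0 + f.complexity)
        total += base if f.exploited else base * (1.0 - unexploited_penalty)
    return total
```

Under this shape, a critical exploited finding dominates the score, while an identical but unverified finding contributes only half as much.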

Key findings:

  • Claude Sonnet 4 demonstrated superior security knowledge compared to GPT-5.
  • The ARTEMIS scaffold enhances execution flow and planning without hindering performance on simpler tasks.
  • Both AI and human participants followed a similar "scan, target, probe, exploit, repeat" workflow.
  • AI compensated for a lack of human-level intuition by probing multiple targets in parallel.
  • AI tended to submit more false positives than human participants, likely because the scoring metric did not penalize false positives and the agents had limited ability to validate their own findings.
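The parallel-probing behavior in the findings above can be sketched as a breadth-first loop over hosts. This is not ARTEMIS code; `probe` is a hypothetical placeholder for a real scan or banner grab, and the thread-pool fan-out simply illustrates how an agent can trade intuition for breadth.

```python
from concurrent.futures import ThreadPoolExecutor

def probe(host):
    # Placeholder for a real probe (port scan, service fingerprint, etc.).
    return host, f"probed {host}"

def probe_all(hosts, max_workers=8):
    """Probe many targets concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(probe, hosts))
```

A human tester might follow a hunch on one promising host; the agent instead fans out across the subnet and lets the results decide where to exploit next.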

Sources:

ARTEMIS: AI Red-Teaming Agent