AI red-teaming agent outperformed 90% of human participants with an 82% valid submission rate, costing $59/hour

Justin Lin and the team at Stanford, CMU, and Gray Swan AI developed the penetration testing agent ARTEMIS and evaluated it in a real-world environment.

Evaluation Methodology:

  1. Complex live infrastructure: ~8,000 hosts across 12 subnets, including Unix-based systems, IoT devices, and Windows machines, to test the agent’s ability to navigate actual scope and filter through noise.
  2. Hybrid human-AI competition: ARTEMIS, in single- and multi-agent configurations, was evaluated against six other AI agents and ten OSCP-certified red teamers.
  3. Unified scoring metric: Accounted for vulnerability detection and exploit complexity, weighted findings by criticality, and penalized findings that were not actually exploited.
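The scoring metric described in point 3 could look something like the sketch below. All weights, field names, and the exact penalty formula here are illustrative assumptions, not details from the paper; it only shows the general shape of a score that rewards criticality and complexity while discounting unexploited findings.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    criticality: float   # severity weight, e.g. CVSS-like 0-10 (assumed)
    complexity: float    # exploit-complexity bonus factor, 0-1 (assumed)
    exploited: bool      # was the vulnerability actually exploited?

def score(findings, unexploited_penalty=0.5):
    """Sum criticality-weighted scores; unexploited findings earn only a fraction.

    The 0.5 penalty factor is a placeholder, not the paper's value.
    """
    total = 0.0
    for f in findings:
        base = f.criticality * (1.0 + f.complexity)
        total += base if f.exploited else base * (1.0 - unexploited_penalty)
    return total
```

Under this shape, a critical exploited finding dominates the score, while an identical but unverified finding contributes only half as much.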

Key findings:

  • Claude Sonnet 4 demonstrated superior security knowledge compared to GPT-5.
  • The ARTEMIS scaffold enhances execution flow and planning without hindering performance on simpler tasks.
  • Both AI and human participants followed a similar "scan, target, probe, exploit, repeat" workflow.
  • AI compensated for a lack of human-level intuition by probing multiple targets in parallel.
  • AI tended to submit more false positives than human participants, likely because the scoring metric did not penalize false positives and the agents had limited ability to validate their own findings.
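The parallel-probing behavior in the findings above can be sketched as a breadth-first loop over hosts. This is not ARTEMIS code; `probe` is a hypothetical placeholder for a real scan or banner grab, and the thread-pool fan-out simply illustrates how an agent can trade intuition for breadth.

```python
from concurrent.futures import ThreadPoolExecutor

def probe(host):
    # Placeholder for a real probe (port scan, service fingerprint, etc.).
    return host, f"probed {host}"

def probe_all(hosts, max_workers=8):
    """Probe many targets concurrently instead of one at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(probe, hosts))
```

A human tester might follow a hunch on one promising host; the agent instead fans out across the subnet and lets the results decide where to exploit next.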

Sources:

ARTEMIS: AI Red-Teaming Agent