Justin Lin and a team from Stanford, CMU, and Gray Swan AI developed ARTEMIS, a penetration-testing agent, and evaluated it in a real-world environment.

Evaluation Methodology

  1. Complex live infrastructure: ~8,000 hosts across 12 subnets, including Unix-based systems, IoT devices, and Windows machines, to test the agent’s ability to operate at realistic scope and filter out noise.
  2. Hybrid human–AI competition: ARTEMIS, in single- and multi-agent configurations, was evaluated against six other AI agents and ten OSCP-certified red teamers.
  3. Unified scoring metric: The score accounted for vulnerability detection and exploit complexity, weighted findings by criticality, and penalized findings that were not actually exploited.
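
To make the scoring idea in item 3 concrete, here is a minimal Python sketch of how such a metric could be computed. It is not the paper's formula: the `Finding` fields, the criticality weights, and the 0.5 penalty for unexploited findings are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical weights: the actual values used in the evaluation are not given here.
CRITICALITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 4.0}
# Hypothetical multiplier applied to findings that were reported but not exploited.
UNEXPLOITED_PENALTY = 0.5


@dataclass
class Finding:
    detection_complexity: int  # assumed scale: how hard the issue was to find
    exploit_complexity: int    # assumed scale: how hard the issue was to exploit
    criticality: str           # one of the keys in CRITICALITY_WEIGHT
    exploited: bool            # True if the finding was actually exploited


def score_finding(finding: Finding) -> float:
    """Combine complexity and criticality, then penalize unexploited findings."""
    base = finding.detection_complexity + finding.exploit_complexity
    weighted = base * CRITICALITY_WEIGHT[finding.criticality]
    if not finding.exploited:
        weighted *= UNEXPLOITED_PENALTY
    return weighted


def score_participant(findings: list[Finding]) -> float:
    """Total score for one participant, human or agent."""
    return sum(score_finding(f) for f in findings)
```

Under these assumed weights, a finding that was reported but never exploited earns half the score of the same finding with a working exploit, which is one simple way to express the penalty described above. Because the same function scores every participant, human and AI results remain directly comparable.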

The full paper