Justin Lin and collaborators at Stanford, CMU, and Gray Swan AI developed ARTEMIS, a penetration-testing agent, and evaluated it in a real-world environment.
Evaluation Methodology
- Complex live infrastructure: ~8,000 hosts across 12 subnets, including Unix-based systems, IoT devices, and Windows machines, testing the agent's ability to navigate a realistic scope and filter through noise.
- Hybrid human–AI competition: ARTEMIS, in single- and multi-agent configurations, was evaluated against six other AI agents and ten OSCP-certified red teamers.
- Unified scoring metric: Credited vulnerability detection, weighted findings by criticality and exploit complexity, and penalized findings that were not actually exploited.
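A scoring metric like the one described above can be sketched as a small function. The weights, field names, and penalty factor below are illustrative assumptions, not the paper's actual formula:

```python
from dataclasses import dataclass

# Assumed criticality weights for illustration only.
CRITICALITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}

@dataclass
class Finding:
    criticality: str   # "low" | "medium" | "high" | "critical"
    complexity: float  # exploit-complexity multiplier, e.g. 1.0-3.0
    exploited: bool    # whether the exploit was actually demonstrated

def score(findings, unexploited_penalty=0.5):
    """Sum criticality-weighted, complexity-scaled credit per finding;
    findings that were reported but never exploited subtract a fraction
    of their would-be credit instead."""
    total = 0.0
    for f in findings:
        base = CRITICALITY_WEIGHT[f.criticality] * f.complexity
        total += base if f.exploited else -unexploited_penalty * base
    return total
```

For example, one exploited critical finding (complexity 2.0) and one unexploited low finding would net `16.0 - 0.5 = 15.5` under these assumed weights.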
Key "Aha" Moments
- Claude vs. GPT: Claude Sonnet 4 demonstrated superior security knowledge compared to GPT-5.
- The Power of Scaffolding: The ARTEMIS scaffold improved execution flow and planning without hurting performance on simpler tasks.
- Workflow Alignment: Both AI and human participants followed a similar "scan → target → probe → exploit → repeat" workflow.
- Parallelism over Intuition: AI compensated for a lack of human-level intuition by probing multiple targets in parallel.
- Recall over Precision: The AI agents submitted more false positives than the human participants, likely because the scoring did not penalize false positives and the agents had limited ability to validate their findings.
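The shared "scan → target → probe → exploit → repeat" workflow, with the AI's parallel probing, can be sketched as a loop. All helper functions here are hypothetical stand-ins for the agent's real tooling, not ARTEMIS internals:

```python
from concurrent.futures import ThreadPoolExecutor

def scan(network):
    """Discover live hosts (stubbed for illustration)."""
    return [f"10.0.0.{i}" for i in range(1, 4)]

def probe(host):
    """Probe one host for candidate vulnerabilities (stubbed)."""
    return {"host": host, "vulnerable": host.endswith("2")}

def exploit(finding):
    """Attempt exploitation of a confirmed candidate (stubbed)."""
    return f"exploited {finding['host']}"

def pentest_loop(network, max_rounds=1):
    results = []
    for _ in range(max_rounds):
        hosts = scan(network)                  # scan
        with ThreadPoolExecutor() as pool:     # probe many targets in parallel
            findings = list(pool.map(probe, hosts))
        for f in findings:                     # exploit confirmed candidates
            if f["vulnerable"]:
                results.append(exploit(f))
    return results
```

The thread pool is the key difference from a human workflow: where a red teamer uses intuition to pick one promising target, the agent fans probes out across many hosts at once.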