A small on-prem AI defender stopped an Opus 4.6 attack

TL;DR Researchers ran AI vs AI on real cyber ranges. With an LLM defender in the loop, attacker success dropped from 41-100% to 0-55%.

Every AI red-team benchmark we wrote about in the last six months ran the same setup. A frontier model broke into a system that wasn't actively protected. No SIEM, no EDR, no alerting. That is how Claude Mythos was able to complete all 32 steps on AISI's benchmark.

A team from Alias Robotics, the University of Naples Federico II, and CYBER RANGES deployed LLM-driven defender agents alongside LLM-driven Advanced Persistent Threat agents across three tiers of infrastructure: Hack The Box PRO Labs, MHBench (8 scenarios, 6 to 30 hosts), and military-grade CYBER RANGES exercises (~15 to 22 hosts).

Figure 1: Add an LLM defender, attacker success drops from 41-100% to 0-55%.
Figure 1: Add an LLM defender, attacker success drops from 41-100% to 0-55%.

Highlights:

  • The attacker was Claude Opus 4.6, the only model capable of performing multi-step intrusions. Defenders ran Opus 4.6 or the on-premise alias2-mini.
  • Adding a defender LLM dropped attacker success from 41-100% to 0-55%. When the defender LLM ran on every host, the attacker couldn't compromise a network.
  • The small alias2-mini matched the frontier Opus 4.6 on defense, even under prompts written for Opus.
  • The defender's worst failure was forgetting to harden its own monitoring stack: the attacker walked through Wazuh, Velociraptor, and Elasticsearch using their default admin passwords.
  • The attacker agents did unexpected things to achieve their objectives. They attacked the experiment infrastructure, Googled "capture the flag" answers on the internet, and read the defender's prompts off shared hosts.
Table 2: Per-machine and Hostmanager defenders zeroed all flag captures.
Table 2: Per-machine and Hostmanager defenders zeroed all flag captures.

My take:

  1. Defender agents reduced attacker success to 0-55%. Good reality check against the doomer take that Mythos will hack everything. The current benchmarks including AISI's 32-step takeover assume no active detection or response to an intrusion. The same is true of properly configured and patched systems. In the attack on the Mexican government a properly patched Windows domain wasn't compromised.
  2. Defenders are starting to realize that it's expensive to run frontier models as subsidies are going away. Latency is another problem. Small specialized models for both red and blue teams are a path forward.
  3. One of the biggest opportunities for small security-focused models is autonomous systems that need to run security on-device. Space security comes to mind as one application.
  4. Defender agents have the same blind spot as human blue teams. They treat security infrastructure as tools and forget that it's probably the most sensitive attack surface.
  5. Instrumental convergence is real. AI agents do whatever it takes to achieve an objective. Don't rely on system prompts. Implement symbolic guardrails instead.

Sources:

  1. Dynamic Cyber Ranges (Mayoral-Vilches et al., arXiv, April 2026)
  2. CAI: Cybersecurity AI scaffold (Alias Robotics)
  3. UK AISI evaluation of Claude Mythos Preview's cyber capabilities