AISI's Evaluation of OpenAI's GPT-5.5 Cyber Capabilities

TL;DR: AISI evaluated GPT-5.5 across 95 narrow cyber tasks and two cyber range simulations. The model hit 71.4% on expert-tier CTFs and is the second model after Anthropic's Mythos Preview to complete the 32-step end-to-end intrusion in The Last Ones simulation.

What is the biggest difference between GPT-5.5 and Mythos on cybersecurity tasks?

GPT-5.5 is actually available and has inference capacity behind it.

In AISI's testing, GPT-5.5 became the second model to complete the full 32-step end-to-end intrusion.

Completed steps on The Last Ones cyber range per token budget. GPT-5.5 and Mythos Preview lead, reaching M6 at 100M tokens.

Highlights:

Test set: 95 narrow cyber tasks across four difficulty tiers plus two cyber range simulations. Coverage: vulnerability research, reverse engineering, exploitation, cryptography, and real-world attack chains.
Expert-tier CTFs at a 50M token budget: GPT-5.5 hit 71.4% (±8.0%), ahead of Mythos Preview at 68.6%, GPT-5.4 at 52.4%, and Opus 4.7 at 48.6%.
Practitioner-tier trend: success rate climbed from roughly 40% in August 2025 to nearly 99% by May 2026 on a fixed 50M token budget. That is nine months.
The Last Ones, a 32-step intrusion across four subnets and roughly 20 hosts, was solved end-to-end by GPT-5.5 in 2 of 10 attempts. A human expert needs about 20 hours. GPT-5.5 is the second model to complete it, after Mythos Preview.
Rust VM reverse-engineering challenge: GPT-5.5 recovered the VM instruction set, built a disassembler, analyzed auth logic, and solved the constraint problem in 10 minutes 22 seconds for $1.73. A human expert needs about 12 hours.
Six hours of red-teaming surfaced a universal jailbreak that worked across malicious cyber queries. OpenAI shipped a fix, but AISI could not verify it before publication due to configuration issues.

Advanced CTF performance at 50M tokens. Practitioner success climbed from ~40% in Aug 2025 to ~99% by May 2026.

My take:

Declaring Anthropic the winner in cyber was probably premature. GPT-5.5 is very capable and, most importantly, it's available. OpenAI's investments in capacity are paying off.
The AISI benchmarks are not real-world environments: they have no active defenses. Results against properly hardened targets will likely be less impressive. See my earlier post about an attack on the Mexican government.
We're approaching a new equilibrium in cyber, where AI is leveling up both offensive and defensive capabilities. But on the path to that equilibrium, the benefits will not be evenly distributed. Companies in the real sector will see more downside from AI-equipped adversaries, because they have less control over their software stack.
The cost of managing cyber risk will keep going up, and companies will face a spike in the investment needed to become more resilient to AI-enabled attacks.

And, finally, yes, Opus 4.7 seems to be less smart than 4.6.

Sources:

Our evaluation of OpenAI's GPT-5.5 cyber capabilities (AISI, May 2026)