A detection engineer reads a threat report about a new Kubernetes attack technique. They have to map it to MITRE ATT&CK, figure out which log tables to query, write a KQL query, test it against real telemetry, and refine it when it returns nothing useful. Then they turn it into a Sigma rule that catches the attack without flooding analysts with false positives.
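That test-and-refine loop is the core of the workflow. A minimal sketch of it, in Python, might look like the following; everything here is illustrative (the `KubeAudit` table, the `run_kql`/`refine` helpers, and the refinement rule are hypothetical stand-ins, not CTI-REALM's harness or any SIEM API):

```python
# Illustrative sketch of the iterate-and-refine loop described above.
# All names (run_kql, refine, the KubeAudit table) are hypothetical.

def run_kql(query: str) -> list[dict]:
    """Stand-in for executing a KQL query against a log workspace."""
    # In a real SOC this would call the SIEM's query API.
    return []  # pretend the first draft returns nothing useful

def refine(query: str) -> str:
    """Stand-in for loosening an over-narrow filter."""
    return query.replace('== "kubectl exec"', 'has "exec"')

def iterate(query: str, max_rounds: int = 3) -> tuple[str, list[dict]]:
    """Run, inspect results, and refine until the query returns hits or we give up."""
    for _ in range(max_rounds):
        hits = run_kql(query)
        if hits:  # the query surfaces the simulated attack
            return query, hits
        query = refine(query)
    return query, []

draft = 'KubeAudit | where Verb == "kubectl exec"'
final_query, hits = iterate(draft)
```

The point of the sketch is that each round depends on inspecting real query output, which is exactly the step the benchmark scores as query iteration.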
That workflow is repetitive and tool-heavy. Vendors increasingly claim their AI can automate it, but the industry has little quantitative evidence to back those claims. Red-teaming benchmarks had a head start; in detection engineering and broader SOC work, the benchmark landscape is still thin.
Microsoft just published "CTI-REALM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation Capabilities." The paper tests 16 model configurations on 50 tasks built from real attack simulations on Azure infrastructure.
Highlights:
- CTI-REALM evaluates the full detection engineering loop inside a controlled environment where agents can retrieve threat reports, map MITRE techniques, inspect schemas, run KQL queries, and submit both a Sigma rule and a working KQL query.
- The benchmark has two versions: CTI-REALM-25 for quick testing and CTI-REALM-50 for full evaluation. The 50 tasks are derived from 37 recreated attacks covering Linux endpoints, Azure Kubernetes Service (AKS), and Azure cloud activity, all based on public threat reports and detection references.
- Scores come from 5 checkpoints, but final detection quality carries 65% of the total.
- On CTI-REALM-50, Claude Opus 4.6 (High) ranked first at 0.64, Claude Opus 4.5 second at 0.62, and Claude Sonnet 4.5 third at 0.59. The best GPT-5 variant reached 0.57. OpenAI's o4-mini came last at 0.36.
- The biggest performance gap showed up in query iteration. Claude models scored 0.86-0.92 on this checkpoint, while most GPT-5 variants were below 0.50 and GPT-4.1 was at 0.02.
- More reasoning did not help GPT-5 in this benchmark. Medium reasoning outperformed high reasoning across GPT-5, GPT-5.1, and GPT-5.2.
- Cloud detection was by far the hardest category. The best model scored just 0.28 on cloud tasks, versus 0.59 for Linux and 0.52 for AKS. All 8 cloud tasks required correlating multiple log sources, so this is the part of the benchmark that looks most like advanced cloud threat hunting.
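The checkpoint weighting from the highlights can be made concrete with a small sketch. Only the 65% weight on final detection quality comes from the benchmark write-up; the other checkpoint names and their even split of the remaining 35% are assumptions for illustration:

```python
# Hypothetical weighted scoring over 5 checkpoints. Only the 65% weight on
# final detection quality is stated by the benchmark; the remaining checkpoint
# names and their even split are illustrative assumptions.

WEIGHTS = {
    "report_retrieval":  0.0875,  # 35% split evenly across 4 checkpoints (assumed)
    "attack_mapping":    0.0875,
    "schema_inspection": 0.0875,
    "query_iteration":   0.0875,
    "final_detection":   0.65,    # stated: final detection quality carries 65%
}

def total_score(checkpoints: dict[str, float]) -> float:
    """Weighted sum of per-checkpoint scores, each in [0, 1]."""
    return sum(WEIGHTS[k] * v for k, v in checkpoints.items())

# A model that aces the early steps but ships a mediocre final rule still scores low:
example = {
    "report_retrieval":  0.9,
    "attack_mapping":    0.8,
    "schema_inspection": 0.9,
    "query_iteration":   0.9,
    "final_detection":   0.4,
}
print(f"{total_score(example):.2f}")
```

Under these assumed weights, strong intermediate checkpoints cannot compensate for a weak final rule, which is consistent with how heavily the benchmark rewards the end result.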
My take:
- AI-for-SOC benchmarks matter to buyers because running a real SOC proof of concept is expensive. Secure coding and exploitation already have benchmarks like SecRepoBench, CyberSecEval, BountyBench, and Cybench. On the SOC side, CTI-REALM joins ExCyTIn-Bench and CyberSOCEval, but the stack is still much thinner, especially for realistic detection engineering.
- The query-iteration gap is the real story inside this benchmark. CTI-REALM mostly rewards the final detection rule, but the biggest separation between models showed up earlier, in the iterate-and-refine loop. That matters because query iteration is how detection engineers actually validate and improve detection logic. A model that cannot self-correct on KQL output is not useful in a real SOC.
- AI does not look impressive on cloud-heavy SOC work. Cloud tasks required correlating multiple log sources, and no model cracked 0.30. I expect specialized models like Sec-Gemini to be significantly more cloud-grounded. Side note: I still feel Gemini is shaky at using the gcloud CLI and often confused about the differences between Vertex AI and AI Studio.
- More reasoning effort doesn't help here. My read is that in this workflow, fast query-test-correct loops matter more than longer internal deliberation.
- Preloaded workflow guidance doubled GPT-5-Mini's ATT&CK-mapping score, from 0.210 to 0.420, but did not improve query iteration at all. My read is that memory and RAG can help with context, but they do not fix weak detection-engineering capability.
Sources:
- CTI-REALM: Benchmark to Evaluate Agent Performance on Security Detection Rule Generation Capabilities
- ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
- Microsoft raises the bar: A smarter way to measure AI for cybersecurity
- Models get better on real SOC tasks: Opus 4.5 scored ~0.60 and GPT-5.1 scored ~0.58
- Why is it almost impossible to find enterprise software benchmarks?
- Anthropic reveals its cybersecurity domination strategy
- CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
- Google Launches Sec-Gemini v1
- Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models
- SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories
- BountyBench: A Real-World Cybersecurity Benchmark for Detect, Exploit, and Patch
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models