Microsoft tested whether AI can replace detection engineers
Anthropic, OpenAI, Cursor, and Microsoft all shipped AI-powered security tools in the past two months. The cybersecurity industry keeps panicking about being vibe-coded out of business. But how much of the detection workflow can AI actually handle, from threat description to production-grade detection?
Fatih Bulut, Carlo DePaolis, Raghav Batta, and Anjali Mangal at Microsoft published "AVDA: Autonomous Vibe Detection Authoring for Cybersecurity," accepted to FSE Companion 2026. They adapted Andrej Karpathy's "vibe coding" concept to detection engineering: describe a threat in natural language, let AI generate the detection logic.
Highlights:
- Input is a structured detection spec pointing to a MITRE ATT&CK technique, platform, and language; three workflows (zero-shot, RAG-based sequential, and agentic with tool access) generate detection code from that description.
- 92 production detections across 5 platforms, written in three languages (KQL, PySpark, Scala), tested with 11 models and 3 workflows, generating 5,796 artifacts. Expert validation on 22 of the 92 detections: Spearman rho = 0.64, p < 0.002.
- 10 binary quality criteria: TTP matching 99.4%, syntax validity 95.9%, library usage 61.9%, data source correct 31.7%, logic equivalence 18.4%, schema accuracy 17.6%, exclusion parity 8.9%.
- Agentic workflows scored 19% above the zero-shot baseline (0.447 vs 0.375 mean similarity). Sequential achieved 87% of Agentic quality at 40x lower token cost.
- All top-5 configurations are agentic. GPT-5 at medium reasoning leads at 0.578, and medium reasoning achieves 94% of high-tier quality at lower cost.
- Only OpenAI models tested, no Claude or Gemini. The LLM judge (GPT-4.1) is also one of the 11 evaluated models, and no detections were executed against real or synthetic telemetry.
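The "mean similarity" and Spearman numbers above come from comparing generated detections to reference detections in embedding space, then checking how well those automatic scores track expert judgment. The paper's embedding model and judge prompts aren't reproducible here, so this is a minimal sketch with toy vectors, assuming cosine similarity and SciPy's `spearmanr`:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: each row stands in for one embedded detection rule.
rng = np.random.default_rng(42)
reference = rng.normal(size=(5, 8))                           # 5 reference detections
generated = reference + rng.normal(scale=0.5, size=(5, 8))    # noisy "LLM" outputs

auto_scores = [cosine_similarity(r, g) for r, g in zip(reference, generated)]

# Hypothetical expert ratings (1-5) for the same 5 detections.
expert_scores = [4, 2, 5, 3, 3]

# A high rho would mean the cheap automatic metric ranks detections
# roughly the way human reviewers do.
rho, p = spearmanr(auto_scores, expert_scores)
print(f"mean similarity: {np.mean(auto_scores):.3f}")
print(f"Spearman rho vs experts: {rho:.2f} (p={p:.3f})")
```

With real embeddings a rho of 0.64, as reported, means the automatic ranking agrees with experts only moderately, which is why the per-criterion numbers below deserve skepticism.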
My take:
- Testing only OpenAI models limits the generality of the findings; it reads more like a vendor benchmark than independent research.
- The validation method, embedding similarity plus an LLM judge with a human spot check on a sample (Spearman rho of 0.64), is open to criticism and casts doubt on the 99.4% rule-to-threat match. But the trend is clear: detection rule writing is being commoditized by frontier models.
- The missing piece for frontier labs is real telemetry data. That is why the 13% gap between Sequential and Agentic shows up in schema accuracy and library usage, and why 82% of generated detections diverge from the reference logic.
- Start capturing every tuning decision in structured, machine-readable form now. That history is the one asset AI can't generate, and the prerequisite for adopting AI in detection engineering.
- Frontier labs need access to real and diverse telemetry to improve their models. I expect OpenAI to follow Google's playbook and acquire a cybersecurity vendor soon.
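What "structured, machine-readable tuning decisions" could look like in practice: the schema below is my own illustration, not anything from the paper. The idea is that every exclusion or threshold change becomes an append-only record that a RAG or agentic workflow can later retrieve, since the rationale is exactly the context a model can't infer from telemetry alone.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class TuningDecision:
    """One machine-readable tuning record (all field names are illustrative)."""
    detection_id: str         # internal rule identifier
    mitre_technique: str      # e.g. "T1059.001"
    change: str               # what was modified
    rationale: str            # why: the context AI can't reconstruct
    false_positive_source: str
    author: str
    timestamp: str

decision = TuningDecision(
    detection_id="DET-0042",
    mitre_technique="T1059.001",
    change="Excluded hosts in OU=Build-Agents from PowerShell download-cradle rule",
    rationale="CI pipeline legitimately fetches signed install scripts nightly",
    false_positive_source="build-agent telemetry, incident review",
    author="detection-team",
    timestamp=datetime.now(timezone.utc).isoformat(),
)

# Append-only JSONL log: one line per decision, trivially indexable for retrieval.
line = json.dumps(asdict(decision))
print(line)
```

Even a flat JSONL file like this beats tribal knowledge in tickets and chat threads: it is diffable, searchable, and embeddable.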
Sources:
- AVDA: Autonomous Vibe Detection Authoring for Cybersecurity (Bulut, DePaolis, Batta, Mangal, 2026)
- Towards Autonomous Detection Engineering: Embedding-Based Retrieval and MCP Orchestration (ACSAC 2025 Case Study)
- Microsoft benchmark for LLM performance on end-to-end SOC tasks (The Weather Report)
- OpenAI explains why Codex Security doesn't include SAST (The Weather Report)