Microsoft tested if AI can replace detection engineers

Anthropic, OpenAI, Cursor, and Microsoft all shipped AI-powered security tools in the past two months. The cybersecurity industry keeps panicking about being vibe-coded out of business. But how much of the detection workflow can AI actually handle, from threat description to production-grade detection?

Fatih Bulut, Carlo DePaolis, Raghav Batta, and Anjali Mangal at Microsoft published "AVDA: Autonomous Vibe Detection Authoring for Cybersecurity," accepted to FSE Companion 2026. They adapted Andrej Karpathy's "vibe coding" concept to detection engineering: describe a threat in natural language, let AI generate the detection logic.

Highlights:

  • Input is a structured detection spec pointing to a MITRE ATT&CK technique, platform, and language; three workflows (zero-shot, RAG-based sequential, and agentic with tool access) generate detection code from that description.
  • 92 production detections across 5 platforms and three languages (KQL, PySpark, Scala), tested with 11 models and 3 workflows, generating 5,796 artifacts. Expert validation on 22 of the 92 detections: Spearman rho = 0.64, p < 0.002.
  • 10 binary quality criteria: TTP matching 99.4%, syntax validity 95.9%, library usage 61.9%, data source correct 31.7%, logic equivalence 18.4%, schema accuracy 17.6%, exclusion parity 8.9%.
  • Agentic workflows scored 19% above the zero-shot baseline (0.447 vs 0.375 mean similarity). Sequential achieved 87% of agentic quality at 40x lower token cost.
  • All top-5 configurations are agentic. GPT-5 at medium reasoning leads at 0.578, and medium reasoning achieves 94% of high-tier quality at lower cost.
  • Only OpenAI models tested, no Claude or Gemini. The LLM judge (GPT-4.1) is also one of the 11 evaluated models, and no detections were executed against real or synthetic telemetry.
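The structured detection spec in the first highlight can be pictured as a small record plus a single-turn prompt. This is a sketch under assumed field names; the paper's actual spec schema and prompt wording are not reproduced here, so everything below is illustrative:

```python
from dataclasses import dataclass

@dataclass
class DetectionSpec:
    """Hypothetical spec shape; field names are my own, not AVDA's schema."""
    mitre_technique: str   # e.g. "T1059.001" (PowerShell)
    platform: str          # target detection platform
    language: str          # "KQL", "PySpark", or "Scala"
    description: str       # natural-language threat description

def zero_shot_prompt(spec: DetectionSpec) -> str:
    """Build a single-turn prompt for the zero-shot workflow (no RAG, no tools)."""
    return (
        f"Write a {spec.language} detection for {spec.platform} covering "
        f"MITRE ATT&CK {spec.mitre_technique}.\n"
        f"Threat description: {spec.description}\n"
        f"Return only the detection code."
    )

spec = DetectionSpec(
    mitre_technique="T1059.001",
    platform="Microsoft Defender XDR",
    language="KQL",
    description="Encoded PowerShell commands spawned by Office processes",
)
print(zero_shot_prompt(spec))
```

The sequential and agentic workflows differ only in what surrounds this call: sequential adds retrieved schema and example context to the prompt, and agentic gives the model tools to fetch that context itself.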
Evaluation criteria pass rates by workflow and platform scope
Top-5 leaderboard by model, reasoning tier, and approach
Best-performing configuration per model over time under agentic workflow
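The pass-rate figure aggregates binary quality checks per criterion. A minimal sketch of that aggregation, using the seven criteria named in the highlights (the paper scores ten; the check results here are illustrative booleans, since real checks require platform schemas and parsers):

```python
# Per-criterion pass rate = fraction of detections passing that binary check.
CRITERIA = [
    "ttp_match", "syntax_valid", "library_usage", "data_source",
    "logic_equivalence", "schema_accuracy", "exclusion_parity",
]

def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Aggregate per-detection boolean check results into pass rates."""
    n = len(results)
    return {c: sum(r[c] for r in results) / n for c in CRITERIA}

# Two illustrative detections: both match the TTP and parse, but only one
# uses the right libraries / schema, mirroring the paper's skewed profile.
results = [
    {"ttp_match": True, "syntax_valid": True, "library_usage": False,
     "data_source": True, "logic_equivalence": False,
     "schema_accuracy": False, "exclusion_parity": False},
    {"ttp_match": True, "syntax_valid": True, "library_usage": True,
     "data_source": False, "logic_equivalence": False,
     "schema_accuracy": True, "exclusion_parity": False},
]
print(pass_rates(results))  # ttp_match → 1.0, library_usage → 0.5, ...
```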

My take:

  1. Testing only OpenAI models diminishes the value of the insights; it reads more like a vendor benchmark than independent research.
  2. The validation method (embedding similarity, an LLM judge, and human review of a 22-detection sample, Spearman rho = 0.64) is open to criticism that undercuts the headline 99.4% rule-to-threat match. Still, the trend is clear: detection rule writing is being commoditized by frontier models.
  3. The missing piece for frontier labs is real telemetry data. That absence is why the largest shortfalls land in schema accuracy and library usage, and why 82% of generated detections diverge from the reference logic.
  4. Start capturing every tuning decision in structured, machine-readable form now. That history is the one asset AI can't generate, and it is the prerequisite for adopting AI in detection engineering.
  5. Frontier labs need access to real, diverse telemetry to improve their models' capabilities here. I expect OpenAI to follow Google's acquisition playbook and buy a cybersecurity vendor soon.
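For take #4, a machine-readable tuning record can be as simple as a dict serialized to JSON. The schema below is my own invention, not from the paper; the point is capturing who changed what, why, and with what evidence, so the history can later ground an AI workflow:

```python
import json
from datetime import datetime, timezone

def tuning_record(rule_id: str, change: str, rationale: str,
                  evidence: list[str], author: str) -> dict:
    """One structured tuning decision; all field names are illustrative."""
    return {
        "rule_id": rule_id,
        "change": change,           # e.g. a diff of the detection query
        "rationale": rationale,     # why the tuning was needed
        "evidence": evidence,       # alert/incident IDs backing the decision
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

rec = tuning_record(
    rule_id="encoded-powershell-office-parent",
    change="excluded signed updater binaries from parent-process match",
    rationale="recurring false positives from vendor auto-updaters",
    evidence=["alert-4711", "alert-4712"],
    author="detection-eng",
)
print(json.dumps(rec, indent=2))
```

Appending records like this to a log per rule yields exactly the kind of retrieval corpus the paper's RAG-based sequential workflow consumes.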

Sources:

  1. AVDA: Autonomous Vibe Detection Authoring for Cybersecurity (Bulut, DePaolis, Batta, Mangal, 2026)
  2. Towards Autonomous Detection Engineering: Embedding-Based Retrieval and MCP Orchestration (ACSAC 2025 Case Study)
  3. Microsoft benchmark for LLM performance on end-to-end SOC tasks (The Weather Report)
  4. OpenAI explains why Codex Security doesn't include SAST (The Weather Report)