Same breach data, different LLM password resets
TL;DR On identical breach data, LLMs swing between org-wide and targeted password resets, defaulting to whichever they generate first.
Cisco investigated the use of AI in incident response reporting.
AI-written reports look right and polished, but are full of inconsistencies.
Nate Pors and the Talos IR AI Tiger Team forecast a 50% cut in drafting time for tabletop exercise reports. They tested how ChatGPT, Claude, and Gemini handle the task.
Highlights:
- Inconsistency in research and sourcing. A model pulls from different websites across separate runs, so the underlying data shifts and outcomes stop being repeatable.
- Inconsistency in conclusions. Given identical breach data, one run recommends a full organization-wide password reset and another a targeted reset. The model defaults to whichever recommendation it generates first.
- Inconsistency in output format. Token-by-token generation makes document structure and section layouts fluctuate between runs, breaking the standardized executive summaries and recommendation sections that formal reports require.
- Inconsistency from context drift and pollution. When the context window fills, the model discards earlier information and loses initial instructions. Running multiple unrelated tasks in one session causes 'context pollution' that blends outputs.
My take:
- The format and context drift problems are clearly solvable. Single-task prompts, fixed templates, better grounding in facts. The same boundary appears in Microsoft's test of whether AI can replace detection engineers.
- The harder problem is the best-practice ground truth. Models store statistical patterns in their weights, not discrete facts. When AI pentesters lacked ground truth, 8 of 13 frameworks fabricated their own success.
- AI is rapidly advancing on autonomous tasks with verifiable outcomes. But the judgment calls, scoping the incident, choosing recommendations, and driving change management, will stay human. At least for now. LLM security engineering agents still succeed on only 18% of real-world tasks.
Sources:
Nate Pors. AI-Generated Reporting: Lessons from Cisco Talos Incident Response. Cisco Blog.