Still not perfect, but it’s nearly a 2x jump from September 2025, when the best result was ~0.37 (o4-mini).

Yiran Wu (UPenn), Mauricio Velazco (Microsoft), and their team developed ExCyTIn-Bench to evaluate models on SOC-style investigations.

My takeaways:

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
