That matters because cyber capabilities are a double-edged sword: the same LLMs that help defenders find zero-days can help the bad guys find them too.

János Kramár and the team at Google DeepMind just released a great paper, "Building Production-Ready Probes For Gemini". They built cyber-misuse classifiers for Gemini 2.5 Flash using activation probes and validated them against jailbreaks in long-context settings.

Highlights:

My take:

  1. Frontier models' cyber capabilities are evolving at an astonishing pace, and it’s absolutely critical to address their misuse. See my post from yesterday on Opus 4.6.
  2. Misuse detection is a hard problem. The challenge is not only to detect what is being asked, but to judge it in the context of who is asking.
  3. Cost and latency are the most critical factors. Running LLM-based classifiers at inference time on all traffic is not feasible, so the activation-probe approach is spot on.
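
To make the cost argument in point 3 concrete, here is a minimal sketch of what an activation probe can look like: a small linear classifier trained on hidden activations the model has already computed, so scoring a request costs roughly one dot product. This is not the paper's method or architecture. The activations are synthetic random vectors, and names such as `mean_pool`, `train_linear_probe`, `probe_score` and `fake_activations` are hypothetical, chosen only for illustration.

```python
import numpy as np


def mean_pool(token_activations):
    """Collapse a (num_tokens, hidden_dim) activation matrix into one vector."""
    return token_activations.mean(axis=0)


def train_linear_probe(features, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    weights = np.zeros(features.shape[1])
    bias = 0.0
    for _ in range(steps):
        logits = features @ weights + bias
        probs = 1.0 / (1.0 + np.exp(-logits))
        error = probs - labels  # gradient of the log loss w.r.t. the logits
        weights -= lr * features.T @ error / len(labels)
        bias -= lr * error.mean()
    return weights, bias


def probe_score(weights, bias, token_activations):
    """Score one request: pool its activations, then apply the linear probe."""
    pooled = mean_pool(token_activations)
    return 1.0 / (1.0 + np.exp(-(pooled @ weights + bias)))


# Synthetic stand-in for real model activations (an assumption for this demo):
# "misuse" requests are shifted along one fixed direction in activation space.
rng = np.random.default_rng(0)
hidden_dim = 16
direction = rng.normal(size=hidden_dim)
direction /= np.linalg.norm(direction)


def fake_activations(is_misuse, num_tokens=32):
    base = rng.normal(size=(num_tokens, hidden_dim))
    return base + (2.0 * direction if is_misuse else 0.0)


labels = np.array([i % 2 for i in range(200)], dtype=float)
features = np.stack([mean_pool(fake_activations(bool(y))) for y in labels])
weights, bias = train_linear_probe(features, labels)

benign_score = probe_score(weights, bias, fake_activations(False))
misuse_score = probe_score(weights, bias, fake_activations(True))
print(f"benign: {benign_score:.3f}  misuse: {misuse_score:.3f}")
```

The point of the sketch is the cost profile: the expensive forward pass happens anyway, and the probe adds only a pooling step and a dot product on top, whereas an LLM-based classifier needs a second model call per request.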
