Google DeepMind showed that activation probes can detect AI cyber misuse across a 1M-token context window at roughly 10,000x lower cost than LLM-based classifiers

That matters because cyber capabilities are a double-edged sword: the same LLMs that help defenders find zero-days can help attackers find them too.

János Kramár and the team at Google DeepMind just released a great paper, "Building Production-Ready Probes For Gemini". They built cyber-misuse classifiers for Gemini 2.5 Flash using activation probes and validated them against jailbreaks in long-context settings.

Highlights:

  • Activation probes are tiny classifiers that read the model’s internal signals while it processes a prompt, so you can score misuse risk without running a separate LLM guard on every request.
  • Long context makes misuse harder to catch because a few malicious lines can be buried inside a huge repo-sized prompt, so their probe designs focus on spotting the most suspicious snippet instead of averaging the whole prompt.
  • The production pattern is a cascade: run the cheap probe on all traffic, and defer only uncertain cases to an LLM-based classifier to improve quality while keeping costs low.
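To make the three ideas above concrete, here is a minimal sketch in NumPy: a linear probe scores each token position's activation vector, the prompt score is the max over positions (so one suspicious snippet flags a repo-sized prompt), and a cascade defers only the uncertain band to an expensive LLM classifier. The probe weights, thresholds, and `llm_classifier` hook are all hypothetical placeholders, not the paper's actual architecture.

```python
import numpy as np

def probe_scores(activations, w, b):
    """Linear probe: per-position misuse probability from hidden activations.

    activations: (num_positions, hidden_dim) array of model activations.
    w, b: learned probe weights (hypothetical values here).
    """
    logits = activations @ w + b
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

def prompt_score(activations, w, b):
    # Max over positions, not the mean: a few malicious lines buried
    # in a huge prompt should still dominate the score.
    return float(np.max(probe_scores(activations, w, b)))

def cascade(activations, w, b, llm_classifier, low=0.2, high=0.8):
    """Cheap probe on all traffic; defer only the uncertain band."""
    s = prompt_score(activations, w, b)
    if s < low:
        return "allow", s
    if s > high:
        return "flag", s
    # Uncertain: fall back to the expensive LLM-based classifier.
    return llm_classifier(activations), s
```

A toy run with a hand-built "misuse direction" in the first activation dimension:

```python
w = np.array([4.0, 0.0, 0.0, 0.0])
b = -2.0
benign = np.zeros((10, 4))          # all logits -2 -> score ~0.12 -> "allow"
malicious = np.zeros((10, 4))
malicious[5, 0] = 2.0               # one hot spot -> logit 6 -> "flag"
verdict, score = cascade(malicious, w, b, llm_classifier=lambda a: "flag")
```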

My take:

  1. Frontier models' cyber capabilities are evolving at an astonishing pace, and it's absolutely critical to address their misuse. See yesterday's post on Opus 4.6.
  2. Misuse detection is a hard problem. The challenge is not only detecting what is being requested, but interpreting it in the context of who is asking.
  3. Cost and latency are the most critical factors. Running LLM-based classifiers on all traffic at inference time is not feasible, so the activation-probes approach is spot on.

Sources:

Building Production-Ready Probes For Gemini