That matters because cyber capabilities are a double-edged sword: an LLM that can find zero-days for defenders can find them for the bad guys too.
János Kramár and the team at Google DeepMind just released a great paper, "Building Production-Ready Probes For Gemini". They built cyber-misuse classifiers for Gemini 2.5 Flash using activation probes and validated them against jailbreaks and long-context prompts.
Highlights:
- Activation probes are tiny classifiers that read the model’s internal signals while it processes a prompt, so you can score misuse risk without running a separate LLM guard on every request.
- Long context makes misuse harder to catch because a few malicious lines can be buried inside a huge repo-sized prompt, so their probe designs focus on spotting the most suspicious snippet instead of averaging the whole prompt.
- The production pattern is a cascade: run the cheap probe on all traffic, and defer only uncertain cases to an LLM-based classifier to improve quality while keeping costs low.
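To make the probe idea concrete, here is a minimal sketch of a linear activation probe with max-aggregation over tokens. Everything here is illustrative: the shapes, the random weights, and the sigmoid scoring are my assumptions for a toy example, not the paper's actual probe architecture. The point is the aggregation choice: taking the max over per-token scores lets a few malicious lines stand out in a repo-sized prompt, where a mean would dilute them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T tokens, d-dimensional internal activations per token.
T, d = 512, 64
activations = rng.normal(size=(T, d))  # stand-in for the model's residual-stream states
w, b = rng.normal(size=d), 0.0         # a "trained" linear probe (random here, for illustration)

def probe_scores(acts, w, b):
    """Per-token misuse logits from a linear probe, squashed to [0, 1]."""
    logits = acts @ w + b
    return 1.0 / (1.0 + np.exp(-logits))

per_token = probe_scores(activations, w, b)

# Max-aggregation: the most suspicious snippet dominates the prompt-level score,
# instead of being averaged away inside a huge benign context.
prompt_score = per_token.max()
```

Note that the probe only does a dot product per token against activations the model computes anyway, which is why it adds negligible cost compared to running a separate guard model.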
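The cascade pattern can be sketched as a simple threshold band. The thresholds and the routing labels below are hypothetical, not taken from the paper; the idea is just that the cheap probe decides the confident cases and only the uncertain middle band pays for the expensive LLM-based classifier.

```python
def cascade(prompt_score, low=0.2, high=0.9, llm_classifier=None):
    """Route a prompt based on the cheap probe's score (thresholds are illustrative)."""
    if prompt_score >= high:
        return "block"   # probe is confident: likely misuse
    if prompt_score <= low:
        return "allow"   # probe is confident: likely benign
    # Uncertain band: defer to the expensive LLM-based classifier,
    # which only ever sees this small slice of traffic.
    return llm_classifier(prompt_score) if llm_classifier else "defer"
```

With this shape, the cost of the LLM classifier scales with the width of the uncertain band rather than with total traffic, which is what makes the pattern viable in production.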
My take:
- Frontier models' cyber capabilities are evolving at an astonishing pace, and it's absolutely critical to address their misuse. See my post yesterday on Opus 4.6.
- Misuse detection is a hard problem. The challenge is not only detecting the what, but doing it in the context of the who.
- Cost and latency are the most critical factors. Running LLM-based classifiers on all traffic at inference time is not feasible, so the activation-probes approach is spot on.