Deploying an LLM and reasonably worrying about backdoors?

Backdoors in large models are real, but how do we find them?

Our next NeurIPS paper is "ICLScan: Detecting Backdoors in Black-Box Large Language Models via Targeted In-context Illumination" that I read, so you don’t have to.

What is a backdoor? It is a hidden trigger introduced during model training to elicit malicious or unauthorized outputs from the model activated by a key phrase.

Read more about backdoors in Anthropic’s "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" published last year.

Typical backdoor defenses:

  • Red teaming
  • Input/output safety filters
  • White‑box backdoor scanners (e.g., BAIT, CLIBE) that assume you can read model weights

However, red teaming or access to the weights of a proprietary model is rarely possible.

The good news is that backdoored LLMs show "susceptibility amplification". They absorb new triggers from targeted In-Context Learning (ICL) prompts.

The authors showed how to determine whether the model is backdoored by using just 10-20 ICL queries and measuring the model’s injection success rate.

My take:

  • ICLScan, if done right, could be a great quick-check self-service tool for enterprise AI teams.
  • The method depends on exact backdoor knowledge and might fail if the scanner’s prompts are drawn from a different data distribution (OOD) or don’t match the attacker’s hidden target sequence.
  • The scanner hasn’t been tested on large or very large models (e.g., GPT-5) to confirm if its utility holds.
  • Overall, it’s a great idea that can be productized by an AI security company for top-of-the-funnel customer acquisition.

Sources:

  1. ICLScan: Detecting In-Context Learning Backdoors in Language Models
  2. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training