Backdoors in large models are real, but how do we find them?
Our next NeurIPS paper is "ICLScan: Detecting Backdoors in Black-Box Large Language Models via Targeted In-context Illumination" that I read, so you don’t have to.
What is a backdoor? It is a hidden trigger introduced during model training to elicit malicious or unauthorized outputs from the model activated by a key phrase.
Read more about backdoors in Anthropic’s "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" published last year. I put a link
Typical backdoor defenses:
- Red teaming
- Input/output safety filters
- White‑box backdoor scanners (e.g., BAIT, CLIBE) that assume you can read model weights
However, red teaming or access to the weights of a proprietary model is rarely possible.
The good news is that backdoored LLMs show "susceptibility amplification". They absorb new triggers from targeted In-Context Learning (ICL) prompts.
The authors showed how to determine whether the model is backdoored by using just 10-20 ICL queries and measuring the model’s injection success rate.
My take:
- ICLScan, if done right, could be a great quick-check self-service tool for enterprise AI teams.
- The method depends on exact backdoor knowledge and might fail if the scanner’s prompts are drawn from a different data distribution (OOD) or don’t match the attacker’s hidden target sequence.
- The scanner hasn’t been tested on large or very large models (e.g., GPT-5) to confirm if its utility holds.
- Overall, it’s a great idea that can be productized by an AI security company for top-of-the-funnel customer acquisition.