Backdoors in large models are real, but how do we find them?

Our next NeurIPS paper is "ICLScan: Detecting Backdoors in Black-Box Large Language Models via Targeted In-context Illumination" that I read, so you don’t have to.

What is a backdoor? It is a hidden trigger introduced during model training to elicit malicious or unauthorized outputs from the model activated by a key phrase.

Read more about backdoors in Anthropic’s "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" published last year. I put a link

Typical backdoor defenses:

However, red teaming or access to the weights of a proprietary model is rarely possible.

The good news is that backdoored LLMs show "susceptibility amplification". They absorb new triggers from targeted In-Context Learning (ICL) prompts.

The authors showed how to determine whether the model is backdoored by using just 10-20 ICL queries and measuring the model’s injection success rate.

My take:

The full paper

Anthropic's Sleeper Agents paper