0.5% attack success rate, outperforming GPT-4 and Gemini 3, and comparable to GPT-5 on agentic workflows.
Sizhe Chen at Meta FAIR published SecAlign, the first open-source LLM with commercial-grade prompt injection resilience.
Why does it matter? The strongest prompt injection defenses have been locked inside proprietary models like GPT-5 and Gemini-3-Pro.
How does it work?
- A new `input` message type separates trusted instructions from untrusted data — a one-line code change for developers.
- DPO training teaches the model to follow user instructions and ignore injected instructions hiding in the data.
- Randomized injection positions and self-generated responses fix shortcut learning and label quality issues from the initial version of SecAlign.
Results:
- Attack success rate drops to 0–2% across benchmarks, down from 53–99% undefended.
- No meaningful utility drop — the first defense to preserve the undefended model's performance.
- More secure than GPT-4o, Gemini-2.5-Flash, and Gemini-3-Pro on most PI benchmarks, and comparable to GPT-5 on agentic workflows.
My take:
- The security through obscurity of commercial model defense recipes is not moving us forward towards robust prompt injection protections.
- SecAlign++ democratizes AI security. We need model-level prompt injection defenses in open-source models.
- The prompt injection threat is far from solved. The model is still vulnerable to strong adaptive attacks (47.3% GCG success rate on 70B).
Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks