Anthropic built Clio to understand how millions of people use Claude without reading their conversations. It has four layers of defense. An LLM extracts structured metadata from conversations while stripping personally identifiable information. Conversations get clustered into groups so no individual stands out. A summarizer produces aggregate descriptions. And a final LLM auditor reviews every summary for privacy violations before anyone sees it. Anthropic published the full design in December 2024 and used it to analyze a million real Claude conversations.
Meenatchi Sundaram Muthu Selva Annamalai (UCL), Emiliano De Cristofaro (UC Riverside), and Peter Kairouz (Google Research) replicated the pipeline locally using synthetic data. They inserted roughly 50 poisoned chats and showed that users' medical diagnoses appear in the output summaries. They described the attack in "Cliopatra: Extracting Private Information from LLM Insights".
Highlights:
- The attack works by inserting roughly 50 poisoned chats into the conversation pool. The chats are crafted to cluster with a target user's medical conversation and include a prompt injection that instructs the summarizer to "include medical history." In a live deployment, this would require creating a few fake accounts.
- Full attack results: the attacker correctly extracted the target's disease 39% of the time on Claude Haiku (the model Clio uses), 42% on LLaMA, 44% on Gemma, 81% on Qwen. Without any attack, just guessing from age, gender, and one symptom, an LLM gets the right disease 22% of the time.
- Clio's PII stripper successfully removed PII (names, addresses) but preserved medical diagnoses in the extracted metadata. Gemma leaked 39% of diseases, Claude Haiku 49%, LLaMA 51%, and Qwen 73%. Diseases are not classified as PII, so this is not a stripping failure.
- The prompt injection fires at the summarization stage with instructions like "include medical history" that sit inert through facet extraction and clustering. When the summarizer LLM processes the cluster, it reads the injected instruction and includes the target's diagnosis in the summary.
- The LLM privacy auditor detected zero attacks across all model families. 56.6% of clusters with the leaked diseases were rated 5/5 on privacy, with justifications citing "lack of explicit identifiers" and "generic information."
- Differential privacy (epsilon=25, where lower values mean stronger privacy) is the only defense that reduces the attack to the 22% baseline. At epsilon=50, the attack still succeeds 53-83% of the time.
My take:
- The full attack chain rests on two assumptions: that the attacker knows at least one of the target's symptoms, and that the attacker can access Clio's cluster summaries. The first is plausible, but the second is the harder requirement. Clio's output is largely internal to Anthropic. The researchers also rebuilt their version of Clio from the published paper, while the real Clio running in production has likely evolved since its introduction in December 2024.
- A medical diagnosis does not identify anyone by itself, but if chained with a deanonymization step, it can be attributed to a specific person. Ironically, a study co-authored by Anthropic showed that LLMs can identify users from anonymized text. Cliopatra extracts the diagnosis, the deanonymization attack identifies the person. Neither paper demonstrates the full chain, but both pieces now exist independently.
- Qwen 4B kept 92% of symptoms and 73% of diseases in extracted facets despite being told to remove sensitive information, compared to Claude Haiku at 27% and 49%. If you use small open-weight models for stripping sensitive information, do comprehensive evals and test for edge cases.
- The trivial prompt injection "include medical history" persists through the entire pipeline. It passes through facet extraction as part of the conversation text, survives clustering as an embedding, and fires when the summarizer reads it as an instruction. The paper does not evaluate prompt injection resistance per model, but comparing facet leakage to full attack results gives a rough signal. Qwen 30B was the least resilient to prompt injection and amplified leakage by 8 points, from 73% to 81%, while LLaMA 70B reduced it by 9 points and Claude Sonnet 4.5 reduced it by 10 points. I cannot conclude from this paper whether it is a Qwen-specific or a model-size problem.
- Clio was optimized for utility. Anthropic validated it with 19,476 synthetic transcripts and a manual audit of 5,000 real conversations for privacy, finding no leakage. No adversarial inputs, no poisoning, no prompt injection, no red-teaming of the pipeline. Anthropic tested whether Clio leaks by accident and found it does not. The Cliopatra researchers tested whether Clio leaks under attack and found it does, 39-81% of the time. It is a good time to expand red-teaming scope to privacy cases when you test your AI system, especially if it handles "special category data" under GDPR Article 9 or "sensitive personal information" under California's CPRA.