Google's new audit shows 3 of 4 unlearning methods fail to forget

TL;DR: The only method that truly erases data keeps training on it under random labels. The clever alternatives leave fingerprints an output-only statistical test can detect. For frontier LLMs there is no affordable proof of forgetting yet: the audit itself requires a $100M+ retrain.

Google researchers asked a model to forget, but did it?

There are 4 main unlearning methods: fine-tuning, pruning, parameter dampening, and random-label. But there was no robust way to prove how well a model actually forgets.

So Mónica Ribero from Google Research built a framework to audit machine unlearning. It generalizes MMD, the field's default statistical test for comparing two sets of samples, co-invented by her co-author Arthur Gretton. The code is public.

Highlights:

The naive check: retrain the model from scratch without the forgotten data, then compare the two statistically. That method doesn't work, because training randomness simply makes two models look different.
The fix is a three-way comparison. The audit measures the statistical difference between the tested model's outputs and two references: the retrained copy and the original that still contains the data. A smaller difference from the original means the data is still in there.

How the three-way comparison flags failed unlearning. Source: Google Research.

The audit showed that on a small image classifier asked to forget 10 training images, fine-tuning, pruning, and parameter dampening failed. Only random-label unlearning worked, the method that keeps training on the forgotten images under random labels.

Only random-label unlearning passed the three-way comparison. Source: Google Research.

The same audit catches broken differential privacy implementations with just 5,000 output samples vs the millions previous methods needed. Google's previous auditing library, DP-Auditorium, missed the flawed mechanism entirely. In the Cliopatra attack on Anthropic's Clio, differential privacy at epsilon 25 was the only defense that held, and a guarantee like that only protects if the implementation is correct.

My take:

Model unlearning is a big deal, especially as part of compliance with privacy regulations. The burden of proof is on the model owner, and they need a mechanism to demonstrate that the data is really gone. A related need shows up after an attack: poisoned agent memories are hard to detect and weed out.
The irony of random-label unlearning is that it makes the model forget the data by continuing to train on it.
The catch-22: the audit needs outputs from a model retrained without the data, so proving you did not need to retrain requires doing the retraining anyway. That is fine for a small classifier and a fantasy for a frontier LLM, where a pretraining run costs $100M+.

Highlights:

My take:

Sources: