Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models

📝 Paper Summary

Machine Unlearning in Generative Language Models
Privacy and Compliance (GDPR, Right To Be Forgotten)

ICU enables generative models to forget sensitive data by iteratively optimizing an unlearning loss while simultaneously maintaining linguistic capabilities through contrastive learning on analogous, safe text pairs.

Core Problem

Existing unlearning methods either require access to the full original training data (often unavailable) or, when applied to generative models, cause 'model collapse' where general linguistic capabilities are lost.

Why it matters:

Regulations like GDPR require the 'Right To Be Forgotten,' forcing AI developers to delete specific private data from trained models
Directly maximizing negative log-likelihood on forget sets often destroys the model's ability to generate coherent text, leading to repetitive or nonsensical outputs
Retraining large models from scratch for every deletion request is computationally prohibitive

Concrete Example: When trying to make a model forget 'Harry Potter' details, a standard unlearning method (KUMPR) causes the model to collapse and repeat 'and I have a Harry Potter' endlessly. ICU successfully forgets the specific facts (e.g., 'J.K. Rowling') while retaining the ability to write grammatically correct sentences about similar topics.

Key Novelty

Iterative Contrastive Unlearning (ICU)

Constructs an 'Analogous Set' of data related to the forget set but containing different facts, used to anchor the model's general capabilities
Applies a three-part loss: maximizing error on forget data (unlearning), minimizing error on analogous data (learning), and minimizing KL divergence from the original model (stability)
Iteratively updates the forget set by removing samples that are successfully forgotten based on dynamic thresholds, preventing over-unlearning

Architecture

The ICU framework pipeline showing the three main modules: Knowledge Unlearning Induction, Contrastive Learning Enhancement, and Iterative Unlearning Refinement.

Evaluation Highlights

Achieves best normalized score (0.64) balancing unlearning and performance on GPT-Neo 1.3B, surpassing the KUMPR baseline (-0.37)
Maintains low perplexity (16.20 on Pile-val) comparable to the original model (16.03), whereas KUMPR degrades to 41.76
Effective unlearning: Extraction Likelihood drops from 0.40 (Original) to 0.04, matching the dedicated unlearning baseline KUMPR (0.04)

Breakthrough Assessment

7/10

Offers a practical, effective solution for the 'model collapse' problem in unlearning without needing original training data. Strong empirical results on balancing forgetting vs. utility.

⚙️ Technical Details

Problem Definition

Setting: Given a forget set D_fgt and a pre-trained model f_theta, modify parameters theta to minimize retention of D_fgt while maintaining performance on other data.

Inputs: A set of target sequences to be forgotten (D_fgt) and access to the trained model parameters.

Outputs: Updated model parameters theta under which the model no longer generates the forget set sequences.

Pipeline Flow

Analogous Data Construction (Retrieve similar documents from Wikipedia)
KNN Sampling (Pair forget samples with analogous samples)
Iterative Training Loop (Unlearning + Learning + Refinement)

System Modules

Knowledge Unlearning Induction (KUI)

Maximize the negative log-likelihood of the target sequences to induce forgetting

Model or implementation: Target GLM (e.g., GPT-Neo)
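
As a minimal sketch of what maximizing NLL does to a single next-token distribution, the toy example below runs gradient ascent directly on three logits. The logit values, learning rate, and step count are illustrative assumptions, not settings from the paper, and a real GLM would update model weights rather than raw logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits, target):
    """Negative log-likelihood of the target token."""
    return -math.log(softmax(logits)[target])

def ascent_step(logits, target, lr):
    """One gradient-ascent step on the NLL with respect to the logits."""
    probs = softmax(logits)
    # d(NLL)/d(logit_i) = p_i - 1[i == target]
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Ascent moves along the gradient, which increases the loss.
    return [z + lr * g for z, g in zip(logits, grad)]

# Toy next-token logits (hypothetical); index 0 plays the memorized token.
initial_logits = [2.0, 0.5, -1.0]
target = 0

logits = list(initial_logits)
for _ in range(20):
    logits = ascent_step(logits, target, lr=0.5)

p_before = softmax(initial_logits)[target]
p_after = softmax(logits)[target]
print(f"P(target) before={p_before:.3f}, after={p_after:.3f}")
```

Run without any counterweight, this pushes the target probability toward zero indefinitely, which is the behavior the other two modules are meant to restrain.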

Contrastive Learning Enhancement (CLE)

Minimize NLL on paired analogous data and minimize KL divergence from original model to preserve general capabilities

Model or implementation: Target GLM + Original Frozen Model (for KL)

Iterative Unlearning Refinement (IUR)

Dynamically evaluate if a sample is 'forgotten' using BERTScore/BLEU and remove it from the training set for subsequent epochs

Model or implementation: Evaluation Metrics (BERTScore, BLEU)
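
The refinement step can be sketched as a filter over the forget set. The thresholds below come from the hyperparameter list in this summary; everything else is an assumption for illustration: the function names are hypothetical, the scorers are passed in as callables, the toy scorer is a simple token-overlap stand-in for BERTScore and BLEU, and requiring both scores to fall below their thresholds is a guess at how the two criteria combine.

```python
from typing import Callable, List, Tuple

BERTSCORE_THRESHOLD = 0.3   # unlearning_threshold_bertscore
BLEU_THRESHOLD = 0.01       # unlearning_threshold_bleu

Scorer = Callable[[str, str], float]
Sample = Tuple[str, str]    # (prefix, reference continuation)

def is_forgotten(generated: str, reference: str,
                 bert_score_fn: Scorer, bleu_fn: Scorer) -> bool:
    # Assumption: a sample counts as forgotten only if BOTH scores are low.
    return (bert_score_fn(generated, reference) < BERTSCORE_THRESHOLD
            and bleu_fn(generated, reference) < BLEU_THRESHOLD)

def refine_forget_set(forget_set: List[Sample],
                      generate_fn: Callable[[str], str],
                      bert_score_fn: Scorer,
                      bleu_fn: Scorer) -> List[Sample]:
    """Keep only the samples the model can still reproduce."""
    remaining = []
    for prefix, reference in forget_set:
        generated = generate_fn(prefix)
        if not is_forgotten(generated, reference, bert_score_fn, bleu_fn):
            remaining.append((prefix, reference))
    return remaining

def toy_overlap(generated: str, reference: str) -> float:
    """Stand-in scorer: share of reference tokens present in the generation."""
    ref = set(reference.split())
    return len(ref & set(generated.split())) / len(ref)

# Hypothetical data: the model still reproduces sample a but not sample b.
forget_set = [("prefix a", "the secret code is 1234"),
              ("prefix b", "her phone number is 555 0100")]
outputs = {"prefix a": "the secret code is 1234",
           "prefix b": "I cannot recall that"}

remaining = refine_forget_set(forget_set, outputs.get, toy_overlap, toy_overlap)
print(remaining)
```

Samples dropped here receive no further unlearning gradient in later epochs, which is how the loop avoids over-unlearning.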

Novel Architectural Elements

Iterative Contrastive Unlearning framework integrating dynamic dataset refinement with a dual-objective loss (unlearning + contrastive learning)

Modeling

Base Model: GPT-Neo (125M, 1.3B, 2.7B), OPT (125M, 1.3B, 2.7B), TinyLlama 1.1B

Training Method: Gradient-based fine-tuning with mixed objective

Objective Functions:

Purpose: Forget specific target sequences.

Formally: Maximize -log P_theta(x_t | x_<t) for x in D_fgt

Purpose: Maintain ability to process similar concepts.

Formally: Minimize -log P_theta(x_t | x_<t) for x in D_lrn (analogous data)

Purpose: Prevent drift from original model distribution on retained knowledge.

Formally: Minimize KL(P_theta || P_theta0), where theta0 denotes the parameters of the frozen original model
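
A minimal sketch of how the three objectives might combine into one scalar to minimize, using toy next-token distributions in plain Python. The weighting is an assumption: this summary lists alpha = 0.5 and beta = 1.0 but does not say which term each one scales, so here alpha weights the analogous-data term and beta the KL term. The function name `icu_style_loss` is hypothetical.

```python
import math
from typing import Sequence

def nll(probs: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target index under probs."""
    return -math.log(probs[target])

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def icu_style_loss(p_forget, y_forget, p_learn, y_learn,
                   p_current, p_original, alpha=0.5, beta=1.0) -> float:
    # Minimizing the negated NLL is the same as maximizing NLL on forget data.
    unlearn = -nll(p_forget, y_forget)
    # Ordinary NLL on the paired analogous sample.
    learn = nll(p_learn, y_learn)
    # KL(P_theta || P_theta0): current model against the frozen original.
    stability = kl_divergence(p_current, p_original)
    # Assumed weighting; the summary does not specify the exact form.
    return unlearn + alpha * learn + beta * stability

uniform = [0.5, 0.5]
loss_memorized = icu_style_loss(uniform, 0, uniform, 0, uniform, uniform)
loss_forgotten = icu_style_loss([0.1, 0.9], 0, uniform, 0, uniform, uniform)
print(loss_memorized, loss_forgotten)
```

Lower probability on the forget target lowers the total, while any drift of the current distribution away from the original raises it through the KL term.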

Training Data:

Forget set: 128 samples from Pile subset (extraction benchmark)
Analogous set: Wikipedia documents in the same category as the forget set
KNN matching using all-MiniLM-L6-v2 to pair forget samples with analogous samples
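
The pairing step can be sketched as a nearest-neighbor search over sentence embeddings. The 2-D vectors below are hypothetical stand-ins for all-MiniLM-L6-v2 embeddings, and cosine similarity is an assumed choice of distance; the summary does not state which metric the paper uses. K defaults to 1, matching the hyperparameter list.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_pairs(forget_vecs: List[Sequence[float]],
              analog_vecs: List[Sequence[float]],
              k: int = 1) -> List[List[int]]:
    """For each forget embedding, return indices of its k nearest analogous embeddings."""
    pairs = []
    for f in forget_vecs:
        ranked = sorted(range(len(analog_vecs)),
                        key=lambda j: cosine(f, analog_vecs[j]),
                        reverse=True)
        pairs.append(ranked[:k])
    return pairs

# Toy embeddings (hypothetical); real ones would come from a sentence transformer.
forget_vecs = [[1.0, 0.0], [0.0, 1.0]]
analog_vecs = [[0.1, 0.9], [0.9, 0.1], [-1.0, 0.0]]

pairs = knn_pairs(forget_vecs, analog_vecs, k=1)
print(pairs)
```

Each forget sample is then trained alongside its matched analogous sample, so the learning signal comes from text that is topically close but factually different.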

Key Hyperparameters:

learning_rate: 5e-6
optimizer: Adam
alpha: 0.5
beta: 1.0
K (KNN): 1
unlearning_threshold_bertscore: 0.3
unlearning_threshold_bleu: 0.01

Compute: 1x RTX 3090 (125M), 3x RTX 3090 (1.3B), 6x RTX 3090 (2.7B). Batch size 4-8.

Comparison to Prior Work

vs. KUMPR: ICU adds contrastive learning (analogous data) and iterative refinement to prevent model collapse, whereas KUMPR often destroys generation quality.
vs. SISA/KGA: ICU does not require access to the full original training dataset, making it feasible for deployed GLMs.
vs. DPO (unlearning variant): ICU explicitly models the preservation of general capabilities via KL divergence and analogous data learning.

Limitations

Requires constructing an 'Analogous Set' which relies on the availability of similar public data (e.g., Wikipedia).
Performance depends on the quality of the sentence transformer used for KNN matching.
Iterative evaluation adds computational overhead compared to single-pass unlearning methods.

Reproducibility

Code: https://github.com/himalalps/ICU

📊 Experiments & Results

Evaluation Setup

Targeted unlearning of specific sequences from the Pile dataset, followed by evaluation of unlearning success and general model utility.

Benchmarks:

Pile Subset, Extraction Benchmark (data extraction / unlearning target)
Downstream Classification Tasks (general NLU: HellaSwag, LAMBADA, WinoGrande, PIQA, etc.)
Dialogue Tasks (conversation generation: Wizard of Wikipedia, Empathetic Dialogues)

Metrics:

Extraction Likelihood (EL)
Memorization Accuracy (MA)
Perplexity (PPL)
BERTScore
BLEU
Downstream Task Accuracy/F1
Information Entropy

Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison on GPT-Neo 1.3B shows ICU achieves the best balance. Note: Lower EL/MA/PPL is better; higher Acc/F1 is better.

Benchmark	Metric	Baseline	This Paper	Δ
Pile Subset	Extraction Likelihood (EL)	0.40	0.04	-0.36
Pile Subset	Memorization Accuracy (MA)	0.47	0.09	-0.38
Pile Validation	Perplexity (PPL)	41.76	16.20	-25.56
Wikitext	Perplexity (PPL)	78.48	23.46	-55.02
Downstream Tasks (Avg)	Accuracy	0.50	0.54	+0.04

Note: The Baseline column mixes reference systems. Per the Evaluation Highlights, the EL baseline (0.40) is the original model, while the Pile Validation PPL baseline (41.76) is KUMPR.

Ablation study on loss components (GPT-Neo 1.3B).

Benchmark	Metric	Baseline	This Paper	Δ
Overall Utility	Normalized Score	0.33	0.64	+0.31
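
The Δ column is the "This Paper" value minus the "Baseline" value. A short check that recomputes it from the reported figures, using only the numbers in the table above (the row labels are shortened for readability):

```python
# (row label, baseline, this paper) as reported in the Key Results table
rows = [
    ("Pile Subset EL", 0.40, 0.04),
    ("Pile Subset MA", 0.47, 0.09),
    ("Pile Validation PPL", 41.76, 16.20),
    ("Wikitext PPL", 78.48, 23.46),
    ("Downstream Accuracy", 0.50, 0.54),
    ("Normalized Score", 0.33, 0.64),
]

# Delta = this paper - baseline, rounded to the table's two decimals.
deltas = {name: round(ours - base, 2) for name, base, ours in rows}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

Negative deltas are improvements for EL, MA, and PPL, where lower is better; positive deltas are improvements for accuracy and the normalized score.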

Experiment Figures

Comparison of text generation between original model, KUMPR, and ICU given a prompt about Harry Potter.

Evolution of generated text across training epochs.

Main Takeaways

ICU effectively unlearns sensitive information (matching dedicated baselines like KUMPR) without the catastrophic performance degradation (model collapse) seen in those baselines.
The 'Analogous Data' strategy allows the model to retain linguistic structure and knowledge of similar concepts while forgetting specific facts.
Iterative refinement prevents 'over-unlearning' by dynamically removing samples once they meet the forgetting criteria (low BERTScore/BLEU).
Larger models (2.7B) show a higher tendency to memorize sensitive info, making unlearning more critical; ICU scales effectively to these sizes.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Gradient Ascent for unlearning (maximizing loss)
Knowledge of Kullback-Leibler (KL) divergence
Familiarity with Transformer-based Generative Language Models (GLMs)

Key Terms

Machine Unlearning: The process of removing specific knowledge or data points from a trained machine learning model

Model Collapse: A failure mode where a generative model loses its diversity or linguistic structure, often outputting repetitive or garbage text

Negative Log-Likelihood (NLL): A loss function commonly used in training language models; minimizing it improves prediction, maximizing it (gradient ascent) induces forgetting

KL Divergence: Kullback-Leibler divergence, a measure of how one probability distribution differs from a second, reference probability distribution

Extraction Likelihood (EL): A metric measuring the success rate of extracting specific training sequences from a model via generation

Memorization Accuracy (MA): A metric quantifying how accurately a model can complete a given prefix from the training data

Analogous Set: A constructed dataset containing information similar in category to the forget set but with different key concepts, used to preserve model capabilities

BLEU: Bilingual Evaluation Understudy, an n-gram overlap metric originally designed for evaluating machine translation; here it measures how closely generated text matches the target sequence

BERTScore: A metric for text generation evaluation that computes similarity using contextual embeddings from BERT rather than exact n-gram matching

Sentence Transformer: A modification of the BERT network that uses siamese networks to derive semantically meaningful sentence embeddings

GLM: Generative Language Model, an AI model designed to generate text, such as GPT or Llama
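
To make the Memorization Accuracy entry concrete, here is a minimal sketch of one common token-level formulation: the fraction of positions where the model's top next-token prediction matches the actual next token of the training sequence. The `predict_next` callables are hypothetical stand-ins for a greedy decoding step, and the exact definition used in the paper may differ in detail.

```python
from typing import Callable, Sequence

def memorization_accuracy(tokens: Sequence[int],
                          predict_next: Callable[[Sequence[int]], int]) -> float:
    """Share of next-token predictions that match the training sequence."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score a continuation")
    hits = sum(1 for t in range(1, len(tokens))
               if predict_next(tokens[:t]) == tokens[t])
    return hits / (len(tokens) - 1)

# Toy token ids for a "memorized" training sequence (hypothetical).
sequence = [5, 6, 7, 8, 9]

def memorized_model(prefix: Sequence[int]) -> int:
    """Stand-in that always continues the sequence correctly."""
    return prefix[-1] + 1

def forgetful_model(prefix: Sequence[int]) -> int:
    """Stand-in that never reproduces the next token."""
    return 0

def partial_model(prefix: Sequence[int]) -> int:
    """Stand-in that is only right for short prefixes."""
    return prefix[-1] + 1 if len(prefix) <= 2 else 0

ma_memorized = memorization_accuracy(sequence, memorized_model)
ma_forgotten = memorization_accuracy(sequence, forgetful_model)
ma_partial = memorization_accuracy(sequence, partial_model)
print(ma_memorized, ma_forgotten, ma_partial)
```

Successful unlearning should move a sequence from the first case toward the second, which is the direction of the MA drop reported in the results.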