ERASE -- A Real-World Aligned Benchmark for Unlearning in Recommender Systems

📝 Paper Summary

Machine Unlearning, Recommender Systems, Privacy

ERASE is a large-scale benchmark for machine unlearning in recommender systems that evaluates diverse tasks, real-world unlearning scenarios, and operational efficiency across seven algorithms and nine datasets.

Core Problem

Existing unlearning benchmarks for recommenders focus narrowly on collaborative filtering and on unrealistic 'one-shot' deletions of large data chunks, ignoring sequential requests and other tasks such as session-based recommendation.

Why it matters:

Legal regulations (e.g., GDPR) and security needs (e.g., removing spam) require efficient data deletion, but current methods are often too slow or degrade model utility.
Real-world systems face continuous, small-scale deletion requests (e.g., users withdrawing consent), not the massive single-batch deletions simulated in prior benchmarks.
Prior benchmarks overlook critical recommendation tasks like Next-Basket and Session-Based Recommendation, which differ significantly from standard Collaborative Filtering.

Concrete Example: A user suffering from addiction requests the removal of all interactions with alcohol products. Current benchmarks simulate this by deleting random 5% chunks of training data, failing to capture the specific, sensitive nature of this request or the need to process it immediately without full retraining.

Key Novelty

ERASE Benchmark

Introduces sequential unlearning of small batches to mimic real-time requests (e.g., removing sensitive items user-by-user) rather than single large-batch deletions.
Expands evaluation scope beyond Collaborative Filtering to include Session-Based and Next-Basket Recommendation, using 9 diverse datasets.
Provides 600GB of pre-computed artifacts (checkpoints, logs) to allow researchers to test new unlearning methods without expensive model pre-training.

Architecture

Overview of the ERASE benchmark pipeline including tasks, unlearning scenarios, algorithms, and evaluation metrics.

Evaluation Highlights

Retraining takes up to 24 hours, while efficient unlearning methods (like SCIF) reduce this latency by three or more orders of magnitude.
Recommender-specific unlearning methods (SCIF, GIF) consistently outperform general-purpose methods (from the NeurIPS competition) in stability and utility preservation.
General-purpose methods often fail on recurrent/attention-based architectures (GRU4Rec, SASRec), sometimes degrading utility significantly compared to retraining.
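
To put the first highlight in concrete terms, the arithmetic below converts the 24-hour worst-case retraining time into the latency implied by a speedup of exactly three orders of magnitude, the lower end of the stated range. The variable names are illustrative and not taken from the benchmark.

```python
# Worst-case retraining time from the highlight above, in seconds.
retrain_seconds = 24 * 60 * 60            # 86_400 s

# Three orders of magnitude is a speedup factor of 10**3 (the lower bound).
min_speedup = 10 ** 3

# Latency implied at exactly 1000x; larger speedups give smaller values.
unlearn_seconds = retrain_seconds / min_speedup
print(f"{unlearn_seconds:.1f} s")         # 86.4 s
```

At this bound, a request that would otherwise cost a day of retraining is handled in about a minute and a half.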

Breakthrough Assessment

8/10

Significantly advances the field by aligning evaluation with real-world constraints (sequential requests, diverse tasks) and releasing massive artifacts to lower barriers for future research.

⚙️ Technical Details

Problem Definition

Setting: Machine Unlearning in Recommender Systems

Inputs: Trained recommendation model, Retain Set Dr, Forget Set Df (stream of batches)

Outputs: Updated model parameters behaving as if trained only on Dr

Pipeline Flow

Base Model Training (train on full dataset D)
Unlearning Request Stream (receive sequence of small forget batches)
Sequential Unlearning (update model for each batch)
Evaluation (measure Utility, Effectiveness, Efficiency)
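
The four stages above can be sketched as a simple loop. This is a minimal illustration, assuming a toy popularity-count "model" for which removal is trivial; the function names (`train`, `unlearn`, `run_pipeline`) and the `(user, item)` record format are hypothetical and do not reflect the ERASE codebase.

```python
from collections import Counter

def train(interactions):
    """Stage 1: toy base model that counts how often each item occurs."""
    return Counter(item for _, item in interactions)

def unlearn(model, forget_batch):
    """Stage 3: remove one forget batch from the toy model."""
    model.subtract(Counter(item for _, item in forget_batch))
    # Drop items whose count reached zero so they are no longer recommended.
    for item in [i for i, c in model.items() if c <= 0]:
        del model[item]
    return model

def run_pipeline(full_data, forget_stream, evaluate):
    model = train(full_data)                      # 1. base model training on D
    history = []
    for step, batch in enumerate(forget_stream):  # 2. unlearning request stream
        model = unlearn(model, batch)             # 3. sequential unlearning
        history.append((step, evaluate(model)))   # 4. evaluation after each batch
    return model, history

# Hypothetical toy data: two users withdraw their "beer" interactions in turn.
data = [("u1", "beer"), ("u1", "bread"), ("u2", "beer"),
        ("u2", "milk"), ("u3", "bread")]
stream = [[("u1", "beer")], [("u2", "beer")]]

model, history = run_pipeline(data, stream, evaluate=lambda m: sorted(m))
print(history[-1])   # (1, ['bread', 'milk'])
```

Evaluating after every batch, rather than once at the end, is what exposes cumulative degradation over a long request stream.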

System Modules

Base Recommender

Predict user preferences (items/baskets)

Model or implementation: Various (LightGCN, SASRec, GRU4Rec, etc.)

Unlearning Algorithm

Update model parameters to remove influence of Df

Model or implementation: SCIF, GIF, Kookmin, etc.

Novel Architectural Elements

Sequential unlearning pipeline evaluating cumulative degradation over time
Task-agnostic interface supporting CF, SBR, and NBR models

Modeling

Base Model: 9 models: LightGCN, SGL, Matrix Factorization (CF); GRU4Rec, NARM, SASRec, SR-GNN (SBR); DNNTSP, Sets2Sets (NBR)

Training Method: Approximate Unlearning (Gradient/Hessian updates)

Objective Functions:

Purpose: Remove influence of specific data points.

Formally: Various (e.g., Influence Functions minimizing loss on retain vs forget set).
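
As an illustration of the influence-function idea, the sketch below applies a single second-order (Newton) update to a one-parameter ridge regression. It is a generic textbook construction under stated assumptions, not the SCIF or GIF algorithm: the model, data, and function names are hypothetical. Because the loss is quadratic, the update reproduces the retrained parameter exactly; for neural recommenders the same step is only an approximation.

```python
def fit_ridge_1d(points, lam):
    """Closed-form minimiser of sum((w*x - y)**2) + lam * w**2 for scalar w."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)

def newton_unlearn_1d(w, retain, forget, lam):
    """One second-order update that removes `forget` from a fitted w."""
    # Gradient of the forget-set loss at the current parameter.
    grad_forget = sum(2 * x * (w * x - y) for x, y in forget)
    # Hessian (a scalar here) of the retain-set objective.
    hess_retain = 2 * (sum(x * x for x, _ in retain) + lam)
    # Newton step on the retain objective, starting from the full-data optimum.
    return w + grad_forget / hess_retain

retain = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
forget = [(4.0, 1.0)]      # an outlier-like point to be forgotten
lam = 0.1

w_full = fit_ridge_1d(retain + forget, lam)
w_unlearned = newton_unlearn_1d(w_full, retain, forget, lam)
w_retrained = fit_ridge_1d(retain, lam)

print(abs(w_unlearned - w_retrained) < 1e-9)   # True
```

The update needs only the forget-set gradient and a retain-set curvature estimate, which is why methods in this family can scale with the size of the forget set rather than the full dataset.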

Adaptation: Unlearning updates applied to pre-trained weights

Trainable Parameters: Full model or subset depending on unlearning method

Training Data:

9 public datasets (including MovieLens, Food, Amazon, 30music, NowP, TaFeng, Dunnhumby, ValuedShopper)
Forget sets are constructed via two scenarios: Sensitive Item removal (users removing their interactions with a specific item category) and Poisonous Data removal (removing injected spam)
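
The sensitive-item scenario can be illustrated with a small sketch that splits interactions into retain and forget sets and then groups the forget set into one batch per user. The record format, the category mapping, and the function names are assumptions made for illustration; the benchmark's actual construction may differ.

```python
from collections import defaultdict

def split_sensitive(interactions, item_category, sensitive_category, requesting_users):
    """Split interactions into (retain, forget) for users removing one category."""
    retain, forget = [], []
    for user, item in interactions:
        is_sensitive = item_category.get(item) == sensitive_category
        if user in requesting_users and is_sensitive:
            forget.append((user, item))
        else:
            retain.append((user, item))
    return retain, forget

def per_user_batches(forget):
    """Group the forget set into one batch per user, in first-seen order."""
    batches = defaultdict(list)
    for user, item in forget:
        batches[user].append((user, item))
    return list(batches.values())

# Hypothetical toy data: u1 and u3 ask to forget their alcohol interactions.
interactions = [("u1", "beer"), ("u1", "bread"), ("u2", "wine"),
                ("u3", "wine"), ("u3", "beer"), ("u3", "milk")]
item_category = {"beer": "alcohol", "wine": "alcohol",
                 "bread": "food", "milk": "food"}

retain, forget = split_sensitive(interactions, item_category, "alcohol", {"u1", "u3"})
stream = per_user_batches(forget)
print(stream)
# [[('u1', 'beer')], [('u3', 'wine'), ('u3', 'beer')]]
```

Note that u2's wine interaction stays in the retain set, since only users who issued a request have their sensitive interactions removed.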

Key Hyperparameters:

spam_injection_rate: 1%
spam_batch_size: 256 users
sensitive_unlearning_rate: Sequentially by user
total_sensitive_removed: 0.01% of interactions
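
For the spam scenario, the listed batch size of 256 users implies that injected accounts are removed in fixed-size chunks. A minimal sketch of that chunking follows; the helper name, the user IDs, and the total of 600 spam accounts are hypothetical and chosen only to show a partial final batch.

```python
from typing import Iterator, List, Sequence

SPAM_BATCH_SIZE = 256  # users per forget batch, as listed above

def user_batches(user_ids: Sequence[str],
                 batch_size: int = SPAM_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive forget batches of at most `batch_size` users."""
    for start in range(0, len(user_ids), batch_size):
        yield list(user_ids[start:start + batch_size])

spam_users = [f"spam_{i}" for i in range(600)]   # hypothetical injected accounts
batches = list(user_batches(spam_users))
print([len(b) for b in batches])                 # [256, 256, 88]
```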

Compute: Benchmark generation consumed more than 13,000 GPU hours. Unlearning times vary from seconds to minutes.

Comparison to Prior Work

vs. CURE4Rec: ERASE covers SBR/NBR tasks (CURE4Rec covers only CF), uses sequential/streaming unlearning (CURE4Rec uses one-shot deletion), and includes general ML unlearning baselines.
vs. RecEraser/UltraRE: ERASE focuses on approximate methods linear in forget set size, excluding these exact methods due to inefficiency for continuous small requests.

Limitations

Focuses only on approximate unlearning methods linear in forget set size; excludes exact methods like SISA/RecEraser.
Effectiveness metrics (RelItems, RelEff) are empirical proxies, not theoretical guarantees of privacy.
General-purpose unlearning methods (NeurIPS winners) performed poorly on recommender tasks, suggesting a need for domain-specific adaptation.

Reproducibility

Code: https://github.com/deem-data/erase-bench

Highly reproducible. Code, config, and 600GB of artifacts (1000+ checkpoints, logs) released at https://github.com/deem-data/erase-bench/.

📊 Experiments & Results

Evaluation Setup

Sequential unlearning of data streams on pre-trained recommenders

Benchmarks:

Collaborative Filtering Tasks (Rating/Interaction Prediction)
Session-based Recommendation Tasks (Next-item Prediction)
Next-basket Recommendation Tasks (Next-basket Prediction)

Metrics:

Recall
nDCG (Normalized Discounted Cumulative Gain)
RelItems (Relative Sensitive Items Remaining)
RelEff (Relative Efficiency/Utility vs Retraining)
Runtime (Latency)
Statistical methodology: Experiments run over 5 random seeds.
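
For reference, Recall and nDCG from the list above are commonly computed per user at a cutoff k. The sketch below uses the standard binary-relevance definitions; the cutoff values and exact variants used by the benchmark are not stated in this summary, so treat the details (for example, normalising recall by the number of relevant items) as assumptions.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the top-k divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    # Ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d"]        # hypothetical recommendation list
relevant = {"a", "d", "z"}           # hypothetical held-out items

print(round(recall_at_k(ranked, relevant, k=3), 4))   # 0.3333
print(round(ndcg_at_k(ranked, relevant, k=3), 4))     # 0.4693
```

Only "a" falls inside the top 3, so recall is 1/3, while nDCG additionally rewards it for sitting at rank 1.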

Key Results

Results indicate that recommender-specific unlearning methods (like SCIF) are significantly faster than retraining while maintaining comparable utility.
Unlearning effectiveness varies by architecture; recurrent/attention models are harder to unlearn than graph/matrix models using general-purpose methods.

Benchmark	Metric	Baseline	This Paper	Δ
ERASE (General)	Speedup vs Retraining	1.0	1000.0	+999.0
Sensitive Item Unlearning	Robustness	Unstable	Stable	Improved

Main Takeaways

Approximate unlearning can match retraining utility in some settings but varies widely across datasets and model architectures.
General-purpose unlearning methods (NeurIPS competition winners) often fail on sequential recommender architectures (RNNs, Attention), highlighting a gap in current research.
Sequential unlearning of small batches is a distinct challenge; methods that work for one-shot deletion may degrade stability over repeated updates.
Recommender-specific unlearning approaches (SCIF, GIF) generally offer more robust performance than adapted computer vision/NLP methods.

📚 Prerequisite Knowledge

Prerequisites

Collaborative Filtering (CF)
Session-based Recommendation (SBR)
Next-basket Recommendation (NBR)
Machine Unlearning concepts (Exact vs. Approximate)

Key Terms

Machine Unlearning (MU): The process of removing specific data points from a trained model so it behaves as if it never saw them.

Collaborative Filtering (CF): Recommendation approach predicting preferences based on user-item interaction patterns (e.g., ratings).

Session-based Recommendation (SBR): Recommendation task predicting the next item in a short-term sequence or session.

Next-basket Recommendation (NBR): Recommendation task predicting a set of items (basket) a user will purchase next.

Forget Set: The subset of training data marked for deletion.

Retain Set: The subset of training data that should remain in the model.

Exact Unlearning: Retraining from scratch or using methods mathematically guaranteed to match the retrained distribution.

Approximate Unlearning: Modifying model parameters to estimate the effect of retraining, trading theoretical guarantees for speed.

Influence Functions: A technique to estimate how model parameters would change if a specific training point were removed, often used for approximate unlearning.

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that values correct items appearing higher in the recommendation list.

Recall: The fraction of relevant items that are successfully retrieved.

RelItems: A metric measuring how many users still receive sensitive item recommendations after unlearning, compared to a retrained model.

RelEff: A metric comparing the utility (nDCG) of an unlearned model against a model retrained from scratch.

SCIF: Second-order Correction with Influence Functions—an unlearning method using Hessian-based updates.

GIF: Graph Influence Functions—an unlearning method adapting influence functions for Graph Neural Networks.

CEU: Certified Edge Unlearning—a method for unlearning edges in GNNs with theoretical guarantees for linear models.

Kookmin: A heuristic unlearning method that resets specific parameters and fine-tunes on retain data.

Fanchuan: A heuristic unlearning method using KL-divergence minimization and contrastive loss.

Seif: A heuristic unlearning method adding noise to parameters followed by fine-tuning.