
Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

Dario Di Palma, Felice Antonio Merra, Maurizio Sfilio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Politecnico di Bari, Cognism
arXiv (2025)
Recommendation Memory Factuality Benchmark

πŸ“ Paper Summary

LLM Evaluation, Recommender Systems, Data Contamination
This study shows that popular LLMs such as GPT-4o and Llama-3 have memorized substantial portions of the MovieLens-1M dataset, which inflates their measured recommendation performance and amplifies popularity bias.
Core Problem
LLMs are increasingly used for recommendation tasks, but it is unclear whether their performance stems from genuine reasoning or from recalling evaluation datasets (such as MovieLens-1M) that they memorized during pre-training.
Why it matters:
  • Benchmarking on memorized datasets (data leakage) yields non-generalizable results, creating a false sense of progress in Recommender Systems research
  • Memorization amplifies biases, causing models to over-recommend popular items that were more frequent in their training data
  • Unfair comparisons arise when LLMs with prior knowledge of the test set are compared against standard recommenders trained from scratch
Concrete Example: When prompted to complete a user's profile, GPT-4o reconstructs the exact age, occupation, and zip code for 16.52% of MovieLens-1M users without seeing the data in context. Similarly, it recalls 80.76% of item titles given only an item ID, strong evidence that it 'knows' the test set.
Key Novelty
Recommendation Dataset Memorization Framework
  • Defines formal 'Memorization Coverage' metrics to quantify how many items, user profiles, and interactions an LLM can reproduce exactly from memory via prompting
  • Establishes a direct correlation between the degree of memorization (leakage) and the model's performance on standard recommendation tasks
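The coverage idea above can be illustrated with a minimal sketch: the share of ground-truth records that a model reproduces exactly when queried by key. The function name `memorization_coverage`, the `stub_model` callable, and the four-item toy catalogue are assumptions made for illustration; the paper's actual prompts and matching rules are not reproduced here.

```python
from typing import Callable, Dict


def memorization_coverage(
    ground_truth: Dict[str, str],
    query_model: Callable[[str], str],
) -> float:
    """Fraction of records whose model answer matches the ground truth exactly."""
    if not ground_truth:
        return 0.0
    hits = 0
    for key, expected in ground_truth.items():
        answer = query_model(key)
        # Exact match after trimming whitespace; no fuzzy matching is applied.
        if answer.strip() == expected.strip():
            hits += 1
    return hits / len(ground_truth)


# Toy catalogue standing in for an ID -> title table (illustrative only).
items = {
    "1": "Toy Story (1995)",
    "2": "Jumanji (1995)",
    "3": "Grumpier Old Men (1995)",
    "4": "Waiting to Exhale (1995)",
}

# Hypothetical stand-in for an LLM call: it "remembers" two titles correctly,
# gets one wrong, and has no answer for the last.
fake_memory = {
    "1": "Toy Story (1995)",
    "2": "Jumanji (1995)",
    "3": "Grumpy Old Men (1993)",
}


def stub_model(item_id: str) -> str:
    return fake_memory.get(item_id, "Unknown")


coverage = memorization_coverage(items, stub_model)
print(f"Item coverage: {coverage:.2%}")
```

In a real probe, `stub_model` would be replaced by a function that sends a prompt containing the ID to the LLM and returns its completion; the same function could score user profiles by encoding each profile as a single string.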
Evaluation Highlights
  • GPT-4o memorizes 80.76% of MovieLens-1M items (titles from IDs) and 16.52% of user attributes (exact matches)
  • High memorization correlates with inflated performance: GPT-4o achieves HR@1 of 0.2796, vastly outperforming the standard BPRMF baseline (HR@1 of 0.0406)
  • Strong popularity bias: GPT-4o retrieves 89.06% of the top-20% most popular items but only 63.97% of the least popular ones
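For readers unfamiliar with the two quantities in these highlights, the sketch below shows one common way to compute HR@k and to split recall coverage by item popularity. All names, the toy data, and the choice to compare the top 20% of items against all remaining items are assumptions for illustration; the paper's exact bucket definition and evaluation protocol may differ.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def hit_rate_at_k(
    ranked: Dict[str, Sequence[str]],
    held_out: Dict[str, str],
    k: int = 1,
) -> float:
    """Share of users whose held-out item appears in their top-k ranked list."""
    if not held_out:
        return 0.0
    hits = sum(
        1 for user, target in held_out.items()
        if target in ranked.get(user, [])[:k]
    )
    return hits / len(held_out)


def coverage_by_popularity(
    interactions: List[str],
    recalled: Set[str],
    head_share: float = 0.2,
) -> Dict[str, float]:
    """Recall coverage for the most popular items versus all remaining items."""
    counts = Counter(interactions)
    ordered = [item for item, _ in counts.most_common()]  # most popular first
    cut = max(1, int(len(ordered) * head_share))
    head, tail = ordered[:cut], ordered[cut:]

    def share(bucket: List[str]) -> float:
        return sum(i in recalled for i in bucket) / len(bucket) if bucket else 0.0

    return {"head": share(head), "tail": share(tail)}


# Toy data (illustrative only): one held-out item per user.
ranked = {"u1": ["a", "b"], "u2": ["c", "a"], "u3": ["b", "d"], "u4": ["e"]}
held_out = {"u1": "a", "u2": "a", "u3": "d", "u4": "e"}
hr1 = hit_rate_at_k(ranked, held_out, k=1)
hr2 = hit_rate_at_k(ranked, held_out, k=2)

# Item "a" has 5 interactions, "b" 4, "c" 3, "d" 2, "e" 1.
interactions = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 2 + ["e"]
buckets = coverage_by_popularity(interactions, recalled={"a", "b", "c"})
print(hr1, hr2, buckets)
```

A gap between the `head` and `tail` values in this kind of breakdown is what the popularity-bias highlight describes: recall is higher for items that appear more often in the interaction log.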
Breakthrough Assessment
8/10
An important audit for the RecSys field. It calls into question the validity of much recent LLM-based recommendation research by presenting evidence of substantial data leakage on a widely used benchmark.