Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

📝 Paper Summary

LLM Evaluation Recommender Systems Data Contamination

This study demonstrates that popular LLMs like GPT-4o and Llama-3 have memorized significant portions of the MovieLens-1M dataset, leading to inflated recommendation performance and amplified popularity bias.

Core Problem

LLMs are increasingly used for recommendation tasks, but it is unclear if their performance stems from genuine reasoning or simply recalling the specific evaluation datasets (like MovieLens-1M) memorized during pre-training.

Why it matters:

Benchmarking on memorized datasets (data leakage) yields non-generalizable results, creating a false sense of progress in Recommender Systems research
Memorization amplifies biases, causing models to over-recommend popular items that were more frequent in their training data
Unfair comparisons arise when LLMs with prior knowledge of the test set are compared against standard recommenders trained from scratch

Concrete Example: When prompted to complete a user's profile, GPT-4o can reconstruct the exact age, occupation, and zip code for 16.52% of users in MovieLens-1M without seeing the data in context. Similarly, it can recall 80.76% of item titles given just an ID, proving it 'knows' the test set.

Key Novelty

Recommendation Dataset Memorization Framework

Defines formal 'Memorization Coverage' metrics to quantify how many items, user profiles, and interactions an LLM can reproduce exactly from memory via prompting
Establishes a direct correlation between the degree of memorization (leakage) and the model's performance on standard recommendation tasks

Evaluation Highlights

GPT-4o memorizes 80.76% of MovieLens-1M items (titles from IDs) and 16.52% of user attributes (exact matches)
High memorization correlates with inflated performance: GPT-4o achieves HR@1 of 0.2796, vastly outperforming the standard BPRMF baseline (HR@1 of 0.0406)
Strong popularity bias: GPT-4o retrieves 89.06% of the top-20% most popular items but only 63.97% of the least popular ones

Breakthrough Assessment

8/10

Critically important meta-analysis for the RecSys field. It challenges the validity of a significant portion of recent LLM-based recommendation research by proving massive data leakage.

⚙️ Technical Details

Problem Definition

Setting: Probing LLMs for exact retrieval of database records (items, users, interactions) via zero-shot or few-shot prompts

Inputs: A prompt containing a partial identifier (e.g., 'MovieID::1') or a sequence of interactions

Outputs: The missing metadata (e.g., 'Toy Story::Animation') or the next interaction in the sequence

Pipeline Flow

Select Target Data (Items, Users, or Interactions)
Construct Prompt (Hand-engineered few-shot prompt)
Query LLM (Temperature=0 for deterministic output)
Exact Match Verification (Compare output to ground truth)

System Modules

Item Probing (Data Extraction)

Assess if LLM maps ItemID to attributes

Model or implementation: Target LLM (e.g., GPT-4o, Llama-3.1)

User Probing (Data Extraction)

Assess if LLM maps UserID to demographic profile

Model or implementation: Target LLM

Interaction Probing (Data Extraction)

Assess if LLM predicts next specific interaction given history

Model or implementation: Target LLM

Modeling

Base Model: Analyzes multiple families: GPT (GPT-4o, GPT-3.5 turbo) and Llama (Llama-3.3 70B, Llama-3.1 405B/70B/8B, Llama-3.2 3B/1B)

Compute: Inference only. Temperature set to 0. Seed fixed to 42.

Comparison to Prior Work

vs. Carlini et al.: Focuses specifically on structured Recommender Systems datasets (User/Item/Ratings) rather than general text
vs. Standard RS Baselines (e.g., LightGCN): Demonstrates that LLM 'state-of-the-art' results are likely due to training on the test set rather than superior architecture

Limitations

Study limited to MovieLens-1M; other datasets might be less contaminated
Exact match metric might underestimate memorization (paraphrased recall is not counted)
Does not explore methods to de-contaminate or mitigate the memorization
Relying on prompting might not reveal all memorized knowledge (lower bound)

Reproducibility

Code availability mentioned as 'publicly available at: GitHub' but the URL is missing from the provided text snippet. The study uses standard open models (Llama) and accessible APIs (OpenAI), and the dataset (MovieLens-1M) is public. Prompts are provided in Figures 1 and 2.

📊 Experiments & Results

Evaluation Setup

Probe models for exact reproduction of MovieLens-1M data and compare recommendation performance on the same dataset

Benchmarks:

MovieLens-1M (Data Extraction / Recommendation)

Metrics:

Items' Memorization Coverage (Cov_I)
Users' Memorization Coverage (Cov_U)
Interaction Memorization Coverage (Cov_R)
Hit Rate (HR@k)
nDCG@k
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Quantifying how much of the MovieLens-1M dataset various LLMs have memorized.
MovieLens-1M	Cov_I (Item Coverage)	1.93	80.76	+78.83
MovieLens-1M	Cov_I (Item Coverage)	5.82	15.09	+9.27
MovieLens-1M	Cov_U (User Coverage)	5.84	17.38	+11.54
Evaluating recommendation performance to show the link between memorization and high scores.
MovieLens-1M	HR@1	0.0406	0.2796	+0.2390
MovieLens-1M	HR@1	0.0687	0.1975	+0.1288
Popularity bias analysis showing popular items are memorized at much higher rates.
MovieLens-1M	Item Retrieval Rate (Top 20% Pop)	63.97	89.06	+25.09
MovieLens-1M	Item Retrieval Rate (Top 20% Pop)	2.29	13.48	+11.19

Main Takeaways

All analyzed LLMs exhibit non-trivial memorization of MovieLens-1M, with GPT-4o retrieving over 80% of items and significant portions of user/interaction data.
There is a clear positive relationship between model size, extent of memorization, and downstream recommendation performance, suggesting that 'state-of-the-art' LLM results on this dataset are unreliable.
Memorization is highly skewed towards popular items (popularity bias), meaning LLM-based recommenders may exacerbate the 'rich-get-richer' effect and fail to generalize to long-tail items.
Evaluation of LLMs on public datasets like MovieLens-1M requires extreme caution, as the test set is likely part of the model's pre-training corpus.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Recommender Systems evaluation (Train/Test splits)
Familiarity with Large Language Models and pre-training datasets
Concept of Data Leakage/Contamination in Machine Learning

Key Terms

Memorization Coverage: The percentage of items, users, or interactions in a dataset that an LLM can reproduce exactly when prompted with their identifiers

MovieLens-1M: A classic recommender systems dataset containing 1 million ratings from 6,000 users on 4,000 movies

Popularity Bias: The tendency of a model to recommend or memorize items that appear frequently in the data, often ignoring niche or 'long-tail' items

HR@1: Hit Rate at 1—a metric measuring whether the single top-recommended item is the correct ground-truth item

nDCG: Normalized Discounted Cumulative Gain—a ranking metric that accounts for the position of relevant items in the recommendation list

Data Leakage: When test data is inadvertently included in the training set, allowing the model to 'cheat' by memorizing answers rather than generalizing

Zero-shot prompting: Asking the model to perform a task without providing any examples in the prompt

Few-shot prompting: Providing a few examples of the task (e.g., input-output pairs) in the prompt to guide the model