Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

📝 Paper Summary

LLM-based Recommendation · Data Leakage · Evaluation Methodology

This paper investigates how pre-training data leakage distorts recommender system evaluation by simulating contamination via LoRA, revealing that in-domain exposure inflates metrics while out-of-domain exposure degrades them.

Core Problem

LLM-based recommender systems may inadvertently memorize benchmark test data during pre-training, leading to artificially inflated performance metrics that do not reflect true recommendation capabilities.

Why it matters:

Current evaluations fail to distinguish between genuine user interest modeling and mere memorization of data artifacts
The integrity of leaderboards and benchmarks is compromised if models have seen the test set (data leakage)
Prior studies on leakage focus on QA or generation; the specific impact on recommendation (user interest/item representation) is unexplored

Concrete Example: If an LLM memorized a specific user's interaction history because the benchmark appeared in its pre-training corpus, it might recommend the correct 'next item' at test time from memory rather than from the user's actual preference patterns, which invalidates the test result.

Key Novelty

Simulating Benchmark Leakage via Dirty LLMs

Constructs a 'Dirty LLM' by fine-tuning a clean base model on a controlled mix of in-domain (target test data) and out-of-domain data using LoRA (Low-Rank Adaptation)
Isolates the variable of 'leakage' by keeping the base model frozen and only updating adapters, creating a controlled proxy for pre-training contamination
Identifies a 'Dual-Effect' where domain-relevant leakage boosts performance deceptively, while domain-irrelevant leakage harms it

Architecture

The complete experimental workflow for simulating and evaluating benchmark leakage.

Breakthrough Assessment

8/10

Identifies a critical, overlooked flaw in current LLM-Rec evaluation practices. Simulating leakage via LoRA is a clever, scalable way to audit how far benchmark results can be trusted without repeating full pre-training.

⚙️ Technical Details

Problem Definition

Setting: Measuring how benchmark data leaked into an LLM's training corpus affects the evaluation of LLM-based recommender systems

Inputs: Clean Base LLM parameters θ_0, Mixed Leakage Dataset D_leak

Outputs: Dirty LLM with parameters θ_dirty (Base + LoRA adapters)

Pipeline Flow

Data Construction: Sample In-Domain (ID) and Out-Of-Domain (OOD) data -> Create Mixed Leakage Dataset
Contamination Simulation: Clean LLM + Leakage Dataset -> LoRA Fine-Tuning -> Dirty LLM
Downstream Evaluation: Clean/Dirty LLM -> Recommendation Backbone -> Performance Comparison

System Modules

Leakage Data Constructor

Create controlled contamination datasets

Model or implementation: Sampling Algorithm

Contamination Adapter

Inject leakage knowledge into the model

Model or implementation: LoRA (Low-Rank Adaptation)
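
The adapter idea behind this module can be illustrated with a toy linear layer. This is a minimal pure-Python sketch, not the authors' implementation: the class name `LoRALinear`, the rank `r=2`, and `alpha=4` are hypothetical (the summary does not report the paper's rank or alpha), and the initialization follows the common LoRA convention of a small random A and a zero B.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class LoRALinear:
    """Toy layer computing y = W x + (alpha / r) * B (A x), with W frozen."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                 # frozen base weights, never updated
        self.scale = alpha / r     # hypothetical values; not reported in the summary
        # Common LoRA init: A small random, B zero, so the adapter
        # contributes nothing before any fine-tuning step.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Only the entries of A and B would receive gradient updates."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)

W = [[1.0, 2.0, 0.0, 0.0], [0.0, 1.0, 0.0, 3.0], [0.5, 0.0, 0.0, 0.0]]
layer = LoRALinear(W, r=2, alpha=4)
x = [1.0, 1.0, 1.0, 1.0]
clean_out = matvec(W, x)
dirty_out_before_training = layer.forward(x)
```

Before any update the adapted layer reproduces the clean output exactly, so any later difference between clean and dirty behavior is attributable to the adapter matrices alone.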

Downstream Recommender

Perform recommendation task using the potentially contaminated backbone

Model or implementation: LLMRec or LLMRec+Collab variants

Novel Architectural Elements

Simulation framework using LoRA adapters as a proxy for 'memorized artifacts' from pre-training, allowing controlled study of leakage without expensive full pre-training

Modeling

Base Model: Vicuna-7B

Training Method: Low-Rank Adaptation (LoRA)

Objective Functions:

Purpose: Train LoRA adapters to memorize leakage data.

Formally: L_LLM = - Σ log P(y|x; θ_0 + Δθ_LoRA) (Standard Next-Token Prediction NLL)

Adaptation: LoRA (only the adapter matrices A and B are optimized; the base parameters θ_0 stay frozen)
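
Written out per token, the objective above might look as follows. This is a sketch under assumptions: the token index t, the per-layer update ΔW, the rank r, the scaling α, and the dimensions d and k follow the standard LoRA parameterization and do not appear in the summary.

```latex
\mathcal{L}_{\mathrm{LLM}}(A, B)
  = - \sum_{(x, y) \in D_{\mathrm{leak}}} \sum_{t=1}^{|y|}
      \log P\bigl(y_t \mid x, y_{<t};\ \theta_0 + \Delta\theta_{\mathrm{LoRA}}\bigr),
\qquad
\Delta W = \frac{\alpha}{r}\, B A,\quad
B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)
```

Gradients flow only into A and B, so θ_0 is identical in the clean and dirty models.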

Training Data:

In-Domain (ID): Randomly sample 10% of target dataset
Out-of-Domain (OOD): Sampled from 6 external datasets (Epinions, Last.fm, MIND, Amazon-Sports, Amazon-Beauty, Gowalla)
Total Leakage Size: |D_leak| = 7 * |D_ID| (1 part ID + 6 parts OOD)
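
The sampling recipe above can be sketched in a few lines of Python. This is an illustration, not the paper's code: the function name `build_leakage_dataset`, the seed, and the synthetic records are hypothetical, and the sketch assumes each OOD source contributes as many records as the ID sample, which is one reading of |D_leak| = 7 * |D_ID|.

```python
import random

def build_leakage_dataset(target, ood_sources, id_ratio=0.1, seed=0):
    """Mix an ID sample of the target dataset with equally sized OOD samples."""
    rng = random.Random(seed)
    n_id = max(1, round(len(target) * id_ratio))
    # In-domain part: a random 10% of the target dataset by default.
    leak = [("ID", rec) for rec in rng.sample(target, n_id)]
    for name, records in ood_sources.items():
        # Assumption: every OOD source contributes |D_ID| records.
        k = min(n_id, len(records))
        leak.extend((name, rec) for rec in rng.sample(records, k))
    rng.shuffle(leak)
    return leak

# Synthetic stand-ins for real interaction records.
target = [f"target_interaction_{i}" for i in range(1000)]
ood_names = ["Epinions", "Last.fm", "MIND",
             "Amazon-Sports", "Amazon-Beauty", "Gowalla"]
ood_sources = {name: [f"{name}_interaction_{i}" for i in range(500)]
               for name in ood_names}

leak = build_leakage_dataset(target, ood_sources)
n_id_records = sum(1 for source, _ in leak if source == "ID")
```

With 1,000 target records this yields 100 ID records and 700 records in total, matching the 1:6 composition stated above.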

Key Hyperparameters:

ID_sampling_ratio: 0.1
OOD_source_count: 6

Compute: Not reported in the paper

Comparison to Prior Work

vs. Carlini et al. (2021): Extends data leakage analysis from general text generation/QA to the specific domain of Recommender Systems
vs. Standard Evaluation: Introduces a 'Dirty LLM' baseline to explicitly measure the gap between true capability and memorization-induced inflation
Novelty: First study to experimentally simulate and quantify benchmark leakage in LLM-based recommendation [not cited in paper]

Limitations

Simulates leakage via LoRA fine-tuning rather than full pre-training, so the measured effects are likely a conservative lower bound on real contamination
Requires constructing specific mixed datasets which may not perfectly mirror real-world chaotic pre-training data distributions
Focuses on specific LLM-Rec architectures; impact might vary for other integration paradigms

Reproducibility

Code: https://github.com/yusba1/LLMRec-Data-Leakage

The paper specifies the base model (Vicuna-7B) and the datasets used for leakage construction (Epinions, Last.fm, etc.). Exact LoRA rank and alpha values are not present in the provided text.

📊 Experiments & Results

Evaluation Setup

Comparison of Recommendation Performance (AUC/UAUC) between Clean LLMs and Dirty LLMs (fine-tuned on leakage data)

Benchmarks:

Target Evaluation Datasets (Sequential Recommendation / Top-K Recommendation)

Metrics:

AUC (Area Under the ROC Curve)
UAUC (User-Averaged AUC: AUC computed per user, then averaged across users)
Statistical methodology: Not explicitly reported in the paper
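
For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how they are commonly computed. It is not taken from the paper's code: the function names and the toy data are hypothetical, and skipping users whose labels are all positive or all negative is a common convention assumed here rather than something the summary states.

```python
from collections import defaultdict

def auc(labels, scores):
    """ROC AUC: chance that a random positive outscores a random negative.

    Ties count as 0.5. Returns None when only one class is present.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def uauc(user_ids, labels, scores):
    """Average of per-user AUCs, skipping users with a single label class."""
    by_user = defaultdict(lambda: ([], []))
    for u, l, s in zip(user_ids, labels, scores):
        by_user[u][0].append(l)
        by_user[u][1].append(s)
    per_user = [auc(ls, ss) for ls, ss in by_user.values()]
    per_user = [a for a in per_user if a is not None]
    return sum(per_user) / len(per_user) if per_user else None

# Toy predictions for two users (made-up numbers).
users = ["u1", "u1", "u1", "u2", "u2", "u2"]
labels = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.3, 0.8, 0.5]

global_auc = auc(labels, scores)        # 7 of 9 positive-negative pairs ranked correctly
user_auc = uauc(users, labels, scores)  # (1.0 + 0.5) / 2 = 0.75
```

The two numbers differ because AUC also compares scores across users, while UAUC only rewards correct ranking within each user's own items.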

Main Takeaways

Dual Effect of Leakage: In-domain (ID) data produces spurious gains, while out-of-domain (OOD) data leaves performance stable or degrades it.
In-domain (ID) leakage acts as a 'trap,' creating substantial but fake performance improvements that mask the model's actual inability to generalize.
Out-of-domain (OOD) leakage acts as contamination, typically degrading recommendation accuracy by interfering with item characteristics learning.
The 'Dirty LLM' simulation via LoRA isolates the impact of memorization and indicates that even lightweight parameter updates can significantly distort benchmark results.
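
One way to turn the clean-versus-dirty comparison into an audit check is sketched below. The function name, the tolerance `tol`, and the example metric values are all hypothetical; none of the numbers come from the paper.

```python
def classify_leakage_effect(clean_metric, dirty_metric, tol=0.005):
    """Label the change from a clean to a dirty backbone.

    tol is a hypothetical threshold below which the change is treated
    as noise; a real audit would calibrate it from repeated runs.
    """
    delta = dirty_metric - clean_metric
    if delta > tol:
        return "spurious_gain", delta
    if delta < -tol:
        return "degradation", delta
    return "stable", delta

# Made-up AUC values, for illustration only.
gain_label, _ = classify_leakage_effect(0.70, 0.78)
drop_label, _ = classify_leakage_effect(0.70, 0.65)
flat_label, _ = classify_leakage_effect(0.70, 0.702)
```

A gain that shows up only when the dirty model has seen in-domain data is the warning sign described above: it reflects memorization rather than better preference modeling.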

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and fine-tuning
Familiarity with Recommender Systems (collaborative filtering, user/item representation)
Knowledge of Low-Rank Adaptation (LoRA)

Key Terms

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pre-trained model weights and injects trainable low-rank decomposition matrices

In-domain (ID) leakage: Contamination where the model is exposed to data from the same domain/dataset as the target evaluation benchmark (e.g., seeing the test set)

Out-of-domain (OOD) leakage: Contamination where the model is exposed to data from external, unrelated sources

LLMRec: A recommendation paradigm that uses LLMs directly for recommendation with minimal architectural changes (e.g., via prompting)

LLMRec+Collab: A recommendation paradigm that integrates collaborative filtering signals (like matrix factorization embeddings) into the LLM's input space

Dirty LLM: The model resulting from fine-tuning a clean base model on the leakage dataset to simulate contamination