SPRec: Self-Play to Debias LLM-based Recommendation

📝 Paper Summary

LLM-based Recommendation · Bias Mitigation in Recommender Systems · Preference Alignment

SPRec uses a self-play mechanism where an LLM treats its own biased predictions as negative samples during Direct Preference Optimization (DPO), adaptively suppressing over-recommended items without external data.

Core Problem

Direct Preference Optimization (DPO) in recommender systems inherently amplifies popularity bias because its optimal policy disproportionately favors items with high occurrence in the training data, leading to severe homogeneity and filter bubbles.

Why it matters:

Current alignment methods like DPO degrade user experience by narrowing recommendations to a few popular items (filter bubbles)
Existing bias mitigation strategies rely on manual rules or external knowledge, which limits their general applicability
LLMs naturally suffer from token-level biases (favoring common words) and item-level biases (favoring popular concepts like 'Batman'), which post-training exacerbates

Concrete Example: In a movie recommendation task, a standard DPO-tuned model might exclusively recommend the 'Batman' series regardless of user history because it appears frequently in training. SPRec detects this over-recommendation by treating the model's own 'Batman' prediction as a negative sample in the next round, suppressing it.

Key Novelty

Self-Play Recommendation Tuning (SPRec)

Iterative self-correction: The model undergoes rounds of Supervised Fine-Tuning (SFT) followed by DPO.
Dynamic negative sampling: Instead of random negatives, the DPO step uses the model's own predictions from the previous iteration as negative samples.
Adaptive suppression: By treating its own high-probability outputs as rejected samples in DPO, the model learns to down-weight the items it currently over-recommends, i.e., its own biases.

Architecture

The iterative self-play framework of SPRec.

Evaluation Highlights

Outperforms standard DPO by 28.9% in fairness on MovieLens-1M (a relative reduction in MGU, where lower is better) while maintaining or improving accuracy
Substantially reduces popularity bias: in cold-start settings, the share of recommendations going to the most popular item group drops from ~95% under DPO to a balanced level comparable to SFT (~25%)
Achieves Pareto improvement: simultaneously improves accuracy (NDCG@10) and fairness compared to baselines like KTO and IPO

Breakthrough Assessment

7/10

Identifies a critical flaw in applying DPO to recommendation (bias amplification) and provides an elegant, data-free solution via self-play. High practical value for LRS alignment.

⚙️ Technical Details

Problem Definition

Setting: Top-K item recommendation using LLMs as a generative backbone

Inputs: User interaction history and context x

Outputs: Ordered list of recommended item names y

Pipeline Flow

Initialization: SFT on offline data
Iteration Loop:
1. Inference (Self-Generation): Generate recommendations using current model
2. Data Construction: Pair offline positive samples (y_w) with generated self-predictions as negatives (y_l)
3. Optimization: Update model using SFT loss + DPO loss with self-generated negatives
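
The loop above can be sketched with a toy stand-in for the LLM. Everything here is hypothetical and only illustrates the data flow: `ToyRecommender`, its per-item scores, and the additive `train_step` are placeholders for the real model and the SFT + DPO update. Skipping pairs where the prediction already equals the positive is an assumption of this sketch, not something the summary states.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # user history / context x
    chosen: str    # offline positive y_w
    rejected: str  # model's own prediction, used as y_l


def build_self_play_pairs(
    dataset: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[PreferencePair]:
    """Pair each offline positive with the current model's own prediction."""
    pairs = []
    for prompt, positive in dataset:
        prediction = generate(prompt)
        # Assumption: a pair with identical sides carries no preference signal.
        if prediction != positive:
            pairs.append(PreferencePair(prompt, positive, prediction))
    return pairs


class ToyRecommender:
    """Hypothetical stand-in for the LLM: one global score per item."""

    def __init__(self, scores: Dict[str, float]):
        self.scores = dict(scores)

    def generate(self, prompt: str) -> str:
        # Ignores the prompt on purpose, mimicking collapse onto a popular item.
        return max(self.scores, key=self.scores.get)

    def train_step(self, pairs: List[PreferencePair], step: float = 1.0) -> None:
        # Crude placeholder for the SFT + DPO update: raise chosen, lower rejected.
        for pair in pairs:
            self.scores[pair.chosen] += step
            self.scores[pair.rejected] -= step


def run_self_play(model: ToyRecommender, dataset, iterations: int = 3):
    history = []
    for _ in range(iterations):
        pairs = build_self_play_pairs(dataset, model.generate)  # steps 1 and 2
        history.append(pairs)
        model.train_step(pairs)                                 # step 3
    return history


dataset = [("user A history", "Heat"), ("user B history", "Alien"),
           ("user C history", "Batman")]
model = ToyRecommender({"Batman": 5.0, "Heat": 1.0, "Alien": 1.0})
history = run_self_play(model, dataset, iterations=3)
```

In the first round every negative is the over-recommended 'Batman', so its score is pushed down without any external signal about which item is biased.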

System Modules

Base LLM

Generative model predicting items based on user context

Model or implementation: Llama-2-7b-hf / Llama-2-13b-hf

Novel Architectural Elements

Iterative Self-Play Loop: The pipeline uniquely feeds the model's inference output back into the training loop as negative samples for the next DPO step

Modeling

Base Model: Llama-2-7b-hf and Llama-2-13b-hf

Training Method: Self-Play Iterative DPO

Objective Functions:

Purpose: Maintain basic recommendation capability.

Formally: L_SFT on positive data (x, y_w)

Purpose: Suppress biased items generated by the model itself.

Formally: L_DPO, where y_w is the ground-truth item and y_l is the model's own prediction y_pred from the previous step
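
For reference, the two objectives can be written out using the standard SFT and DPO formulations, with the self-play substitution for y_l. The symbols \pi_\theta (current policy), \pi_{\mathrm{ref}} (reference policy), \sigma (logistic sigmoid), and \beta (DPO temperature) follow the usual DPO notation; which checkpoint serves as the reference in each round is not stated in this summary.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w)}\big[\log \pi_\theta(y_w \mid x)\big]

\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big)\Big],
  \qquad y_l = y_{\mathrm{pred}}
```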

Adaptation: LoRA (Low-Rank Adaptation)

Trainable Parameters: LoRA parameters (rank=8, alpha=16)
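
To give a sense of scale, the sketch below counts the extra parameters LoRA adds to a single weight matrix and the resulting scaling factor alpha / rank. It assumes a square 4096 × 4096 projection (the hidden size of Llama-2-7B); which modules are actually adapted is not stated here, so no total is computed.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha: float, rank: int) -> float:
    """The low-rank update is scaled by alpha / rank."""
    return alpha / rank


# Assumed shape: one 4096 x 4096 projection matrix.
per_matrix = lora_extra_params(4096, 4096, rank=8)
fraction_of_full = per_matrix / (4096 * 4096)
scaling = lora_scaling(alpha=16, rank=8)

print(per_matrix, fraction_of_full, scaling)
```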

Training Data:

Positives (y_w): Offline interaction logs
Negatives (y_l): Generated by the model itself during the self-play phase

Key Hyperparameters:

learning_rate: 2e-4
batch_size: 128
beta_dpo: 0.1
self_play_iterations: 3
lora_r: 8
lora_alpha: 16
max_context_length: 1024
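
To show what beta_dpo controls, here is a minimal per-pair DPO loss using the standard formulation with sequence log-probabilities as plain floats. The function name and the example log-probabilities are made up for illustration; only beta = 0.1 comes from the list above.

```python
import math


def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
at_reference = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has lowered the log-probability of its own earlier prediction (y_l).
after_suppression = dpo_pair_loss(-5.0, -8.0, -5.0, -5.0)

print(at_reference, after_suppression)
```

The loss falls as the self-generated negative becomes less likely relative to the reference, which is the pressure that suppresses over-recommended items. A small beta such as 0.1 makes the loss less sensitive to a given change in the log-ratio.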

Compute: 8 × NVIDIA A800 80GB GPUs

Comparison to Prior Work

vs. DPO: SPRec uses self-generated negatives to target model bias, whereas DPO uses random or static negatives, which amplify popularity bias
vs. IPO/KTO: SPRec incorporates an iterative self-correction loop rather than a single static optimization step
vs. SPIN [not cited in paper]: Similar self-play concept for LLMs, but SPRec applies it specifically to debiasing recommendation distributions rather than general text quality

Limitations

Computational cost increases linearly with the number of self-play iterations
Requires carefully balanced mixing of SFT and DPO losses to prevent catastrophic forgetting of positive preferences
Analysis focuses on popularity bias; other types of fairness (e.g., gender, race) are less explicitly explored

Reproducibility

Code: https://github.com/RegionCh/SPRec

Datasets (MovieLens, Goodreads) are public. Hyperparameters are detailed in the appendix.

📊 Experiments & Results

Evaluation Setup

Top-K Item Recommendation

Benchmarks:

MovieLens-1M (Movie Recommendation)
Goodreads (Book Recommendation)

Metrics:

NDCG@10 (Accuracy)
Recall@10 (Accuracy)
MGU (Missing Group Utility - Fairness/Bias)
JSD (Jensen-Shannon Divergence - Fairness/Bias)
Statistical methodology: Not explicitly reported in the paper
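
As a refresher on the listed metrics, here is a minimal sketch of Recall@K, NDCG@K with binary relevance, and Jensen-Shannon divergence between two discrete distributions. It uses the textbook definitions; the paper's exact evaluation protocol (for example, which distributions JSD compares) is not given in this summary, and MGU is omitted because its formula is not stated here.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG: discounted hits divided by the ideal ordering."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


ranked = ["a", "b", "c"]
print(recall_at_k(ranked, {"b"}), ndcg_at_k(ranked, {"b"}))
print(js_divergence([0.5, 0.5], [0.5, 0.5]), js_divergence([1.0, 0.0], [0.0, 1.0]))
```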

Key Results

SPRec consistently outperforms baselines in both accuracy (NDCG@10) and fairness (MGU, where lower is better) on MovieLens-1M and Goodreads.

Benchmark	Metric	Baseline	This Paper	Δ
MovieLens-1M	NDCG@10	0.0242	0.0392	+0.0150
MovieLens-1M	MGU (lower is better)	0.197	0.140	-0.057
Goodreads	NDCG@10	0.0270	0.0294	+0.0024
Goodreads	MGU (lower is better)	0.145	0.076	-0.069

Ablation studies show that using Self-Play Negatives is crucial for performance compared to Random Negatives.

Benchmark	Metric	Baseline (Random Negatives)	This Paper	Δ
MovieLens-1M	NDCG@10	0.0305	0.0392	+0.0087
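
The absolute deltas in the main results can be converted to relative changes with a few lines of arithmetic. The values below are copied from the rows above; the helper name is arbitrary. The MovieLens-1M MGU row works out to roughly -28.9%, matching the fairness figure quoted in the Evaluation Highlights.

```python
def relative_change(baseline: float, new: float) -> float:
    """Signed change relative to the baseline value."""
    return (new - baseline) / baseline


# (baseline, this paper) pairs from the main results rows.
rows = {
    ("MovieLens-1M", "NDCG@10"): (0.0242, 0.0392),
    ("MovieLens-1M", "MGU"): (0.197, 0.140),
    ("Goodreads", "NDCG@10"): (0.0270, 0.0294),
    ("Goodreads", "MGU"): (0.145, 0.076),
}

changes = {key: relative_change(base, new) for key, (base, new) in rows.items()}

for (benchmark, metric), change in changes.items():
    print(f"{benchmark} {metric}: {change:+.1%}")
```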

Experiment Figures

Bar chart comparing recommendation probability across popularity groups for SFT, DPO, and SPRec.

Conceptual illustration of Forward KL (mass-covering) vs Reverse KL (mode-seeking).

Main Takeaways

Standard DPO exacerbates popularity bias, often performing worse than SFT in fairness metrics.
SPRec effectively mitigates this by penalizing the model's own over-confident predictions via the self-play mechanism.
The method is robust, showing improvements across two datasets (MovieLens, Goodreads) and different model sizes (7B, 13B).
Theoretical analysis suggests DPO's reverse KL-divergence objective inherently seeks modes (peaks), causing it to latch onto popular items; SPRec's dynamic re-weighting counters this.
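
The mode-seeking versus mass-covering contrast can be checked on a toy discrete example. The three distributions are invented for illustration and the comparison is restricted to two hand-picked candidates, so this is a sanity check of the intuition rather than a reproduction of the paper's analysis.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


target = [0.49, 0.02, 0.49]    # two equally strong modes, little mass between
peaked = [0.96, 0.02, 0.02]    # candidate that commits to a single mode
spread = [1 / 3, 1 / 3, 1 / 3]  # candidate that covers everything evenly

# Forward KL(target || candidate): the SFT-style, mass-covering direction.
forward = {"peaked": kl(target, peaked), "spread": kl(target, spread)}

# Reverse KL(candidate || target): the mode-seeking direction.
reverse = {"peaked": kl(peaked, target), "spread": kl(spread, target)}

print("forward KL prefers:", min(forward, key=forward.get))
print("reverse KL prefers:", min(reverse, key=reverse.get))
```

Forward KL heavily penalizes the peaked candidate for missing the second mode, while reverse KL scores it better than the spread one. This mirrors the argument that a reverse-KL objective tends to concentrate on a few popular items.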

📚 Prerequisite Knowledge

Prerequisites

Understanding of Direct Preference Optimization (DPO) loss function
Familiarity with Recommender Systems metrics (NDCG, Recall)
Basic knowledge of KL-divergence (forward vs. reverse)

Key Terms

DPO: Direct Preference Optimization—an alignment method optimizing a policy to prefer chosen answers over rejected ones without a separate reward model

SFT: Supervised Fine-Tuning—training the model on positive examples (user history -> target item) using standard cross-entropy loss

Self-Play: A training mechanism where the model improves by interacting with its own previous versions; here, using its own outputs as negative feedback

MGU: Missing Group Utility—a fairness metric measuring the distribution mismatch between ground-truth user preferences and model recommendations

Reverse KL-divergence: A statistical measure minimized by DPO that encourages mode-seeking behavior (focusing on peaks), often leading to popularity bias

Forward KL-divergence: A statistical measure minimized by SFT that encourages mass-covering behavior (averaging the distribution), generally less biased than reverse KL

Filter bubble: A state where a recommender system isolates a user in a cultural or ideological bubble by showing only items they are already likely to agree with or know

LRS: LLM-based Recommendation System—using Large Language Models to perform recommendation tasks