← Back to Paper List

Direct Preference Optimization for LLM-Enhanced Recommendation Systems

Chao Sun, Yaobo Liang, Yaming Yang, Shilin Xu, Tianmeng Yang, Yunhai Tong
Peking University
arXiv (2024)
Recommendation RL P13N Reasoning

📝 Paper Summary

LLM-Enhanced Recommendation LLM Alignment / Preference Optimization
DPO4Rec aligns LLM-generated reasoning with recommendation objectives by using the downstream recommender's performance metrics to construct preference pairs for Direct Preference Optimization.
Core Problem
LLMs often fail to generate optimal features for recommendation systems because their pre-training objectives (next-token prediction) do not align with recommendation tasks (ranking), and they lack feedback from the recommendation model itself.
Why it matters:
  • Zero-shot instruction tuning produces reasoning that sounds plausible to humans but may not actually help the numerical recommendation model improve accuracy
  • Traditional LLM alignment (RLHF) relies on human preference, which is expensive and may not correlate with recommendation metrics like NDCG or CTR
  • Existing methods treat the LLM and the Recommender as separate stages without a feedback loop, missing bidirectional optimization opportunities
Concrete Example: An LLM might summarize a user as 'likes action movies,' which is true but generic. However, a reasoning trace emphasizing 'prefers 90s action movies with specific actors' might yield a higher NDCG score when fed to the recommender. Standard prompting doesn't know which trace works better; DPO4Rec learns to generate the latter.
Key Novelty
Recommender-Feedback Alignment Loop
  • Instead of using a separate neural reward model trained on human data, DPO4Rec uses the *actual recommendation model's performance* (e.g., NDCG score) to evaluate LLM outputs
  • Constructs preference pairs (Chosen vs. Rejected) by sampling N reasoning traces and selecting the ones that result in the highest and lowest recommendation accuracy
  • Applies Direct Preference Optimization (DPO) to fine-tune the LLM to autonomously generate the 'high-performing' reasoning types
Architecture
Architecture Figure Figure 2
The DPO4Rec framework workflow, illustrating the cycle of reasoning generation, reward modeling via recommender scoring, and DPO alignment.
Evaluation Highlights
  • Outperforms KAR (ChatGPT-4o enhanced baseline) by 3.92% in NDCG@5 on Amazon-Beauty using PRM backbone
  • Achieves +1.45% MAP@5 improvement over the DLCM backbone on ML-1M dataset using Llama3.1-8B
  • Consistent improvements across 3 datasets (ML-1M, Amazon-Books, Amazon-Beauty) and 3 backbones (DLCM, PRM, SetRank)
Breakthrough Assessment
7/10
Novel application of DPO using system performance as the ground-truth signal rather than human preference. Strong empirical results, though the core innovation is a clever application of existing DPO mechanics to a new domain.
×