
RLHF Fine-Tuning of LLMs for Alignment with Implicit User Feedback in Conversational Recommenders

Zhongheng Yang, Aijia Sun, Yushang Zhao, Yinuo Yang, Dannier Li, Chengrui Zhou
Northeastern University, Northwestern University, University of Nebraska-Lincoln, Washington University in St. Louis, Columbia University
arXiv (2025)
Recommendation RL P13N

📝 Paper Summary

Conversational Recommender Systems (CRS) Reinforcement Learning from Human Feedback (RLHF)
This paper aligns Large Language Models in conversational recommenders by using Reinforcement Learning to optimize for implicit signals like dwell time and sentiment rather than just next-token prediction.
Core Problem
Traditional supervised fine-tuning of conversational recommenders relies on static labels and fails to capture dynamic, implicit user signals like dwell time, sentiment changes, or partial engagement.
Why it matters:
  • Supervised models often generate generic responses that do not adapt to changes in user satisfaction during the conversation
  • Explicit feedback (ratings) is sparse, whereas implicit feedback (clicks, hesitation) is abundant but noisy and hard to optimize for using standard losses
  • Misalignment between the model's training objective (text generation) and the user's goal (finding relevant items) leads to poor personalization
Concrete Example: A supervised model might recommend a popular movie simply because it appears frequently in the training data, ignoring that the user has just expressed a 'sad' sentiment in the chat. The proposed model, whose policy is trained to maximize the 'sentiment shift' reward, picks up on the negative sentiment and suggests uplifting content instead.
Key Novelty
Implicit Feedback Reward Modeling for RLHF
  • Constructs a composite reward function from implicit signals (simulated dwell time, sentiment polarity shift, semantic relevance) instead of using explicit human preference labels
  • Fine-tunes the recommender policy using PPO (Proximal Policy Optimization) to maximize this composite 'implicit satisfaction' score directly within the dialogue generation loop
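To make the two points above concrete, here is a minimal sketch of how such a composite reward and the standard PPO clipped surrogate could be written. The summary does not give the paper's actual formula, so the weighted-sum form, the weights, the 60-second dwell cap, and all names (`ImplicitSignals`, `composite_reward`, `ppo_clipped_objective`) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class ImplicitSignals:
    """Hypothetical container for one dialogue turn's implicit feedback."""
    dwell_seconds: float     # simulated time the user spent on the response
    sentiment_before: float  # sentiment polarity in [-1, 1] before the response
    sentiment_after: float   # sentiment polarity in [-1, 1] after the response
    relevance: float         # semantic relevance score, assumed to lie in [0, 1]


def composite_reward(s, w_dwell=0.4, w_sent=0.3, w_rel=0.3, max_dwell=60.0):
    """Weighted sum of normalized signals (weights are assumed, not from the paper)."""
    # Clamp dwell time to [0, max_dwell] and scale it to [0, 1].
    dwell = min(max(s.dwell_seconds, 0.0), max_dwell) / max_dwell
    # Polarity shift lies in [-2, 2]; halve it so it lies in [-1, 1].
    shift = (s.sentiment_after - s.sentiment_before) / 2.0
    return w_dwell * dwell + w_sent * shift + w_rel * s.relevance


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage would be derived from the reward.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


signals = ImplicitSignals(dwell_seconds=30.0, sentiment_before=-0.5,
                          sentiment_after=0.5, relevance=0.8)
reward = composite_reward(signals)
```

In a full training loop, a reward of this kind would be turned into advantages (for example by subtracting a value-function baseline) before being passed to the clipped objective; that step is omitted here.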
Evaluation Highlights
  • +13.7% improvement in Hit Rate@5 on the REDIAL dataset compared to a supervised GPT-2 baseline
  • +13.8% improvement in NDCG@5 on the OpenDialKG dataset, showing better ranking of relevant items
  • +17.1% gain in 'Satisfaction' (a composite metric of engagement and sentiment) on REDIAL after RLHF tuning
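For readers unfamiliar with the ranking metrics cited above, the following is a minimal sketch of Hit Rate@k and NDCG@k with binary relevance, using their common definitions. The function names and toy item IDs are illustrative, and the paper's exact evaluation protocol may differ.

```python
import math


def hit_rate_at_k(ranked, relevant, k=5):
    """1.0 if any relevant item appears in the top-k recommendations, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked, relevant, k=5):
    """Normalized discounted cumulative gain at k, with binary relevance."""
    # DCG: each relevant item at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    # Ideal DCG: all relevant items (up to k of them) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


ranked_items = ["a", "b", "c", "d", "e", "f"]  # toy ranking, best first
hit = hit_rate_at_k(ranked_items, {"c"})
ndcg = ndcg_at_k(ranked_items, {"c"})
```

Both functions score a single recommendation list; dataset-level figures such as those reported above are typically averages of these per-list values over all test dialogues.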
Breakthrough Assessment
7/10
Solid application of RLHF to the specific domain of conversational recommendation using implicit signals. While the feedback is simulated in experiments, the methodology addresses a key gap in aligning CRS with latent user preferences.