Junfei Tan, Yuxin Chen, An Zhang, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Xiang Wang
National University of Singapore
arXiv (2025)
Recommendation, RL, P13N
📝 Paper Summary
Generative Recommendation, Reinforcement Learning with Verifiable Rewards (RLVR), LLM-based Recommendation
ReRe adapts reinforcement learning to generative recommendation by using constrained beam search for efficient negative sampling and an explicit ranking reward to penalize hard negatives.
Core Problem
Existing generative recommenders rely on low-quality negative sampling (random or static) and implicit likelihood-based rewards, leading to weak supervision and poor discriminative ability.
Why it matters:
Implicit rewards in methods like DPO are based on likelihood margins rather than true user preferences, making them prone to reward hacking where metrics improve but recommendation quality drops
Standard RLVR (like GRPO) fails in recommendation because the constrained item space leads to duplicate samples (low efficiency) and sparse binary rewards (all negatives get zero reward)
Concrete Example: In a standard RLVR setup for recommendation, a model might sample 16 items. If the target is 'Item A', and the model generates 'Item A' once and 'Item B' (an irrelevant item) 15 times, standard rule-based rewards assign 1 to 'Item A' and 0 to every 'Item B'. This treats all negatives equally, ignoring that some might be 'harder' (more plausible) negatives that require stronger penalties.
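The 16-sample example can be reproduced numerically. The sketch below is illustrative only: the function names `rule_reward` and `group_advantages` are hypothetical, and the group normalization is the mean/std form given later in this summary. It shows that the group contains only two distinct items and that every negative receives exactly the same advantage.

```python
import numpy as np

def rule_reward(generated, target):
    """Binary rule-based reward: 1.0 for an exact match with the target, else 0.0."""
    return np.array([1.0 if g == target else 0.0 for g in generated])

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (zero mean, unit std)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# The group from the example: one hit, fifteen copies of the same negative.
samples = ["Item A"] + ["Item B"] * 15
rewards = rule_reward(samples, target="Item A")
adv = group_advantages(rewards)

unique_items = len(set(samples))             # distinct items among the 16 samples
negative_advs = set(np.round(adv[1:], 6))    # advantages assigned to the negatives
print(unique_items, adv[0], negative_advs)
```

With a binary reward, nothing in the signal distinguishes a plausible negative from an implausible one, which is the gap the ranking reward is meant to close.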
Key Novelty
Reinforced Preference Optimization for Recommendation (ReRe)
Integrates constrained beam search into the RLVR sampling phase to efficiently generate diverse, valid, and hard negative items in a single pass, avoiding the redundancy of random sampling
Augments binary rule-based rewards with a 'ranking reward' that penalizes generated negatives based on their probability rank, providing fine-grained supervision beyond simple correctness
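To make the sampling idea concrete, here is a minimal sketch of beam search constrained by a prefix tree of item titles. It is not the authors' implementation: `build_trie`, `constrained_beam_search`, and `toy_model` are hypothetical names, the toy scorer stands in for an LLM and ignores the prefix, and its scores are unnormalized values chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

END = "<eos>"  # assumed end-of-title marker

def build_trie(titles: List[List[str]]) -> dict:
    """Prefix tree over tokenized item titles; END marks a complete title."""
    root: dict = {}
    for tokens in titles:
        node = root
        for tok in tokens + [END]:
            node = node.setdefault(tok, {})
    return root

def constrained_beam_search(
    next_logprobs: Callable[[Tuple[str, ...]], Dict[str, float]],
    trie: dict,
    beam_width: int,
    max_len: int = 16,
) -> List[Tuple[Tuple[str, ...], float]]:
    """Return up to beam_width distinct, valid titles with their scores."""
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-score, trie node)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score, node in beams:
            logps = next_logprobs(prefix)
            for tok, child in node.items():  # only tokens the trie allows
                if tok not in logps:
                    continue
                if tok == END:
                    finished.append((prefix, score + logps[tok]))
                else:
                    candidates.append((prefix + (tok,), score + logps[tok], child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda f: f[1], reverse=True)
    return finished[:beam_width]

def toy_model(prefix):
    # Hypothetical stand-in for an LLM: fixed token scores, prefix ignored.
    scores = {"red": 0.5, "blue": 0.3, "car": 0.6, "ball": 0.3, "purple": 0.1, END: 0.9}
    return {t: math.log(p) for t, p in scores.items()}

catalog = [["red", "car"], ["red", "ball"], ["blue", "car"]]
results = constrained_beam_search(toy_model, build_trie(catalog), beam_width=3)
print(results)
```

Because each finished hypothesis corresponds to a unique path in the trie, the returned items are valid and distinct by construction, and they arrive ordered by model score, so the top-ranked non-target items are the hard negatives.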
Architecture
The ReRe framework pipeline contrasting standard generation with the ReRe training process.
Evaluation Highlights
Achieves relative gains of 27.13% on Amazon Toys and 12.40% on Amazon Industrial datasets compared to traditional and LLM-based baselines
Ranking reward formulation further improves NDCG@K by 3.95% on Amazon Toys compared to standard rule-based rewards, validating the benefit of fine-grained supervision
Generalizes effectively across both base LLMs and SFT-initialized models, outperforming methods like D3 and S-DPO
Breakthrough Assessment
8/10
Successfully adapts RLVR (a hot topic in reasoning) to recommendation by addressing domain-specific challenges (constrained space, sparse rewards). Strong empirical gains and clear methodological motivation.
⚙️ Technical Details
Problem Definition
Setting: Generative recommendation where a model generates a target item title given a user's interaction history
Inputs: User interaction history prompt x_u
Outputs: Target item title i_t (selected from a fixed corpus of N items)
Pipeline Flow
Prompt Construction (History -> Text)
LLM Generation (Constrained)
Output Verification (Item Validity)
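The first and last stages of this flow can be sketched as follows. The prompt wording, the function names `build_prompt` and `is_valid_item`, and the item titles are all placeholders invented for illustration; the summary states only that a rule-based template is used, not its text.

```python
from typing import List, Set

def build_prompt(history: List[str]) -> str:
    """Format a user's interaction history as a text prompt (hypothetical template)."""
    items = ", ".join(f'"{title}"' for title in history)
    return f"The user has interacted with: {items}. Recommend the next item title:"

def is_valid_item(generated: str, corpus: Set[str]) -> bool:
    """Output verification: the generated title must exist in the item corpus."""
    return generated.strip() in corpus

corpus = {"Item A", "Item B", "Item C"}      # placeholder item titles
prompt = build_prompt(["Item B", "Item C"])
print(prompt)
print(is_valid_item(" Item A\n", corpus), is_valid_item("Item Z", corpus))
```

When decoding is constrained to the corpus, the validity check passes by construction; it matters mainly for unconstrained sampling, where a model can emit titles that do not exist.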
System Modules
Prompt Constructor
Formats user interaction history into a natural language prompt
Model or implementation: Rule-based template
Generative Recommender
Generates the target item title token-by-token
Model or implementation: Qwen2-0.5B (Base or SFT)
Modeling
Base Model: Qwen2-0.5B
Training Method: Reinforced Preference Optimization (ReRe), based on GRPO
Objective Functions:
Purpose: Maximize expected reward using importance sampling and clipping (GRPO).
Formally: L = E[ min( (pi/pi_old) * A, clip(...) * A ) - beta * KL ]
Purpose: Compute Advantage by normalizing rewards within a sampled group.
Formally: A_k = (r_k - mean(r)) / std(r)
Purpose: Calculate Total Reward per sample combining correctness and ranking quality.
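The three objectives above can be tied together in a small NumPy sketch. Several parts are assumptions rather than details confirmed by this summary: the ranking penalty is taken to be proportional to -1/log(rank + 1) and normalized so that the penalties in a group sum to -1, the clip range of 0.2 is a common default, and the KL term is omitted for brevity. All function names are hypothetical.

```python
import numpy as np

def rule_reward(is_target):
    """Binary correctness reward: 1 for the target item, 0 otherwise."""
    return is_target.astype(float)

def ranking_reward(is_target, ranks):
    """Assumed form: negatives with a better (smaller) rank get a larger penalty."""
    raw = np.where(is_target, 0.0, -1.0 / np.log(ranks + 1.0))
    total = -raw.sum()
    return raw / total if total > 0 else raw  # penalties sum to -1 in the group

def group_advantages(rewards, eps=1e-8):
    """A_k = (r_k - mean(r)) / std(r), computed within one sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """Clipped importance-weighted objective (KL penalty omitted in this sketch)."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return np.minimum(ratio * adv, clipped * adv).mean()

# Four beams ranked 1..4 by model probability; the target sits at rank 3.
ranks = np.array([1, 2, 3, 4])
is_target = np.array([False, False, True, False])
rewards = rule_reward(is_target) + ranking_reward(is_target, ranks)
adv = group_advantages(rewards)
print(rewards, adv)
```

Under these assumptions the negative at rank 1 receives the lowest total reward and the target the highest, so the advantage separates hard negatives from easy ones instead of assigning them all the same value.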
Code is publicly available at https://github.com/sober-clever/ReRe. Datasets are standard public benchmarks (Amazon, Yelp). Implementation details (hyperparameters, GPUs) are explicitly provided.
📊 Experiments & Results
Evaluation Setup
Next-item prediction (sequential recommendation)
Benchmarks:
Amazon Toys and Games (Sequential Recommendation)
Amazon Industrial and Scientific (Sequential Recommendation)
Yelp (Sequential Recommendation)
Metrics:
HR@K (Hit Ratio)
NDCG@K (Normalized Discounted Cumulative Gain)
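For next-item prediction there is a single relevant item per test case, which simplifies both metrics. A minimal sketch (function names are hypothetical, and per-user values would be averaged over the test set):

```python
import math
from typing import List

def hit_ratio_at_k(ranked: List[str], target: str, k: int) -> float:
    """1.0 if the target appears among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(position + 1)."""
    top = ranked[:k]
    if target not in top:
        return 0.0
    position = top.index(target) + 1  # 1-indexed rank of the target
    return 1.0 / math.log2(position + 1)

ranked = ["Item B", "Item A", "Item C"]
hr1 = hit_ratio_at_k(ranked, "Item A", k=1)
hr3 = hit_ratio_at_k(ranked, "Item A", k=3)
ndcg3 = ndcg_at_k(ranked, "Item A", k=3)
print(hr1, hr3, ndcg3)
```

HR@K only records whether the target made the cut, while NDCG@K also rewards placing it nearer the top, which is why a reward that pushes down high-ranked negatives would be expected to show up mainly in NDCG.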
Statistical methodology: Not explicitly reported in the paper
Experiment Figures
Analysis of reward hacking with different reward types (Semantic, Collaborative).
Comparison of Reward Margins in S-DPO vs ReRe.
Main Takeaways
ReRe consistently outperforms traditional (SASRec) and LLM-based (BigRec, TIGER, D3, S-DPO) recommenders across all datasets, showing the value of on-policy sampling and verifiable rewards.
Constrained Beam Search is superior to dynamic or common sampling for RL in recommendation, as it ensures diversity and validity in the constrained output space.
The proposed Ranking Reward (penalizing high-probability negatives) is more effective than dense proxy rewards like Semantic or Collaborative rewards, which are prone to reward hacking (improving reward scores but degrading ranking metrics).
Performance gains are robust across different backbone scales and families, and ReRe works well with both Base and SFT model initializations.
📚 Prerequisite Knowledge
Prerequisites
Generative Recommendation
Reinforcement Learning (RL)
Large Language Models (LLMs)
Key Terms
RLVR: Reinforcement Learning with Verifiable Rewards—RL methods that use objective, checkable correctness signals (like math answers or valid code) rather than learned reward models
GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by normalizing rewards within a group of samples generated from the same prompt
SFT: Supervised Fine-Tuning—training on correct input-output pairs before RL alignment
DPO: Direct Preference Optimization—an offline method aligning models to preferences using static pairs of chosen/rejected responses
Beam Search: A decoding algorithm that keeps only a fixed number of the highest-scoring partial sequences (the beam) at each step and extends them token by token
Constrained Decoding: Forcing the language model to generate only tokens that form valid item titles from a predefined corpus
Hard Negatives: Incorrect items that the model assigns high probability to; distinguishing these from the target is crucial for ranking performance