Reinforced Preference Optimization for Recommendation

📝 Paper Summary

Generative Recommendation · Reinforcement Learning with Verifiable Rewards (RLVR) · LLM-based Recommendation

ReRe adapts reinforcement learning to generative recommendation by using constrained beam search for efficient negative sampling and an explicit ranking reward to penalize hard negatives.

Core Problem

Existing generative recommenders rely on low-quality negative sampling (random or static) and implicit likelihood-based rewards, leading to weak supervision and poor discriminative ability.

Why it matters:

Implicit rewards in methods like DPO are based on likelihood margins rather than true user preferences, making them prone to reward hacking, where the reward improves but recommendation quality drops.
Standard RLVR (like GRPO) fails in recommendation for two reasons: the constrained item space leads to duplicate samples (low sampling efficiency), and binary rewards are sparse (every negative gets zero reward).

Concrete Example: In a standard RLVR setup for recommendation, a model might sample 16 items. If the target is 'Item A', and the model generates 'Item A' once and 'Item B' (an irrelevant item) 15 times, standard rule-based rewards assign 1 to 'Item A' and 0 to all 'Item B's. This treats all negatives equally, ignoring that some might be 'harder' (more plausible) negatives that require stronger penalties.
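The example above can be reproduced in a few lines. This is a minimal sketch: `rule_reward` and the variable names are illustrative assumptions, not taken from the paper's code.

```python
from collections import Counter


def rule_reward(sample: str, target: str) -> float:
    """Binary rule-based reward: 1 for the target item, 0 for anything else."""
    return 1.0 if sample == target else 0.0


target = "Item A"
# 16 sampled generations: the target once, the same irrelevant item 15 times.
samples = ["Item A"] + ["Item B"] * 15

rewards = [rule_reward(s, target) for s in samples]
unique_items = Counter(samples)
distinct_ratio = len(unique_items) / len(samples)

print(rewards)         # a single 1.0 followed by fifteen 0.0 values
print(distinct_ratio)  # 0.125: only 2 of the 16 samples are distinct
```

Only two distinct items come out of sixteen samples, and every negative receives the same zero reward regardless of how plausible it is.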

Key Novelty

Reinforced Preference Optimization for Recommendation (ReRe)

Integrates constrained beam search into the RLVR sampling phase to efficiently generate diverse, valid, and hard negative items in a single pass, avoiding the redundancy of random sampling
Augments binary rule-based rewards with a 'ranking reward' that penalizes generated negatives based on their probability rank, providing fine-grained supervision beyond simple correctness
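The sampling idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three-item `CATALOG`, the word-level tokens, and the `TOY_PREF` scores that stand in for the LLM's next-token distribution. It is not the paper's implementation, only a small instance of beam search restricted to valid item titles.

```python
import math

EOS = "<eos>"
# Hypothetical catalog: each item title is a tuple of word-level tokens.
CATALOG = [("lego", "city", "set"), ("lego", "star", "set"), ("toy", "car")]
# Hypothetical unnormalized token preferences standing in for an LLM.
TOY_PREF = {"lego": 0.6, "toy": 0.4, "city": 0.7, "star": 0.3,
            "set": 1.0, "car": 1.0, EOS: 1.0}


def allowed_next(prefix):
    """Tokens that keep `prefix` a valid prefix of some catalog title."""
    nxt = set()
    for title in CATALOG:
        if title[:len(prefix)] == prefix:
            nxt.add(title[len(prefix)] if len(prefix) < len(title) else EOS)
    return nxt


def toy_logprobs(prefix):
    """Renormalize the toy preferences over the allowed tokens only."""
    allowed = allowed_next(prefix)
    z = sum(TOY_PREF[t] for t in allowed)
    return {t: math.log(TOY_PREF[t] / z) for t in allowed}


def constrained_beam_search(beam_width):
    """Return up to `beam_width` distinct, valid titles, most probable first."""
    beams, finished = [((), 0.0)], []
    while beams:
        candidates = []
        for prefix, score in beams:
            for tok, lp in toy_logprobs(prefix).items():
                if tok == EOS:
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]


results = constrained_beam_search(beam_width=3)
for rank, (title, score) in enumerate(results, start=1):
    print(rank, " ".join(title), round(math.exp(score), 2))
```

Because each beam is a different token path, the returned items are distinct by construction, and their order gives the probability rank that a ranking reward can use.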

Architecture

The ReRe framework pipeline contrasting standard generation with the ReRe training process.

Evaluation Highlights

Achieves relative gains of 27.13% on Amazon Toys and 12.40% on Amazon Industrial datasets compared to traditional and LLM-based baselines
Ranking reward formulation further improves NDCG@K by 3.95% on Amazon Toys compared to standard rule-based rewards, validating the benefit of fine-grained supervision
Generalizes effectively across both base LLMs and SFT-initialized models, outperforming methods like D3 and S-DPO

Breakthrough Assessment

8/10

Successfully adapts RLVR (a hot topic in reasoning) to recommendation by addressing domain-specific challenges (constrained space, sparse rewards). Strong empirical gains and clear methodological motivation.

⚙️ Technical Details

Problem Definition

Setting: Generative recommendation where a model generates a target item title given a user's interaction history

Inputs: User interaction history prompt x_u

Outputs: Target item title i_t (selected from a fixed corpus of N items)

Pipeline Flow

Prompt Construction (History -> Text)
LLM Generation (Constrained)
Output Verification (Item Validity)
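The three stages above can be sketched as follows. This is a hedged illustration: the prompt wording, the function names (`build_prompt`, `recommend`), and the stub generator are all hypothetical, and the real system would call an LLM with constrained decoding in place of the `generate` callable.

```python
from typing import Callable, List, Optional, Set


def build_prompt(history: List[str]) -> str:
    """Rule-based template. The exact wording is an assumption."""
    listed = ", ".join(f'"{title}"' for title in history)
    return f"The user has interacted with: {listed}. Recommend the next item title:"


def recommend(history: List[str],
              generate: Callable[[str], str],
              corpus: Set[str]) -> Optional[str]:
    """Prompt construction -> generation -> validity check against the corpus."""
    prompt = build_prompt(history)
    title = generate(prompt).strip()
    # Output verification: only titles that exist in the item corpus count.
    return title if title in corpus else None


corpus = {"Toy Car", "Lego City Set"}
stub_llm = lambda prompt: " Toy Car "  # stand-in for the generative recommender
print(recommend(["Lego City Set"], stub_llm, corpus))
```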

System Modules

Prompt Constructor

Formats user interaction history into a natural language prompt

Model or implementation: Rule-based template

Generative Recommender

Generates the target item title token-by-token

Model or implementation: Qwen2-0.5B (Base or SFT)

Modeling

Base Model: Qwen2-0.5B

Training Method: Reinforced Preference Optimization (ReRe), based on GRPO

Objective Functions:

Purpose: Maximize expected reward using importance sampling and clipping (GRPO).

Formally: L = E[ min( (pi/pi_old) * A, clip(...) * A ) - beta * KL ]

Purpose: Compute Advantage by normalizing rewards within a sampled group.

Formally: A_k = (r_k - mean(r)) / std(r)

Purpose: Calculate Total Reward per sample combining correctness and ranking quality.

Formally: r(e_k, e_t) = R_rule(e_k, e_t) + lambda * R_ranking(e_k), where R_ranking penalizes high-probability negatives
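The objectives above can be traced numerically with a small sketch. Several pieces are assumptions and are marked as such in the comments: the concrete form of R_ranking (the summary only states that it penalizes high-probability negatives), the `eps` stabilizer, and the clipping range `clip_eps = 0.2`, which is the conventional PPO-style choice rather than a value reported here.

```python
import math
from statistics import mean, pstdev
from typing import List


def total_rewards(items: List[str], target: str, lam: float = 1.0) -> List[float]:
    """r = R_rule + lam * R_ranking for a group ordered by model probability.

    ASSUMPTION: R_ranking is sketched as -1/log2(rank + 1) for negatives
    (rank 1 = most probable), scaled so the penalties sum to -1, and 0 for
    the target.
    """
    raw = [0.0 if item == target else -1.0 / math.log2(rank + 1)
           for rank, item in enumerate(items, start=1)]
    scale = -sum(raw) or 1.0  # avoid dividing by zero when there are no negatives
    return [(1.0 if item == target else 0.0) + lam * (penalty / scale)
            for item, penalty in zip(items, raw)]


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_k = (r_k - mean(r)) / std(r); eps is an assumed numerical stabilizer."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(ratio * A, clip(ratio, 1 - clip_eps, 1 + clip_eps) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


group = ["Item B", "Item A", "Item C", "Item D"]  # most probable first
rewards = total_rewards(group, target="Item A")
advantages = group_advantages(rewards)
for item, r, a in zip(group, rewards, advantages):
    print(item, round(r, 3), round(a, 3))
```

With a purely binary reward the three negatives would share one advantage value; under the assumed ranking term, the most probable negative (`Item B`) receives the lowest reward and therefore the strongest push downward.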

Adaptation: Full fine-tuning

Training Data:

Constructed from Amazon Reviews (Toys, Industrial) and Yelp datasets

Key Hyperparameters:

learning_rate: 1e-5
batch_size: 512
beta: 1e-3
group_size_G: 16
epochs: 2

Compute: 8 NVIDIA H20 GPUs

Reproducibility

Code: https://github.com/sober-clever/ReRe

Datasets are standard public benchmarks (Amazon, Yelp). Implementation details (hyperparameters, GPUs) are explicitly provided.

📊 Experiments & Results

Evaluation Setup

Next-item prediction (sequential recommendation)

Benchmarks:

Amazon Toys and Games (Sequential Recommendation)
Amazon Industrial and Scientific (Sequential Recommendation)
Yelp (Sequential Recommendation)

Metrics:

HR@K (Hit Ratio)
NDCG@K (Normalized Discounted Cumulative Gain)
Statistical methodology: Not explicitly reported in the paper
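For next-item prediction with a single ground-truth item per test case, both metrics reduce to simple expressions. The sketch below assumes that single-target setting (so the ideal DCG is 1); the function names are illustrative.

```python
import math
from typing import List


def hit_at_k(ranked: List[str], target: str, k: int) -> float:
    """HR@K: 1 if the target appears in the top-K list, else 0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: List[str], target: str, k: int) -> float:
    """NDCG@K with one relevant item: 1 / log2(position + 1), or 0 if missed."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0


ranked = ["Item B", "Item A", "Item C"]
print(hit_at_k(ranked, "Item A", k=1), hit_at_k(ranked, "Item A", k=3))
print(round(ndcg_at_k(ranked, "Item A", k=3), 4))
```

Reported scores are the averages of these per-user values over the test set.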

Experiment Figures

Analysis of reward hacking with different reward types (Semantic, Collaborative).

Comparison of Reward Margins in S-DPO vs ReRe.

Main Takeaways

ReRe consistently outperforms traditional (SASRec) and LLM-based (BigRec, TIGER, D3, S-DPO) recommenders across all datasets, showing the value of on-policy sampling and verifiable rewards.
Constrained Beam Search is superior to dynamic or common sampling for RL in recommendation, as it ensures diversity and validity in the constrained output space.
The proposed Ranking Reward (penalizing high-probability negatives) is more effective than dense proxy rewards like Semantic or Collaborative rewards, which are prone to reward hacking (improving reward scores but degrading ranking metrics).
Performance gains are robust across different backbone scales and families, and ReRe works well with both Base and SFT model initializations.

📚 Prerequisite Knowledge

Prerequisites

Generative Recommendation
Reinforcement Learning (RL)
Large Language Models (LLMs)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—RL methods that use objective, checkable correctness signals (like math answers or valid code) rather than learned reward models

GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by normalizing rewards within a group of samples generated from the same prompt

SFT: Supervised Fine-Tuning—training on correct input-output pairs before RL alignment

DPO: Direct Preference Optimization—an offline method aligning models to preferences using static pairs of chosen/rejected responses

Beam Search: A decoding algorithm that keeps only the top-scoring partial sequences (the beam) at each step and expands those, yielding several distinct high-probability outputs in one pass

Constrained Decoding: Forcing the language model to generate only tokens that form valid item titles from a predefined corpus
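One common way to realize constrained decoding is to precompute, for every valid prefix, the set of tokens that may follow, and then mask all other tokens before normalizing. The sketch below is an illustration under assumptions: the toy `CORPUS`, the word-level tokens, and the hand-written logits are hypothetical, and a real system would apply the mask to the LLM's vocabulary logits.

```python
import math
from typing import Dict, Set, Tuple

EOS = "<eos>"
# Hypothetical item corpus, tokenized at word level for readability.
CORPUS = [("toy", "car"), ("toy", "train"), ("board", "game")]


def build_prefix_map(titles) -> Dict[Tuple[str, ...], Set[str]]:
    """Map each valid prefix to the tokens allowed to follow it."""
    allowed: Dict[Tuple[str, ...], Set[str]] = {}
    for title in titles:
        for i in range(len(title)):
            allowed.setdefault(title[:i], set()).add(title[i])
        allowed.setdefault(title, set()).add(EOS)  # a full title may only end
    return allowed


def masked_distribution(logits: Dict[str, float], prefix, prefix_map) -> Dict[str, float]:
    """Softmax over only the tokens that keep the output inside the corpus."""
    valid = prefix_map.get(prefix, set())
    kept = {tok: logit for tok, logit in logits.items() if tok in valid}
    if not kept:
        return {}
    z = sum(math.exp(logit) for logit in kept.values())
    return {tok: math.exp(logit) / z for tok, logit in kept.items()}


prefix_map = build_prefix_map(CORPUS)
# "game" has the largest logit but is invalid after "toy", so it is masked out.
logits = {"car": 0.0, "train": 0.0, "game": 5.0}
dist = masked_distribution(logits, ("toy",), prefix_map)
print(dist)
```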

Hard Negatives: Incorrect items that the model assigns high probability to; distinguishing these from the target is crucial for ranking performance