The Hong Kong Polytechnic University,
National University of Singapore,
University of Science and Technology of China,
Harbin Institute of Technology (Shenzhen)
arXiv (2025)
Recommendation, Reasoning, RL, P13N
📝 Paper Summary
LLM-based Recommendation, Reasoning in Recommender Systems
R2ec integrates reasoning and recommendation into a single LLM with a dual-head architecture, using a reinforcement learning framework to jointly optimize reasoning chains and efficient item prediction without human annotations.
Core Problem
Current approaches decouple reasoning from recommendation (requiring two separate models) or rely on slow autoregressive decoding of item IDs, leading to high latency and suboptimal disjoint optimization.
Why it matters:
Running separate reasoning and recommendation models doubles resource costs and latency
Alternate optimization (freezing one module to train the other) prevents true end-to-end alignment of reasoning rationales with ranking objectives
Autoregressive generation of item identifiers in large models is computationally expensive compared to direct prediction
Concrete Example: A standard reasoning recommender might first use a large LLM to generate a user profile analysis, then pass that text to a separate BERT-based ranker to score items. This requires maintaining two heavy models in memory and prevents the ranker's errors from directly updating the LLM's reasoning logic during training.
Key Novelty
Unified Dual-Head Large Recommender (R2ec)
Equips a decoder-only LLM with two heads: a language head for generating reasoning chains and a recommendation head for single-step item prediction
Uses the RecPO framework to train without reasoning annotations by sampling diverse reasoning paths and optimizing them via a fused reward (ranking + similarity)
Architecture
The dual-head architecture of R2ec and the inference flow.
Breakthrough Assessment
8/10
Proposes a unified architecture that solves the latency bottleneck of reasoning recommenders while enabling end-to-end RL training. Strong theoretical contribution in the RecPO framework.
⚙️ Technical Details
Problem Definition
Setting: Sequential recommendation with intrinsic reasoning generation
Inputs: Tokenized user interaction history and instruction prompt x_u
Outputs: A sequence of reasoning tokens o_1:T followed by a recommended item v
Pipeline Flow
Input Processing: User History -> Prompt
Reasoning Generation: Backbone -> LM Head (Autoregressive)
Recommendation: Final Hidden State -> Rec Head (Single-step)
System Modules
Input Processor
Converts user interaction history into a natural language prompt
Model or implementation: Tokenizer
LLM Backbone (Reasoning Generation)
Encodes input and generates hidden states for reasoning
Model or implementation: Decoder-only Transformer
Language Head (lm_head) (Reasoning Generation)
Autoregressively generates the chain of reasoning tokens
Model or implementation: Linear layer mapping hidden state to vocabulary size
Recommendation Head (rec_head)
Predicts the next item based on the final state of the reasoning chain
Model or implementation: Item embedding table dot-product
Novel Architectural Elements
Dual-head output mechanism where reasoning (text) and recommendation (item embeddings) share the same semantic hidden space but use distinct projection heads
Reasoning-then-item sequential dependency enforced structurally: the recommendation head activates only after the reasoning chain is complete
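The dual-head design described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the class name `DualHead`, the helper functions, and the tiny weight matrices are all hypothetical, and a real model would compute the hidden states with a Transformer backbone. The point is only that both heads read the same hidden vector, with the language head projecting to vocabulary logits and the recommendation head scoring every item with one dot product.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x d) matrix by a length-d vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class DualHead:
    """Toy dual-head output layer over a shared hidden state (hypothetical names)."""

    def __init__(self, lm_weights, item_embeddings):
        self.lm_weights = lm_weights            # vocab_size x d, stands in for lm_head
        self.item_embeddings = item_embeddings  # num_items x d, stands in for rec_head

    def next_token_probs(self, hidden):
        # Language head: linear map from the hidden state to vocabulary logits.
        return softmax(matvec(self.lm_weights, hidden))

    def item_scores(self, final_hidden):
        # Recommendation head: one dot product per item, no token-by-token decoding.
        return matvec(self.item_embeddings, final_hidden)

    def recommend(self, final_hidden, k=1):
        # Rank items by score and return the indices of the top k.
        scores = self.item_scores(final_hidden)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]


heads = DualHead(
    lm_weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    item_embeddings=[[0.2, 0.1], [0.9, 0.8], [-0.3, 0.4]],
)
final_hidden = [1.0, 1.0]  # assumed hidden state after the reasoning chain ends
print(heads.recommend(final_hidden, k=2))  # prints [1, 0]
```

In this sketch the recommendation step costs a single matrix-vector product regardless of how item identifiers would be tokenized, which is the latency argument the summary makes against autoregressive ID generation.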
Modeling
Base Model: Decoder-only Transformer (Specific checkpoint not named in text)
Training Method: RecPO (Reinforcement Learning)
Objective Functions:
Purpose: Jointly optimize reasoning and recommendation.
Formally: L(theta) = - E [ sum( l_epsilon(r_t, A) ) ], where l_epsilon is the clipped surrogate term over the probability ratio r_t and the advantage A, and recommendation updates use only the trajectory i* with the highest advantage
Purpose: Assign quality scores to sampled trajectories.
Formally: Reward R = beta * R_similarity + R_ranking (NDCG)
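A minimal sketch of the fused reward may make the formula concrete. It rests on assumptions not stated in the summary: there is a single target item per example (so the ideal DCG is 1), and the similarity term is taken to be the softmax probability that the item scores assign to the target. The function names and the cutoff `k=10` are hypothetical; `beta=0.05` follows the approximate value listed under Key Hyperparameters.

```python
import math


def ndcg_at_k(scores, target, k):
    """NDCG@k with one relevant item: 1 / log2(rank + 1) if the target is in the top k."""
    rank = 1 + sum(1 for s in scores if s > scores[target])
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0


def softmax_similarity(scores, target):
    """Softmax probability of the target item (assumed form of the similarity reward)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[target] / sum(exps)


def fused_reward(scores, target, k=10, beta=0.05):
    """R = beta * R_similarity + R_ranking, as in the formula above."""
    return beta * softmax_similarity(scores, target) + ndcg_at_k(scores, target, k)


scores = [0.3, 1.7, 0.1]                    # hypothetical rec-head scores for three items
print(fused_reward(scores, target=1))       # target ranked first
print(fused_reward(scores, target=2, k=2))  # target outside the top 2
```

When the target falls outside the top k, the ranking term is exactly zero but the similarity term still varies with the scores, so trajectories that miss the cutoff can still be told apart.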
Key Hyperparameters:
reward_weight_beta: approx 0.05
sampling_strategy: Top-K sampling with temperature
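As a rough illustration of the sampling strategy named above, the following sketch draws one token with top-K sampling and temperature scaling. The function name, the example logits, and the values of `k` and `temperature` are assumptions chosen for illustration; the summary does not report the settings used in the paper.

```python
import math
import random


def top_k_sample(logits, k, temperature, rng):
    """Sample one index from the k highest logits after temperature scaling."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be positive")
    # Keep only the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, 1.0, -1.0]  # hypothetical next-token logits
samples = [top_k_sample(logits, k=2, temperature=0.7, rng=rng) for _ in range(5)]
print(samples)  # every sample is index 0 or index 2
```

Sampling several reasoning chains this way gives the group of diverse trajectories that the reward and advantage computation then compare.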
Compute: Not reported in the paper
Comparison to Prior Work
vs. Generative Recommenders: R2ec uses a specific recommendation head for single-step prediction instead of slow token-by-token ID generation
vs. Reasoning-augmented pipelines: R2ec is a single unified model rather than separate reasoning and ranking modules
vs. Standard RL (GRPO/RLOO): R2ec uses a domain-specific fused reward (discrete ranking + continuous similarity) and a modified update rule that filters recommendation gradients to the best reasoning path
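The comparison with GRPO and RLOO can be made concrete with a short sketch of how each baseline estimates advantages from a group of sampled trajectories, followed by the best-path selection the summary attributes to R2ec. The formulas are the commonly used forms of these estimators (group standardization for GRPO, mean of the other samples as baseline for RLOO); the function names and the example rewards are hypothetical, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """RLOO-style advantage: reward minus the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two trajectories")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def best_trajectory(advantages):
    """Index i* of the trajectory with the highest advantage."""
    return max(range(len(advantages)), key=lambda i: advantages[i])


rewards = [1.0, 0.5, 0.0, 0.5]  # hypothetical fused rewards for four reasoning paths
adv = grpo_advantages(rewards)
i_star = best_trajectory(adv)
print(i_star)  # prints 0
```

Under the update rule described above, every sampled trajectory would contribute to the reasoning-token loss, while only trajectory `i_star` would contribute gradients through the recommendation head.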
Limitations
Relies on reinforcement learning which can be unstable without careful hyperparameter tuning
Training requires sampling multiple trajectories per input, increasing computational cost during the training phase
Reasoning is inherently subjective; the model learns 'reasoning' that maximizes reward, which may not always align with human-readable logic
Code and checkpoints available at https://github.com/YRYangang/RRec. The text describes the full RL pipeline (RecPO) and reward formulation.
📊 Experiments & Results
Evaluation Setup
Sequential recommendation on three datasets
Benchmarks:
Not explicitly named in text (Sequential Item Recommendation)
Metrics:
NDCG
Inference Latency
Reasoning Quality (Qualitative)
Statistical methodology: Not explicitly reported in the paper
Main Takeaways
R2ec significantly outperforms traditional, LLM-based, and reasoning-augmented baselines across three datasets.
The dual-head architecture significantly reduces inference latency compared to autoregressive generative recommenders and multi-module reasoning pipelines.
The RecPO framework successfully enables the model to learn context-aware reasoning strategies (contextual understanding) without human-annotated reasoning data.
Ablation studies confirm the necessity of the fused reward scheme: a ranking metric alone yields too sparse a training signal.
📚 Prerequisite Knowledge
Prerequisites
Transformer architecture (Decoder-only)
Reinforcement Learning (Policy Gradients, PPO)
LLM-based Recommendation paradigms (Encoder vs. Generative)
Key Terms
RecPO: Reinforcement Learning for Recommendation Preference Optimization—the proposed training framework that optimizes reasoning without human annotations
Dual-head architecture: A model design with two output layers sharing a backbone: one for generating text (reasoning) and one for scoring items (recommendation)
NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that accounts for the position of relevant items
GRPO: Group Relative Policy Optimization—an RL baseline method for LLMs that estimates advantages from group scores
RLOO: Reinforcement Learning with Leave-One-Out—an RL baseline method for advantage estimation
PPO: Proximal Policy Optimization—an RL algorithm that limits policy updates to a trust region to ensure stability
Softmax similarity: A continuous reward signal based on the dot-product similarity between the predicted item embedding and the target item embedding
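To connect the PPO entry to the l_epsilon(r_t, A) term in the objective, here is a minimal sketch of the standard PPO clipped surrogate for a single token. It assumes l_epsilon takes this standard form; the function name and the default `epsilon=0.2` (a common choice, not a value reported in the summary) are assumptions.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Standard PPO clipped term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A large ratio with a positive advantage is capped at (1 + epsilon) * A.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# A large ratio with a negative advantage is not capped, so the penalty stays in full.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Taking the minimum makes the term a pessimistic bound: the policy gains nothing from moving the probability ratio beyond the clipping range in the direction the advantage favors.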