National University of Singapore,
University of Science and Technology of China,
The Chinese University of Hong Kong
arXiv
(2025)
Recommendation, RL, Reasoning, P13N
📝 Paper Summary
LLM-based Recommendation, Latent Reasoning
LatentR3 trains Large Language Models to reason via compact latent vectors rather than text, using reinforcement learning with a perplexity-based reward to eliminate the need for explicit reasoning data.
Core Problem
Applying Chain-of-Thought (CoT) to recommendation is difficult because high-quality reasoning data is unavailable (user feedback is implicit) and generating explicit textual reasoning causes prohibitive inference latency.
Why it matters:
Explicit CoT generation is too slow for real-time recommendation systems, creating a bottleneck for deployment
Obtaining 'ground truth' reasoning for user preferences is nearly impossible, making supervised fine-tuning of reasoning capabilities unreliable
Current methods rely on distilling CoT data, which caps performance at the quality of the teacher's reasoning
Concrete Example: A standard CoT approach would require an LLM to generate a long paragraph explaining why a user likes 'Inception' before recommending 'Interstellar', which could double inference time. LatentR3 instead represents this thought process as a single compact vector, so no reasoning text has to be decoded.
Key Novelty
Reinforced Latent Reasoning (LatentR3)
Replaces textual reasoning steps with 'latent thoughts'—continuous vectors generated by a special attention layer—that are information-dense and efficient
Trains reasoning via Reinforcement Learning using item perplexity as a reward, bypassing the need for explicit Chain-of-Thought supervision
Adapts Group Relative Policy Optimization (GRPO) to continuous space using Gaussian reparameterization for sampling and batch-relative advantage
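The two ingredients named in the last bullet can be sketched in a few lines. This is a minimal illustration, not the paper's LR-GRPO implementation: the function names, the fixed standard deviation, and the placeholder rewards are all assumptions, and the exact set of samples the paper normalizes over is not specified here.

```python
import random
from statistics import mean, pstdev

def sample_latent(mu, sigma, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

def relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of the set it belongs to."""
    baseline = mean(rewards)
    scale = pstdev(rewards)
    return [(r - baseline) / (scale + eps) for r in rewards]

rng = random.Random(0)
mu = [0.2, -0.1, 0.5]            # hypothetical mean latent produced by the attention layer
samples = [sample_latent(mu, 0.1, rng) for _ in range(4)]
rewards = [1.0, 2.0, 3.0, 4.0]   # placeholder rewards, e.g. derived from item perplexity
advantages = relative_advantages(rewards)
```

Because the noise is drawn separately from the mean, gradients can flow into whatever produces `mu`, and the normalized advantages replace a learned value baseline.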
Architecture
The LatentR3 framework, illustrating the Latent Reasoning Architecture and the Two-Stage Reinforced Learning strategy.
Breakthrough Assessment
8/10
Proposes a significant shift from textual CoT to latent reasoning in RecSys, addressing key latency and data bottlenecks. The adaptation of GRPO to continuous latent spaces without supervision is methodologically strong.
⚙️ Technical Details
Problem Definition
Setting: Next-item recommendation formulated as a generative task
Inputs: User historical interactions converted into a textual prompt x
Outputs: Predicted next item y (and intermediate latent reasoning r)
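As a rough illustration of the input side of this setting, a history can be rendered into a prompt as below. The template wording and the function name `build_prompt` are hypothetical; the paper's actual prompt format is not given in this summary.

```python
def build_prompt(history_titles):
    """Render an interaction history as a textual prompt x (hypothetical template)."""
    listed = ", ".join(f'"{title}"' for title in history_titles)
    return f"The user has interacted with: {listed}. Predict the next item title:"

prompt_x = build_prompt(["Inception", "Memento"])
```

The model's generated continuation of `prompt_x` is then read as the predicted next item y.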
Pipeline Flow
Input Processing (History to Prompt)
Latent Reasoning Generation (LatentRATT)
Final Generation (Next Item Prediction)
System Modules
Input Encoder
Converts user history h into textual prompt x
Model or implementation: LLM (frozen during RL)
LatentRATT
Generates a sequence of latent reasoning tokens (continuous vectors) autoregressively
Model or implementation: Additional Attention Layer (Trainable)
Prediction Head
Predicts the next item based on prompt and latent thoughts
Model or implementation: LLM (frozen during RL)
Novel Architectural Elements
LatentRATT: An explicit attention layer added on top of the LLM decoding layer to generate latent reasoning tokens aligned with the input embedding space
Continuous reasoning pipeline: The system generates continuous vectors (thoughts) that are fed back as inputs for the final prediction, rather than discrete text
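The flow described above, where an attention layer reads the LLM's hidden states and its output is appended to the input sequence as a continuous token, can be sketched as follows. Everything here is a toy stand-in under stated assumptions: `toy_llm` replaces the frozen LLM, `attention_pool` is a plain single-head dot-product attention with no learned weights, and neither reflects the actual LatentRATT parameterization.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states):
    """Single-head dot-product attention; the last hidden state acts as the query."""
    query = hidden_states[-1]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, h)) / scale for h in hidden_states]
    weights = softmax(scores)
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(len(query))]

def toy_llm(embeddings):
    """Stand-in for the frozen LLM: returns one hidden state per input position."""
    return [[math.tanh(v) for v in e] for e in embeddings]

def generate_latent_thoughts(prompt_embeddings, num_thoughts):
    """Autoregressively append continuous latent tokens to the input sequence."""
    sequence = list(prompt_embeddings)
    for _ in range(num_thoughts):
        hidden = toy_llm(sequence)        # hidden states over prompt + earlier thoughts
        latent = attention_pool(hidden)   # one latent reasoning token
        sequence.append(latent)           # fed back as an input embedding
    return sequence

prompt = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
extended = generate_latent_thoughts(prompt, num_thoughts=2)
```

The final prediction step would run the LLM once more over `extended` and decode the item title; no reasoning text is produced at any point.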
Modeling
Base Model: Large Language Model (Specific architecture not detailed in snippet, generally applicable)
Code is publicly available at https://github.com/xuwenxinedu/R3. The paper describes the full training algorithm (LR-GRPO) and reward mechanism.
📊 Experiments & Results
Evaluation Setup
Next-item recommendation where historical interactions are prompts and the target is the next item title.
Metrics:
Performance metrics not reported in snippet (likely Recall/NDCG)
Statistical methodology: Not reported in the paper
Main Takeaways
The paper claims LatentR3 enables effective latent reasoning without direct supervision.
The method eliminates the inference latency overhead associated with explicit Chain-of-Thought.
The proposed LR-GRPO algorithm stabilizes RL training in continuous spaces using batch-relative advantages.
📚 Prerequisite Knowledge
Prerequisites
Reinforcement Learning (specifically PPO/GRPO)
Large Language Models (LLMs) for Recommendation
Latent Variable Models
Key Terms
CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps before the final answer
Latent Reasoning: Reasoning performed in the model's internal hidden state space (vectors) rather than via visible text generation
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from group averages to reduce variance without a separate value network
Reparameterization Trick: A mathematical technique (often using Gaussian noise) that allows gradients to backpropagate through random sampling steps
SFT: Supervised Fine-Tuning—training a model on labeled examples using standard cross-entropy loss
Perplexity (PPL): A metric measuring how uncertain a model is about a prediction; lower perplexity indicates higher confidence
LatentRATT: The specific attention layer proposed in this paper that generates latent reasoning tokens on top of the LLM
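To make the perplexity term concrete, the sketch below computes it from per-token log-probabilities of a target item title. The mapping from perplexity to a reward (here simply its negative) is an assumption for illustration; the summary only states that item perplexity serves as the reward signal.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the item's tokens."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def reward_from_perplexity(token_log_probs):
    """Hypothetical reward: lower perplexity on the true item gives a higher reward."""
    return -perplexity(token_log_probs)

confident = [math.log(0.9), math.log(0.8)]   # model assigns high probability to the item
uncertain = [math.log(0.2), math.log(0.1)]   # model assigns low probability to the item
```

With these inputs, `perplexity(confident)` is smaller than `perplexity(uncertain)`, so the confident case receives the larger reward.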