Rec-R1: Bridging Generative Large Language Models and User-Centric Recommendation Systems via Reinforcement Learning

📝 Paper Summary

LLM for Recommendation (LLM4Rec)
Reinforcement Learning for Recommendation

Rec-R1 optimizes LLMs for recommendation tasks via reinforcement learning using direct feedback from downstream black-box recommenders, bypassing the need for supervised fine-tuning data.

Core Problem

LLMs in recommendation systems are typically frozen or fine-tuned on proxy tasks (like mimicking GPT-4), creating a disconnect between the generation objective and actual recommendation performance.

Why it matters:

Supervised Fine-Tuning (SFT) is constrained by the quality of teacher models (e.g., GPT-4o), imposing a performance ceiling
Proxy objectives (like next-token prediction) do not align with downstream metrics like NDCG or Recall
Generating high-quality SFT data is expensive and time-consuming, often requiring human annotation or commercial APIs

Concrete Example: In product search, an SFT model might rewrite a query to sound natural (mimicking GPT-4), but the rewritten query might fail to retrieve relevant items because the downstream retriever (e.g., BM25) responds better to keyword-heavy queries. Rec-R1 learns to generate the keyword-heavy query because it optimizes for the retrieval score directly.

Key Novelty

Closed-loop RL for RecSys-LLM Alignment

Treats the LLM as a policy that generates inputs (rewritten queries, profiles, rankings) for a fixed, black-box recommendation system
Uses the recommendation system's output metrics (NDCG, Recall) directly as reward signals for reinforcement learning
Optimizes the LLM to maximize these rewards via Group Relative Policy Optimization (GRPO), aligning generation with recommendation utility rather than linguistic plausibility

Architecture

Figure: comparison of the Prompting, SFT, and Rec-R1 paradigms, highlighting the closed feedback loop in Rec-R1.

Evaluation Highlights

+21.45 points of NDCG@100 (0.4571 to 0.6716) on ESCI (Video Games) using Rec-R1 with a BM25 retriever, compared to the BM25 baseline
+18.76 points of NDCG@100 (0.5058 to 0.6934) on ESCI (Video Games) using Rec-R1 with a BLAIR dense retriever, compared to the BLAIR baseline
Preserves instruction-following capability (IFEval scores are maintained), whereas SFT causes a ~27-point drop, a sign of the catastrophic forgetting that Rec-R1 avoids

Breakthrough Assessment

8/10

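Strong conceptual advance by replacing SFT proxy objectives with direct RL optimization on black-box feedback. The empirical gains are large (about +19 to +21 NDCG@100 points on ESCI product search), and the method addresses the fundamental 'alignment' problem in LLM4Rec: the mismatch between the generation objective and recommendation performance.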

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation for downstream black-box utility optimization

Inputs: Recommendation-relevant input s (e.g., user query, interaction history)

Outputs: Textual action a (e.g., rewritten query, user profile, ranked list) to be consumed by the RecSys

Pipeline Flow

Generator (LLM) produces text action a from input s
Recommender (Black Box) consumes a to retrieve/rank items
Evaluator calculates metric (Reward) based on ground truth
RL Optimizer updates Generator policy
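
The first three steps of this flow can be sketched as a single reward computation. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `retrieve`, `toy_generate`, `toy_retrieve`, and `CATALOG` are hypothetical stand-ins for the LLM policy, the black-box recommender, and the item corpus, and Recall@k is used as the example reward.

```python
from typing import Callable, List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def rollout_reward(
    s: str,
    generate: Callable[[str], str],        # hypothetical LLM policy: s -> a
    retrieve: Callable[[str], List[str]],  # hypothetical black-box recommender
    relevant: Set[str],                    # ground-truth item ids for input s
    k: int = 10,
) -> float:
    a = generate(s)                          # 1. Generator produces text action a
    ranked = retrieve(a)                     # 2. Recommender consumes a
    return recall_at_k(ranked, relevant, k)  # 3. Evaluator returns a scalar reward
    # 4. The RL optimizer (not shown) would use this scalar to update the policy.

# Toy stand-ins, for illustration only.
CATALOG = {
    "i1": "wireless gaming headset",
    "i2": "usb keyboard",
    "i3": "gaming mouse pad",
}

def toy_generate(s: str) -> str:
    # Pretend "rewrite": strip filler words from the query.
    return s.lower().replace("something for", "").strip()

def toy_retrieve(a: str) -> List[str]:
    # Rank items by the number of words shared with the action text.
    terms = set(a.split())
    scored = [(len(terms & set(text.split())), item) for item, text in CATALOG.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [item for score, item in ranked if score > 0]

reward = rollout_reward("Something for gaming", toy_generate, toy_retrieve, {"i1", "i3"})
```

Because the recommender is only ever called as a function, nothing in this loop needs it to be differentiable, which is what lets a metric such as NDCG or Recall serve as the reward.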

System Modules

Generator (LLM)

Generates the intermediate text (rewritten query, profile, or ranking) to control the recommender

Model or implementation: Qwen-2.5-3B-Instruct

Recommender

Executes the retrieval or ranking task using the generated text

Model or implementation: Varies (BM25, BLAIR, or Rec-R1-Retriever)

Evaluator

Computes the reward scalar based on the quality of the Recommender's output

Model or implementation: Deterministic Metric Function

Novel Architectural Elements

Direct integration of non-differentiable recommender metrics (NDCG/Recall) as RL rewards for LLM training
Closed-loop feedback mechanism where the LLM adapts to the specific quirks of the downstream recommender (e.g., adapting to BM25 vs. Dense Retriever)

Modeling

Base Model: Qwen-2.5-3B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

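Purpose: Maximize the expected reward (the recommendation metric) while a KL penalty, weighted by beta, keeps the policy from drifting far from its reference model.

Formally: $\max_{\theta} \; \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)} \left[ f(a \mid s) \right]$, where $\pi_{\theta}$ is the LLM policy, $s$ is the input, $a$ is the generated text action, and $f$ is the recommender metric.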
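
GRPO, the training method named above, scores each sampled output relative to the other outputs drawn for the same input, so no learned value network is needed as a baseline. The sketch below shows only that normalization step. It assumes the population standard deviation and a small `eps` for numerical safety (both are illustrative choices, since implementations differ), and it omits the clipped policy-gradient term and the KL penalty.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group (outputs sampled for the same input).

    Each advantage is (r_i - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards (e.g., NDCG values) for four rewrites of one query.
rewards = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(rewards)
```

Outputs that beat the group average get a positive advantage and are reinforced; outputs below it are discouraged. If every output in the group earns the same reward, all advantages are zero and the group contributes no gradient signal.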

Trainable Parameters: Full LLM parameters

Key Hyperparameters:

learning_rate: 2e-6 (Product Search), 5e-6 (Sequential/Rerank)
batch_size: Not explicitly reported in the paper
beta (KL penalty): 0.04 (Search), 0.01 (Seq/Rerank)
group_size: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LLM-Rec/TallRec: Rec-R1 optimizes directly for downstream metrics via RL, avoiding the imitation ceiling of SFT.
vs. Prompting (GPT-4o): Rec-R1 updates parameters to align with the specific retriever, often outperforming much larger frozen models.
vs. Direct SFT (on GPT-4 data): Rec-R1 does not require expensive data synthesis and prevents catastrophic forgetting typical of SFT.

Limitations

Dependency on high-quality relevance labels (ground truth dictionary D) for reward calculation
Sequential recommendation performance in transductive settings lags behind specialized non-LLM baselines (SASRec)
Computational cost of RL training is generally higher than SFT (though GRPO mitigates memory usage)

Reproducibility

Code: https://github.com/linjc16/Rec-R1

Code and data are available at https://github.com/linjc16/Rec-R1. The paper uses public datasets (ESCI, Amazon-C4, Amazon Beauty). Baseline implementations (BLAIR, Qwen) are open source.

📊 Experiments & Results

Evaluation Setup

Three tasks: Product Search (Query Rewriting), Sequential Recommendation (User Profiling), Product Re-ranking

Benchmarks:

ESCI (Product Search / Re-ranking)
Amazon-C4 (Complex Product Search)
Amazon Beauty (Sequential Recommendation)
IFEval (Instruction Following, used to test generalization)

Metrics:

NDCG@100 (Search)
Recall@10 (Sequential)
NDCG@10 (Re-ranking)
Strict/Loose Accuracy (IFEval)
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Product Search (Query Rewriting): Rec-R1 significantly boosts retrieval performance compared to original queries and prompting baselines across both sparse and dense retrievers.
ESCI (Video Games), BM25 retriever	NDCG@100	0.4571	0.6716	+0.2145
ESCI (Video Games), BLAIR retriever	NDCG@100	0.5058	0.6934	+0.1876
Sequential Recommendation: Rec-R1 excels in the Inductive (cold-start) setting compared to traditional SRec models.
Amazon Beauty (Inductive)	Recall@10	0.0658	0.1009	+0.0351
Amazon Beauty (Inductive)	NDCG@10	0.0345	0.0594	+0.0249
Product Re-ranking: Rec-R1 outperforms both dedicated cross-encoders and larger LLM rerankers.
ESCI (Video Games)	NDCG@10	0.5513	0.7428	+0.1915
ESCI (Video Games)	NDCG@10	0.7241	0.7428	+0.0187
Generalization Capabilities: Rec-R1 maintains general instruction following abilities unlike SFT.
IFEval	Strict Accuracy	29.76	57.88	+28.12
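
The Δ column reports absolute differences (This Paper minus Baseline). The short check below recomputes them from the table values and also derives relative gains for context. The relative gains are computed here and are not figures reported in the table; the BM25 and BLAIR labels follow the Evaluation Highlights.

```python
# (label, baseline, this paper, reported delta), copied from the table above.
rows = [
    ("ESCI NDCG@100 (BM25)", 0.4571, 0.6716, 0.2145),
    ("ESCI NDCG@100 (BLAIR)", 0.5058, 0.6934, 0.1876),
    ("Amazon Beauty Recall@10", 0.0658, 0.1009, 0.0351),
    ("Amazon Beauty NDCG@10", 0.0345, 0.0594, 0.0249),
    ("ESCI rerank NDCG@10 (first row)", 0.5513, 0.7428, 0.1915),
    ("ESCI rerank NDCG@10 (second row)", 0.7241, 0.7428, 0.0187),
    ("IFEval strict accuracy", 29.76, 57.88, 28.12),
]

summary = []
for label, baseline, ours, reported in rows:
    absolute = round(ours - baseline, 4)          # should match the reported delta
    relative = round(100 * (ours - baseline) / baseline, 1)  # percent over baseline
    summary.append((label, absolute, relative, absolute == reported))

for label, absolute, relative, ok in summary:
    print(f"{label}: +{absolute} absolute, +{relative}% relative, matches table: {ok}")
```

Reading gains as absolute points or as relative percentages gives quite different numbers, so it helps to state which one is meant when quoting these results.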

Experiment Figures

Proof-of-concept on ESCI comparing SFT vs GPT-4o performance and cost.

Main Takeaways

Rec-R1 consistently outperforms both Zero-shot/Few-shot prompting and Supervised Fine-Tuning across multiple recommendation tasks.
The framework is retriever-agnostic, boosting performance for both sparse (BM25) and dense (BLAIR) systems.
In sequential recommendation, Rec-R1 is particularly strong in cold-start (Inductive) settings where traditional ID-based models fail.
Unlike SFT, which suffers from catastrophic forgetting (drastic drop in IFEval), Rec-R1 preserves the general capabilities of the base LLM.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients)
Information Retrieval metrics (NDCG, Recall)
Large Language Models (SFT vs. RLHF)

Key Terms

SFT: Supervised Fine-Tuning—training a model to mimic a dataset of inputs and targets (e.g., mimicking GPT-4's query rewrites)

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from a group of outputs for the same input, reducing the need for a separate value network

NDCG: Normalized Discounted Cumulative Gain—a measure of ranking quality that considers the position of relevant items in a list
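
As a worked illustration of the NDCG definition above, here is a minimal sketch. It assumes linear gain (the relevance value itself, rather than the 2^rel - 1 variant) and, for simplicity, builds the ideal ordering from the same list of relevances, which is only exact when every relevant item appears in that list.

```python
import math
from typing import List

def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of the position."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: List[float], k: int) -> float:
    """DCG of the given ranking divided by the DCG of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the items at ranks 1..3 (1 = relevant, 0 = not relevant).
score = ndcg_at_k([1, 0, 1], k=3)
```

In this example the second relevant item sits at rank 3 instead of rank 2, so the score falls below 1.0; moving it up to rank 2 would give a perfect score.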

BM25: Best Matching 25—a standard bag-of-words retrieval function that ranks documents based on term frequency and inverse document frequency
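
To make the bag-of-words behavior concrete, the following is a minimal sketch of Okapi BM25 scoring. The values `k1 = 1.5` and `b = 0.75` are common defaults rather than settings taken from the paper, the IDF uses the smoothed form that stays non-negative, and the tiny corpus and both queries are hypothetical.

```python
import math
from collections import Counter
from typing import List

def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs              # average document length
    df = Counter(term for d in docs for term in set(d))     # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)                                     # term frequency in d
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)  # length normalization
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "wireless gaming headset with microphone".split(),
    "usb mechanical keyboard".split(),
    "gaming mouse pad large".split(),
]
natural = "i am looking for something to hear my games better".split()
keywords = "wireless gaming headset microphone".split()

natural_scores = bm25_scores(natural, docs)
keyword_scores = bm25_scores(keywords, docs)
```

The natural-sounding query shares no exact tokens with any document, so every score is zero, while the keyword-heavy query ranks the headset first. This exact-match behavior is why a query rewriter tuned against BM25 would be pushed toward keyword-style output.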

Cold-start: A scenario where the system has little or no prior interaction data for a user or item

Transductive setting: Evaluation where test items were seen during training

Inductive setting: Evaluation where test items were NOT seen during training (testing generalization)

KL divergence: Kullback-Leibler divergence—a statistical distance measuring how one probability distribution differs from another
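
A minimal sketch of the definition for discrete distributions follows. It assumes both inputs are probability vectors over the same outcomes and that `q` is positive wherever `p` is; the example distributions are hypothetical. In KL-regularized RL, a term of this kind, scaled by beta, typically penalizes the trained policy for moving away from a reference model.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    Terms with p_i == 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.5, 0.5]      # hypothetical next-token distribution of the trained policy
reference = [0.9, 0.1]   # hypothetical distribution of the reference model

forward = kl_divergence(policy, reference)
reverse = kl_divergence(reference, policy)
```

The two directions give different values, which is why KL divergence is not a true metric: the order of the arguments matters.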

BLAIR: A dense retrieval model used as a baseline and backbone in the experiments

Cross-encoder: A ranking model that processes query and document simultaneously to output a relevance score (accurate but slow)

PPO: Proximal Policy Optimization—a popular reinforcement learning algorithm (mentioned as a contrast to GRPO)

IFEval: Instruction Following Evaluation—a benchmark measuring an LLM's ability to follow explicit constraints and formatting instructions