R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning

Y Li, Q Luo, X Li, B Li, Q Cheng, B Wang…
School of Computer Science, Fudan University
arXiv preprint arXiv … (2025)
RAG RL Reasoning QA

📝 Paper Summary

Agentic RAG pipeline Iterative Retrieval
R3-RAG uses reinforcement learning to train LLMs to interleave reasoning and retrieval step by step, combining a process reward for the relevance of retrieved documents with an outcome reward for answer correctness.
Core Problem
Dense retrievers lack reasoning capabilities and are often the bottleneck in RAG systems, while existing iterative RAG methods rely on rigid, human-designed workflows that fail to fully exploit LLMs' reasoning potential.
Why it matters:
  • Standard dense retrievers have significantly fewer parameters than LLMs and cannot perform the step-by-step thinking that complex queries require
  • Static or rule-based iterative workflows restrict the model's ability to explore optimal retrieval strategies dynamically
  • LLMs are not natively trained to invoke retrievers iteratively to gather comprehensive evidence, leading to hallucinations when retrieval is insufficient
Concrete Example: For the question 'Which film has the director born first, The Model Couple or Thendral Veesum?', a standard RAG system retrieves only the directors' names and misses their birth dates. R3-RAG decomposes the question into four queries (director of film A, director of film B, birth date of director A, birth date of director B) and adapts when a query fails.
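The decomposition above can be sketched as a small retrieve-and-retry loop. This is a minimal illustration, not the paper's implementation: `toy_retrieve`, `run_episode`, the fixed `plan`, and the placeholder corpus are all hypothetical. In R3-RAG the next query would be generated by the LLM from the history so far rather than read from a fixed plan, and the retriever would be a real one rather than an exact-match lookup.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (query issued, document returned or None)


def toy_retrieve(query: str, corpus: Dict[str, str]) -> Optional[str]:
    # Hypothetical exact-match lookup standing in for a real retriever.
    return corpus.get(query)


def run_episode(
    plan: List[List[str]], corpus: Dict[str, str], max_steps: int = 8
) -> Tuple[List[Step], List[str]]:
    """Issue one sub-query at a time; on a miss, try the next phrasing."""
    history: List[Step] = []
    evidence: List[str] = []
    for phrasings in plan:
        for query in phrasings:
            if len(history) >= max_steps:
                return history, evidence
            doc = toy_retrieve(query, corpus)
            history.append((query, doc))
            if doc is not None:
                evidence.append(doc)
                break  # sub-question resolved, move to the next one
    return history, evidence


# Placeholder corpus: no real facts about the films in the example.
corpus = {
    "director of film A": "Film A was directed by Person X.",
    "director of film B": "Film B was directed by Person Y.",
    "birth date of Person X": "Person X was born in <year X>.",
    "Person Y date of birth": "Person Y was born in <year Y>.",
}
# Each sub-question lists alternative phrasings (assumed, for illustration).
plan = [
    ["director of film A"],
    ["director of film B"],
    ["birth date of Person X"],
    ["birth date of Person Y", "Person Y date of birth"],
]
history, evidence = run_episode(plan, corpus)
```

Here the fourth sub-question misses on its first phrasing and succeeds on the second, so the episode takes five retrieval steps to collect four pieces of evidence.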
Key Novelty
Reinforcement Learning for Reasoning and Retrieval (R3-RAG)
  • Trains the LLM to autonomously decide when to reason, when to retrieve, and when to answer using a unified action space
  • Introduces a 'relevance-based document verification' process reward that encourages retrieving useful documents at every step, not just getting the final answer right
  • Combines a cold-start phase (supervised learning on synthetic trajectories) with PPO training to refine the exploration of the retrieval environment
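To make the two reward signals concrete, the sketch below assigns a per-step process reward to each retrieval and a single outcome reward to the final answer. The function name `trajectory_rewards`, the binary relevance scores, and both weights are assumptions chosen for illustration; this summary does not give the paper's actual reward formula or coefficients.

```python
from typing import List


def trajectory_rewards(
    relevance: List[float],
    answer_correct: bool,
    process_weight: float = 0.5,  # assumed value, not from the paper
    outcome_weight: float = 1.0,  # assumed value, not from the paper
) -> List[float]:
    """Return one reward per step: retrieval steps first, answer step last.

    relevance[i] is a score in [0, 1] for the documents retrieved at step i.
    """
    rewards = [process_weight * r for r in relevance]
    rewards.append(outcome_weight * (1.0 if answer_correct else 0.0))
    return rewards


# Four retrieval steps, the second of which returned irrelevant documents,
# followed by a correct final answer.
rewards = trajectory_rewards([1.0, 0.0, 1.0, 1.0], answer_correct=True)
```

Under these assumed weights, a trajectory that retrieves irrelevant documents earns less than one that retrieves relevant ones even when both reach the correct answer, which is the behavior a process reward is meant to encourage.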
Evaluation Highlights
  • Outperforms the strong iterative baseline IRCoT by ~15 percentage points on average across three multi-hop QA datasets using Llama-3.1-8B
  • Achieves 64.4% accuracy on HotpotQA with Llama-3.1-8B, surpassing standard RAG+CoT (53.3%) and ReAct (30.8%)
  • Maintains consistent performance gains across different retrievers (BM25, E5, BGE), showing strong transferability without retraining the RL component
Breakthrough Assessment
8/10
Shows significant performance gains over a strong baseline (IRCoT) and demonstrates a successful application of granular process rewards to retrieval actions, addressing a key bottleneck in agentic RAG.