Process vs. outcome reward: Which is better for agentic RAG reinforcement learning

📝 Paper Summary

Agentic RAG pipeline, Reinforcement Learning for RAG

ReasonRAG improves agentic retrieval-augmented generation by using Monte Carlo Tree Search to construct a high-quality process-level reward dataset, enabling the model to learn optimal reasoning steps via process-supervised Direct Preference Optimization.

Core Problem

Existing agentic RAG systems relying on outcome-supervised RL (rewarding only the final answer) suffer from sparse rewards, low exploration efficiency, and gradient conflicts when errors occur late in the reasoning chain.

Why it matters:

Outcome-based rewards fail to identify exactly which step (query generation, evidence extraction, or reasoning) caused an error, leading to inefficient learning
Current methods require massive amounts of training data (e.g., 90k samples for Search-R1) to converge due to sparse feedback signals
Annotating high-quality process-level steps manually for RAG is prohibitively expensive due to the complexity of search and reasoning tasks

Concrete Example: In a multi-hop question, if a model performs a correct search but fails to extract the key evidence, outcome-based RL penalizes the entire chain (including the correct search). ReasonRAG's process supervision rewards the correct search step while correcting the extraction step.
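
The difference in credit assignment can be illustrated with a toy sketch. The step names and per-step scores below are hypothetical values chosen only to mirror the example above; they are not taken from the paper.

```python
from typing import List


def outcome_rewards(steps: List[str], final_correct: bool) -> List[float]:
    """Outcome supervision: every step inherits the single final-answer reward."""
    reward = 1.0 if final_correct else 0.0
    return [reward for _ in steps]


def process_rewards(step_scores: List[float]) -> List[float]:
    """Process supervision: each step keeps its own score."""
    return list(step_scores)


steps = ["search", "extract_evidence", "answer"]
# Hypothetical scores: the search was good, the extraction failed,
# and the answer built on the bad extraction was wrong.
step_scores = [1.0, 0.0, 0.0]

outcome = outcome_rewards(steps, final_correct=False)
process = process_rewards(step_scores)

for name, o, p in zip(steps, outcome, process):
    print(f"{name:18s} outcome={o:.1f} process={p:.1f}")
```

Under outcome supervision the correct search step receives the same zero reward as the failed extraction, which is the credit-assignment problem described above.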

Key Novelty

ReasonRAG (Process-Supervised Agentic RAG)

Uses Monte Carlo Tree Search (MCTS) to explore diverse reasoning paths (querying, reading, answering) and identifies high-quality trajectories automatically
Introduces Shortest Path Reward Estimation (SPRE) to assign rewards to intermediate steps, favoring correct answers reached via the most efficient path
Constructs RAG-ProGuide, a dataset of 13k process-level preference pairs, to train the model via Direct Preference Optimization (DPO) rather than just outcome-based RL
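
This summary does not give the SPRE formula, so the following is only a sketch of one plausible form: the reward of an intermediate state is the average, over rollouts from that state, of answer correctness discounted by the number of steps taken. The `Rollout` type, the `decay` parameter, and its default value are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    correctness: float  # e.g. answer quality in [0, 1] (assumed scoring)
    num_steps: int      # steps from the evaluated state to the final answer


def spre_like_reward(rollouts: List[Rollout], decay: float = 0.9) -> float:
    """Average step-discounted correctness over rollouts (assumed form, not the paper's)."""
    if not rollouts:
        return 0.0
    total = sum(r.correctness * decay ** r.num_steps for r in rollouts)
    return total / len(rollouts)


short_path = spre_like_reward([Rollout(correctness=1.0, num_steps=1)])
long_path = spre_like_reward([Rollout(correctness=1.0, num_steps=4)])
print(short_path > long_path)
```

With any decay below 1, two equally correct rollouts are ranked by length, which matches the stated goal of favoring the most efficient path.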

Architecture

The ReasonRAG framework pipeline, divided into Data Construction (using MCTS and SPRE) and Inference (Agentic RAG workflow)

Evaluation Highlights

Outperforms Search-R1 on HotpotQA by +2.7 F1 points (55.5 vs 52.8) despite using 18x fewer training instances (5k vs 90k)
Achieves higher F1 scores than GPT-4o on 2WikiMultihopQA (53.7 vs 44.5) using a 7B parameter model
Demonstrates a 35.8% win rate over Search-R1 in pairwise comparisons, with only a 17.0% loss rate

Breakthrough Assessment

8/10

Significantly improves data efficiency for training agentic RAG systems (5k vs 90k samples) by successfully automating process-level reward annotation, addressing a major bottleneck in RL-based RAG.

⚙️ Technical Details

Problem Definition

Setting: Agentic RAG where a model interacts with a search engine to answer questions requiring multi-step reasoning

Inputs: Natural language question x

Outputs: Final answer y, generated after a sequence of intermediate actions (queries, evidence extraction)

Pipeline Flow

Input Question
Reasoning State (LLM decides to Search or Answer)
Action Execution (Search -> Grounding -> Reasoning OR Generate Answer)
MCTS Exploration (During Training Data Generation)
DPO Training (Policy Optimization)
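
The inference-time part of this flow (the first three stages) can be sketched as a simple loop. The `Policy` and `Retriever` interfaces, the action names, and the `max_steps` budget are hypothetical; the sketch also collapses the separate grounding (evidence extraction) step into appending retrieved documents.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces, not the paper's API:
# the policy sees the question and the evidence so far and returns
# ("search", query) or ("answer", answer_text).
Policy = Callable[[str, List[str]], Tuple[str, str]]
Retriever = Callable[[str], List[str]]


def run_agentic_rag(
    question: str,
    policy: Policy,
    retriever: Retriever,
    max_steps: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Alternate between deciding and searching until the policy answers."""
    evidence: List[str] = []
    for _ in range(max_steps):
        action, payload = policy(question, evidence)
        if action == "answer":
            return payload, evidence
        # Otherwise treat the payload as a search query and keep the results.
        evidence.extend(retriever(payload))
    return None, evidence  # step budget exhausted without an answer


def toy_policy(question: str, evidence: List[str]) -> Tuple[str, str]:
    # Search once, then answer with the first retrieved document.
    if not evidence:
        return ("search", question)
    return ("answer", evidence[0])


def toy_retriever(query: str) -> List[str]:
    return [f"doc about: {query}"]


answer, evidence = run_agentic_rag("Who wrote Dune?", toy_policy, toy_retriever)
print(answer)
```

In a real system the policy would be the LLM and the retriever a search backend; the toy stubs exist only so the loop runs end to end.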

System Modules

Policy Model

Orchestrates the entire process: decides whether to query or answer, generates queries, extracts evidence, and synthesizes final answers

Model or implementation: Qwen2.5-7B-Instruct

Retriever

Executes search queries generated by the policy model

Model or implementation: Search engine (DuckDuckGo Search API is an inference from the phrase 'search engine', not a reported detail)

SPRE Reward Estimator

Calculates rewards for intermediate steps during MCTS to guide dataset creation

Model or implementation: Algorithmic function (not a neural network)

Novel Architectural Elements

Shortest Path Reward Estimation (SPRE) integration into MCTS for automatic process-level label generation in RAG

Modeling

Base Model: Qwen2.5-7B-Instruct

Training Method: Direct Preference Optimization (DPO) with Process Supervision

Objective Functions:

Purpose: Optimize the policy to prefer high-reward process steps over low-reward ones.

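Formally: DPO loss $\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$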
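
For a single preference pair, the loss above reduces to a few lines of arithmetic. The sketch below assumes the four sequence log-probabilities have already been computed by the policy and the reference model; in the process-supervised setting, y_w and y_l would be a preferred and a rejected step rather than full answers. The function name and argument names are illustrative.

```python
import math


def dpo_loss(
    logp_w: float,      # log pi_theta(y_w | x)
    ref_logp_w: float,  # log pi_ref(y_w | x)
    logp_l: float,      # log pi_theta(y_l | x)
    ref_logp_l: float,  # log pi_ref(y_l | x)
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-1.0, -1.0, -2.0, -2.0))
# Policy raises the winner and lowers the loser relative to the reference: lower loss.
print(dpo_loss(-1.0, -2.0, -3.0, -2.0))
```

The expectation in the formula corresponds to averaging this value over a batch of pairs.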

Training Data:

RAG-ProGuide Dataset: Constructed using MCTS on 3,000 questions each from PopQA, HotpotQA, and 2WikiMultihopQA
Filtered to 4,603 questions and 13,289 distinct process-level preference pairs
Pairs selected based on reward difference > 0.01 and distinct sequences
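
The pair-selection rule can be sketched as follows, assuming each candidate next step at a given state already carries an estimated reward. The `Candidate` type and function name are hypothetical; only the 0.01 threshold and the distinct-sequence condition come from the description above.

```python
from itertools import combinations
from typing import List, Tuple

Candidate = Tuple[str, float]  # (step text, estimated reward)


def build_preference_pairs(
    candidates: List[Candidate], min_gap: float = 0.01
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs for sibling steps at the same state."""
    pairs = []
    for (text_a, r_a), (text_b, r_b) in combinations(candidates, 2):
        if text_a == text_b:
            continue  # sequences must be distinct
        if abs(r_a - r_b) <= min_gap:
            continue  # reward difference must exceed the threshold
        chosen, rejected = (text_a, text_b) if r_a > r_b else (text_b, text_a)
        pairs.append((chosen, rejected))
    return pairs


candidates = [
    ("search: author of Dune", 0.9),
    ("search: Dune", 0.3),
    ("search: Dune novel", 0.305),
]
pairs = build_preference_pairs(candidates)
for chosen, rejected in pairs:
    print(f"{chosen!r} preferred over {rejected!r}")
```

The two low-reward queries differ by only 0.005, so they do not form a pair with each other.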

Key Hyperparameters:

learning_rate: 5e-7
batch_size: 64
beta: 0.1 (DPO parameter)
epoch: 2
max_length: 4096
warmup_ratio: 0.1
scheduler: cosine
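
To make the learning-rate settings concrete, here is a sketch of a cosine schedule with warmup. It assumes linear warmup over the first `warmup_ratio` of steps and decay to zero, which is a common convention but is not confirmed by the summary; the function name and the total step count in the usage lines are illustrative.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-7,
    warmup_ratio: float = 0.1,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed convention)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


total = 100  # illustrative step count, not from the paper
for s in (0, 5, 10, 55, 100):
    print(s, lr_at_step(s, total))
```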

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: ReasonRAG uses process-level rewards via MCTS/SPRE vs. Search-R1's outcome-only rewards; ReasonRAG requires 5k training samples vs. Search-R1's 90k
vs. Self-RAG: ReasonRAG optimizes the entire decision process via implicit preference learning (DPO) rather than explicit critique tokens
vs. ReAct [not cited in paper]: ReAct prompts for reasoning and acting but typically lacks the learned process-level preference optimization of ReasonRAG

Limitations

Process reward annotation relies on MCTS simulations, which can be computationally expensive during the data generation phase
Performance depends on the quality of the base model's ability to perform rollouts during MCTS
Experiments primarily focus on Qwen2.5-7B; scaling laws for larger or smaller models are not explicitly explored

Reproducibility

Code: https://github.com/Applied-Machine-Learning-Lab/ReasonRAG

The code is publicly available at the repository above. The RAG-ProGuide dataset is introduced in the paper and is presumably released alongside the code. The base model, Qwen2.5-7B-Instruct, has open weights.

📊 Experiments & Results

Evaluation Setup

Open-domain question answering requiring retrieval and multi-step reasoning

Benchmarks:

HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
PopQA (Entity-centric, single-hop QA)
StrategyQA (Reasoning QA)
TriviaQA (Reading Comprehension)

Metrics:

F1 Score
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
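
Both metrics are usually computed on normalized answer strings. The sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, token-level overlap); whether the paper uses exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def f1_score(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("Frank Herbert wrote it", "Frank Herbert"))
```

F1 gives partial credit to a verbose but correct answer, while EM scores it as zero, which is why both are typically reported together.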

Key Results

Comparative analysis against state-of-the-art baselines showing ReasonRAG's superior performance across multiple datasets, particularly in multi-hop scenarios.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	52.8 (Search-R1)	55.5	+2.7
2WikiMultihopQA	F1	46.2	53.7	+7.5
PopQA	F1	49.6	50.5	+0.9
2WikiMultihopQA	F1	44.5 (GPT-4o)	53.7	+9.2

Ablation studies demonstrate the critical contribution of the process-supervised data (RAG-ProGuide) compared to other training strategies.

Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	53.4	55.5	+2.1

Experiment Figures

Win/Tie/Loss rates of ReasonRAG compared to Search-R1 and GPT-4o

Performance curves across different training data scales

Main Takeaways

Process-supervised RL is significantly more data-efficient than outcome-supervised RL for Agentic RAG (achieving better results with 5k vs 90k samples)
The Shortest Path Reward Estimation (SPRE) successfully balances correctness and efficiency, preventing the model from taking unnecessarily long reasoning paths
MCTS is an effective strategy for automatically constructing high-quality process-level reward datasets without human annotation
ReasonRAG consistently outperforms strong baselines (Search-R1, Self-RAG, GPT-4o) across varying levels of reasoning complexity (single-hop to multi-hop)

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) fundamentals
Retrieval-Augmented Generation (RAG) workflows
Monte Carlo Tree Search (MCTS)

Key Terms

RAG: Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents

Agentic RAG: A RAG system where the LLM autonomously decides when to search, what to query, and when to stop, rather than following a fixed pipeline

Process-supervised RL: Reinforcement learning that provides feedback at each intermediate step of reasoning, rather than just for the final result

MCTS: Monte Carlo Tree Search—a search algorithm that builds a decision tree by randomly simulating future outcomes to find optimal moves

DPO: Direct Preference Optimization—a method that aligns language models to preferences by optimizing on paired examples (winner vs. loser) without a separate reward model

SPRE: Shortest Path Reward Estimation—a novel reward function in this paper that favors trajectories yielding correct answers in fewer steps

Outcome-supervised RL: RL where the model only receives a reward signal (positive/negative) after generating the complete final answer

Rollout: A simulation in MCTS where the model continues generating from a specific state to the end to estimate the value of that state

UCB: Upper Confidence Bound—a formula used in search algorithms to balance exploring new uncertain paths vs. exploiting known good paths

F1 score: A metric measuring the token-level overlap between the predicted answer and the ground truth answer, computed as the harmonic mean of precision and recall
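
The UCB entry above can be made concrete with the classic UCB1 rule often used for node selection in MCTS. Which UCB variant and exploration constant the paper uses is not stated in this summary, so both are assumptions here, as are the function and parameter names.

```python
import math


def ucb1(
    mean_value: float,
    visits: int,
    parent_visits: int,
    c: float = math.sqrt(2),
) -> float:
    """UCB1 score: exploitation term plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / visits)


# Same average value, but the less-visited child gets the larger bonus.
rarely_visited = ucb1(mean_value=0.5, visits=1, parent_visits=10)
often_visited = ucb1(mean_value=0.5, visits=5, parent_visits=10)
print(rarely_visited > often_visited)
```

During selection, the search descends into the child with the highest score, so the bonus term shrinks as a branch accumulates visits.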