Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

📝 Paper Summary

Agentic RAG pipeline
Reinforcement Learning for RAG

ReasonRAG improves agentic RAG by using Monte Carlo Tree Search to generate high-quality process-level supervision data, enabling an LLM to learn efficient multi-step retrieval and reasoning via Direct Preference Optimization.

Core Problem

Existing agentic RAG systems relying on outcome-supervised reinforcement learning suffer from low exploration efficiency, sparse rewards, and gradient conflicts because they only receive feedback after the final answer.

Why it matters:

Outcome-based supervision penalizes entire reasoning chains even if early steps were correct, leading to inefficient learning
High-quality process-level annotation is prohibitively expensive to obtain manually
Current systems struggle with complex multi-step queries requiring dynamic retrieval decisions

Concrete Example: In outcome-based RL (like Search-R1), if a model makes a correct search query but fails the final synthesis, the valid search action is penalized. ReasonRAG provides rewards for the intermediate search step.

Key Novelty

Process-Supervised RL with MCTS-generated Data (ReasonRAG)

Uses Monte Carlo Tree Search (MCTS) to explore possible RAG trajectories (query generation, evidence extraction, answering) and discover high-quality paths
Introduces Shortest Path Reward Estimation (SPRE) to automatically assign rewards to intermediate steps, prioritizing correctness and efficiency (shorter paths)
Compiles these paths into a process-level preference dataset (RAG-ProGuide) to train the policy via DPO

Architecture

The ReasonRAG framework illustrating (a) Process-Supervised Data Construction via MCTS and SPRE, and (b) The Agentic RAG Inference workflow.

Evaluation Highlights

Outperforms Search-R1 on HotpotQA (48.9% vs 47.0% F1) using only 5k training queries compared to Search-R1's 90k
Achieves higher average performance (34.4% EM, 42.3% F1) across 5 benchmarks compared to Search-R1 (32.8% EM, 40.7% F1)
Demonstrates strong out-of-domain generalization, beating baselines on Bamboogle and MuSiQue datasets

Breakthrough Assessment

8/10

Significant for demonstrating that process-level supervision derived from MCTS is far more data-efficient than outcome-based RL for agentic RAG, achieving better results with roughly 5% of the training queries (5k vs. Search-R1's 90k).

⚙️ Technical Details

Problem Definition

Setting: Multi-step Question Answering where an agent must dynamically decide to retrieve, extract evidence, or answer

Inputs: Natural language question x

Outputs: Final response y, produced after a sequence of intermediate actions

Pipeline Flow

Reasoning State: The LLM decides whether to generate a search query or a final answer
Grounding State: If a query was generated, Retrieve docs -> Extract Evidence -> Append to Context -> Loop back to Reasoning
Terminal State: Generate the Final Answer and stop
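
The three states above can be sketched as a simple control loop. This is a minimal illustration, not the paper's implementation: run_agentic_rag, the decide/retrieve/extract callables, and the max_steps budget are all hypothetical names introduced here, and the toy stand-ins at the bottom replace the real LLM and retriever.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callables standing in for the policy LLM, retriever, and extractor.
Decide = Callable[[str, List[str]], Tuple[str, str]]  # returns ("query" | "answer", text)
Retrieve = Callable[[str], List[str]]                  # query -> documents
Extract = Callable[[str, List[str]], str]              # (query, documents) -> evidence


def run_agentic_rag(
    question: str,
    decide: Decide,
    retrieve: Retrieve,
    extract: Extract,
    max_steps: int = 5,  # assumed step budget, not taken from the paper
) -> Tuple[Optional[str], List[str]]:
    """Alternate between reasoning and grounding until an answer is produced.

    Returns (answer, context); answer is None if max_steps is exhausted.
    """
    context: List[str] = []
    for _ in range(max_steps):
        action, text = decide(question, context)  # Reasoning state
        if action == "answer":                     # Terminal state
            return text, context
        documents = retrieve(text)                 # Grounding state: retrieve
        context.append(extract(text, documents))   # extract evidence and append
    return None, context


# Toy stand-in policy: query once, then answer with the extracted evidence.
def toy_decide(question: str, context: List[str]) -> Tuple[str, str]:
    return ("answer", context[-1]) if context else ("query", question)


answer, context = run_agentic_rag(
    "Who wrote Dune?",
    toy_decide,
    retrieve=lambda query: ["Dune was written by Frank Herbert."],
    extract=lambda query, docs: docs[0],
)
print(answer)
```

Passing the components as callables keeps the loop independent of any particular LLM or search backend, which is convenient when the same loop is reused for both MCTS data construction and inference.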

System Modules

Policy Model (Reasoning & Generation)

Orchestrates the entire process: decides actions (Query, Answer) and generates content

Model or implementation: Qwen2.5-7B-Instruct

Search Engine/Retriever

External tool to fetch documents based on generated queries

Model or implementation: Not stated explicitly in this summary; DuckDuckGo is inferred from context, since agentic RAG systems typically use web search or a dense retriever

Evidence Extractor (Reasoning & Generation)

Selects relevant spans from retrieved documents

Model or implementation: Qwen2.5-7B-Instruct (same as the Policy Model)

Novel Architectural Elements

MCTS-guided data construction pipeline tailored for RAG actions (Query Gen, Evidence Extraction, Answer Gen)
Shortest Path Reward Estimation (SPRE) mechanism integrated into MCTS to penalize inefficiency
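
A rough sketch of how an SPRE-style estimate could be computed. It assumes the reward of a state is the average, over k rollouts, of the final-answer correctness discounted by alpha raised to the number of steps in that rollout. This form is an assumption consistent with the summary's description (a decay factor and a preference for shorter paths), not the paper's exact formula; spre_estimate and the default alpha = 0.9 are illustrative.

```python
from typing import List, Tuple

# A rollout is summarized as (num_steps, correctness), where correctness is a
# score in [0, 1] for the final answer (for example, F1 against the gold answer).
Rollout = Tuple[int, float]


def spre_estimate(rollouts: List[Rollout], alpha: float = 0.9) -> float:
    """Mean of alpha**steps * correctness over the rollouts (assumed form)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    if not rollouts:
        return 0.0
    discounted = [(alpha ** steps) * score for steps, score in rollouts]
    return sum(discounted) / len(discounted)


# Two states whose rollouts are equally correct but differ in length.
short_path = [(2, 1.0), (2, 1.0)]
long_path = [(5, 1.0), (5, 1.0)]
print(spre_estimate(short_path) > spre_estimate(long_path))
```

With alpha = 1 the discount vanishes and the estimate reduces to plain average correctness, so alpha controls how strongly longer reasoning chains are penalized.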

Modeling

Base Model: Qwen2.5-7B-Instruct

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Optimize policy to favor high-reward reasoning paths over low-reward ones.

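Formally: minimize the DPO loss -log σ(β * log(π(y_w|x)/π_ref(y_w|x)) - β * log(π(y_l|x)/π_ref(y_l|x))), where y_w is the preferred step, y_l is the rejected step, π is the policy being trained, π_ref is the frozen reference policy, and β controls how far π may drift from π_ref.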
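
The loss for a single preference pair can be computed directly from sequence log-probabilities. The sketch below is a plain-Python illustration of the standard DPO formula; the function name dpo_loss and the default beta = 0.1 are assumptions (the paper's beta is not reported).

```python
import math


def dpo_loss(
    logp_w: float,      # log pi(y_w | x) under the policy being trained
    logp_l: float,      # log pi(y_l | x) under the policy being trained
    ref_logp_w: float,  # log pi_ref(y_w | x) under the frozen reference
    ref_logp_l: float,  # log pi_ref(y_l | x) under the frozen reference
    beta: float = 0.1,  # assumed value; not reported in the paper
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (log-ratio_w - log-ratio_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is lower when the policy raises the preferred step's likelihood
# relative to the reference and lowers the rejected step's.
aligned = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
misaligned = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
print(aligned < misaligned)
```

When the policy equals the reference, the margin is zero and the loss is log 2, which is the value at the start of training.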

Adaptation: Full fine-tuning (inferred; the adaptation method is not stated explicitly)

Training Data:

Source: 3,000 questions each from PopQA, HotpotQA, 2WikiMultihopQA
Method: MCTS exploration with SPRE rewards
Filtering: Prune failed branches; keep pairs with reward gap > 0.01
Final Size: 4,603 questions yielding 13,289 preference pairs
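
One way to picture the filtering step: given candidate next steps that share the same state and carry estimated rewards, keep only pairs whose reward gap exceeds the threshold. The sketch assumes pairs are formed among sibling candidates from the same state; that pairing rule, the function name build_preference_pairs, and the dictionary layout are illustrative assumptions. Only the 0.01 gap comes from the text above.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def build_preference_pairs(
    candidates: List[Tuple[str, float]],  # (step_text, estimated_reward)
    min_gap: float = 0.01,
) -> List[Dict[str, str]]:
    """Pair up candidate steps from one state, keeping pairs with a clear gap."""
    pairs = []
    for (text_a, reward_a), (text_b, reward_b) in combinations(candidates, 2):
        if abs(reward_a - reward_b) <= min_gap:
            continue  # rewards too close to express a reliable preference
        if reward_a > reward_b:
            chosen, rejected = text_a, text_b
        else:
            chosen, rejected = text_b, text_a
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


candidates = [
    ("query: author of Dune", 0.80),
    ("query: Dune", 0.30),
    ("query: Dune novel", 0.305),
]
for pair in build_preference_pairs(candidates):
    print(pair)
```

In this toy input the second and third candidates differ by only 0.005, so they produce no pair, while each is paired against the clearly better first candidate.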

Key Hyperparameters:

beta: Not explicitly reported in the paper (standard DPO parameter)
alpha (SPRE decay): Decay factor in (0, 1]; the specific value is not given in this summary
k (rollouts): Rollout count used for reward estimation; the specific value is not given in this summary

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: Uses process-level rewards (step-by-step) vs outcome-level rewards; significantly higher data efficiency (5k vs 90k queries)
vs. AutoRAG: Focuses on dynamic agentic reasoning during inference rather than static pipeline optimization
vs. Self-RAG [not cited in paper]: Uses MCTS to construct preference data for DPO rather than training critique tokens via supervised learning

Limitations

Dependence on the quality of SPRE reward estimates, which are derived from the final-answer correctness of rollouts
Computational cost of MCTS during the data generation phase is high compared to standard sampling
Evaluation limited to a single 7B-parameter model (Qwen2.5-7B)
Performance gains on the single-hop dataset (PopQA) are smaller than on multi-hop datasets

Reproducibility

Code: https://github.com/Applied-Machine-Learning-Lab/ReasonRAG

The RAG-ProGuide dataset is introduced in the paper; its download link is provided in the repository. DPO hyperparameters (beta, learning rate) are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Open-domain QA on single-hop and multi-hop datasets using Qwen2.5-7B-Instruct as backbone

Benchmarks:

PopQA: single-hop QA, long-tail
HotpotQA: multi-hop QA
2WikiMultihopQA: multi-hop QA
Bamboogle: multi-hop QA, out-of-domain
MuSiQue: multi-hop QA, out-of-domain

Metrics:

Exact Match (EM)
F1 Score
Statistical methodology: Not explicitly reported in the paper
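
For reference, EM and token-level F1 are commonly computed as below. The normalization (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention; whether the paper applies exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(round(f1_score("Paris France", "Paris"), 3))
```

F1 gives partial credit when a prediction contains the gold answer plus extra tokens, which is why F1 values in the results are consistently higher than EM.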

Key Results

Comparative analysis against state-of-the-art baselines showing ReasonRAG's superior performance, particularly in multi-hop scenarios.
Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
HotpotQA	F1	47.0	48.9	+1.9
HotpotQA	EM	35.2	38.5	+3.3
2WikiMultihopQA	F1	34.5	39.6	+5.1
MuSiQue	F1	33.7	35.3	+1.6
Average (All 5 datasets)	F1	40.7	42.3	+1.6

Experiment Figures

Comparison of training efficiency between ReasonRAG and Search-R1.

Statistics of the RAG-ProGuide dataset.

Main Takeaways

Process-supervised RL (ReasonRAG) is significantly more data-efficient than outcome-supervised RL (Search-R1), achieving better results with ~95% fewer training queries.
Improvements are most pronounced on multi-hop datasets (HotpotQA, 2Wiki), validating the method's ability to handle complex reasoning chains.
The method generalizes well to out-of-domain datasets (Bamboogle, MuSiQue) that were not seen during the MCTS data construction phase.
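
The data-efficiency figures in the first takeaway can be checked with simple arithmetic. The snippet uses the rounded query counts quoted in this summary (5k and 90k) as well as the 4,603-question final dataset size; treating the latter as the training query count is an assumption.

```python
from typing import Tuple


def data_fraction(ours: int, baseline: int) -> Tuple[float, float]:
    """Return (fraction of baseline data used, relative reduction)."""
    fraction = ours / baseline
    return fraction, 1.0 - fraction


for label, count in [("rounded 5k", 5_000), ("final dataset size", 4_603)]:
    fraction, reduction = data_fraction(count, 90_000)
    print(f"{label}: {fraction:.1%} of the queries, {reduction:.1%} fewer")
```

Both counts land between 5% and 6% of Search-R1's query volume, in line with the approximate figures quoted above.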

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
Reinforcement Learning (RL)
Monte Carlo Tree Search (MCTS)
Direct Preference Optimization (DPO)

Key Terms

Agentic RAG: A RAG system where the LLM dynamically controls the retrieval process, rather than following a fixed retrieve-then-generate pipeline

Process-Supervised RL: Reinforcement learning that provides feedback at intermediate steps of a reasoning chain, rather than just the final outcome

Outcome-Supervised RL: Reinforcement learning where the model is rewarded only based on the correctness of the final answer

MCTS: Monte Carlo Tree Search—a search algorithm that balances exploration and exploitation to find optimal decision paths by building a search tree

SPRE: Shortest Path Reward Estimation—a proposed reward function that values correct answers and penalizes longer reasoning chains to encourage efficiency

DPO: Direct Preference Optimization—a method to align language models to preferences without training a separate reward model, using a specific loss function on preference pairs

RAG-ProGuide: The novel dataset created by this paper, containing 13,289 process-level preference pairs derived from MCTS exploration

Gradient Conflict: A phenomenon in outcome-supervised learning where a negative final reward penalizes correct intermediate steps, causing conflicting update signals

UCB: Upper Confidence Bound—a strategy used in MCTS to select nodes that balances the estimated value of a node with the uncertainty of that estimate
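
A minimal sketch of UCB-based child selection as it is commonly used in MCTS (the UCB1 form). The exploration constant c = 1.4 and the dictionary layout for nodes are illustrative assumptions; the paper's exact variant and constant are not given in this summary.

```python
import math
from typing import Dict, List


def ucb_score(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Mean value plus an exploration bonus that shrinks as visits grow."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children: List[Dict], parent_visits: int, c: float = 1.4) -> Dict:
    """Pick the child with the highest UCB score."""
    return max(
        children,
        key=lambda child: ucb_score(child["value"], child["visits"], parent_visits, c),
    )


children = [
    {"name": "well_explored", "value": 3.0, "visits": 4},
    {"name": "unvisited", "value": 0.0, "visits": 0},
]
print(select_child(children, parent_visits=4)["name"])
```

Setting c to 0 turns selection into pure exploitation of the current mean value, while larger values push the search toward rarely visited actions.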