DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

📝 Paper Summary

Agentic RAG pipeline, Process Supervision

DecEx-RAG models retrieval-augmented generation as a Markov Decision Process that decouples decision-making from execution, using a pruning strategy to efficiently construct process-supervision data for training.

Core Problem

Current outcome-supervised RL methods for Agentic RAG suffer from inefficient exploration, sparse rewards, and ambiguous feedback that fails to optimize intermediate steps like sub-question generation or retrieval decisions.

Why it matters:

Sparse reward signals in outcome-based RL require excessive data and training steps to converge
Global rewards (final answer correctness) cannot distinguish which specific step (retrieval vs. reasoning) caused a failure
Inefficient exploration in search trees leads to exponential computational costs when generating training data for multi-step reasoning

Concrete Example: In Search-R1, a model might correctly retrieve information about a director's nationality yet still output the wrong answer 'No'. This is a reward-hacking failure in which the reasoning process contradicts the conclusion. DecEx-RAG's step-level supervision is intended to keep the reasoning path consistent with the final answer.

Key Novelty

Decoupled Decision and Execution MDP with Pruning (DecEx-RAG)

Models RAG as an MDP with two distinct stages: 'Decision' (whether to stop or retrieve) and 'Execution' (generating the actual sub-question or query), allowing fine-grained optimization of both efficiency and quality
Introduces a 'Pruning Search' strategy during data generation that simulates outcomes (rollouts) for intermediate steps, pruning redundant branches to keep data expansion linear rather than exponential
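The linear-versus-exponential claim can be illustrated with a rough node count. This is a minimal sketch under stated assumptions: a fixed number of candidates per step (`branching`), a fixed depth (`depth`), and pruning that keeps exactly one surviving branch per step. These parameters and function names are hypothetical and are not taken from the paper, and the count ignores the cost of the rollouts themselves.

```python
def full_expansion_nodes(branching: int, depth: int) -> int:
    """Nodes created when every candidate at every level is expanded further."""
    return sum(branching ** level for level in range(1, depth + 1))


def pruned_expansion_nodes(branching: int, depth: int) -> int:
    """Nodes created when only one surviving branch is expanded at each level."""
    return branching * depth


# Illustrative values only: 3 candidates per step, 4 steps.
full = full_expansion_nodes(3, 4)      # 3 + 9 + 27 + 81
pruned = pruned_expansion_nodes(3, 4)  # 3 * 4
```

Under these assumptions the unpruned tree grows geometrically with depth, while the pruned tree grows in proportion to it.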

Architecture

The DecEx-RAG framework showing the MDP process of search tree expansion, pruning, and training data construction.

Evaluation Highlights

+6.3 points (absolute) average improvement over the outcome-supervised baseline Search-R1 across six QA datasets
Pruning speeds up data construction by ~6x compared with no-pruning expansion while maintaining equivalent model performance
Outperforms Search-R1 by +8.6 F1 points on HotpotQA and +11.8 F1 points on 2WikiMultiHopQA

Breakthrough Assessment

8/10

Significant data-efficiency gains (6x faster data construction) and consistent performance improvements over strong RL baselines such as Search-R1. The explicit decoupling of decision and execution in the MDP formulation offers a cleaner framework for Agentic RAG.

⚙️ Technical Details

Problem Definition

Setting: Multi-step Question Answering modeled as a Markov Decision Process (MDP)

Inputs: Natural language question Q

Outputs: Final answer o via a sequence of states s_t containing sub-questions and retrieval results

Pipeline Flow

State Initialization: Start with question Q
Decision-Making: Decide whether to Terminate (generate final answer) or Continue (generate sub-question)
Retrieval Decision: If continuing, decide whether to use Internal Knowledge (generate answer directly) or Retrieve (generate query + search)
Execution: Execute the chosen action (generate text or retrieve docs)
State Update: Append results to history and repeat
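The pipeline steps above can be sketched as a simple loop. This is an illustrative sketch, not the authors' implementation: the `Policy` interface, its method names, and the toy stand-ins are all hypothetical, and the defaults `t_max=4` and `k=3` mirror the hyperparameters listed in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol, Tuple


@dataclass
class State:
    question: str
    # Each entry is (sub_question, intermediate_answer).
    history: List[Tuple[str, str]] = field(default_factory=list)


class Policy(Protocol):
    # Hypothetical interface standing in for the policy model.
    def should_terminate(self, state: State) -> bool: ...
    def sub_question(self, state: State) -> str: ...
    def should_retrieve(self, state: State, sub_q: str) -> bool: ...
    def answer_internal(self, state: State, sub_q: str) -> str: ...
    def query(self, state: State, sub_q: str) -> str: ...
    def answer_from_docs(self, state: State, sub_q: str, docs: List[str]) -> str: ...
    def final_answer(self, state: State) -> str: ...


def run_episode(question: str, policy: Policy,
                retrieve: Callable[[str, int], List[str]],
                t_max: int = 4, k: int = 3) -> str:
    state = State(question)
    for _ in range(t_max):
        if policy.should_terminate(state):         # decision: terminate or continue
            break
        sub_q = policy.sub_question(state)         # execution: sub-question
        if policy.should_retrieve(state, sub_q):   # decision: retrieve or internal
            docs = retrieve(policy.query(state, sub_q), k)  # execution: query + search
            result = policy.answer_from_docs(state, sub_q, docs)
        else:
            result = policy.answer_internal(state, sub_q)
        state.history.append((sub_q, result))      # state update
    return policy.final_answer(state)


class ToyPolicy:
    """Deterministic stand-in used only to exercise the loop."""
    def should_terminate(self, state): return len(state.history) >= 2
    def sub_question(self, state): return f"sub-question {len(state.history) + 1}"
    def should_retrieve(self, state, sub_q): return len(state.history) == 0
    def answer_internal(self, state, sub_q): return "internal answer"
    def query(self, state, sub_q): return sub_q
    def answer_from_docs(self, state, sub_q, docs): return f"answer from {len(docs)} docs"
    def final_answer(self, state): return " | ".join(r for _, r in state.history)


def toy_retrieve(query: str, k: int) -> List[str]:
    return [f"doc {i} for {query}" for i in range(k)]


answer = run_episode("toy question", ToyPolicy(), toy_retrieve)
```

Keeping the two decisions as separate calls from the content-generating calls mirrors the decision/execution split, so each could in principle be scored and optimized on its own.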

System Modules

Policy Model

Generates decisions (terminate/continue, retrieve/internal) and execution content (sub-questions, answers)

Model or implementation: Qwen2.5-7B-Instruct (base) / Qwen3-30B-A3B (expansion)

Retriever

Fetches external documents when the policy decides to retrieve

Model or implementation: E5 (frozen)

Novel Architectural Elements

Decoupled MDP structure: Distinct 'Decision' (method selection) vs. 'Execution' (content generation) steps
Distinction between 'sub-question' (logical decomposition) and 'sub-query' (search engine input) as separate optimized steps

Modeling

Base Model: Qwen2.5-7B-Instruct

Training Method: Two-stage training: SFT on optimal reasoning paths followed by DPO on decision/execution preference pairs

Objective Functions:

Purpose: SFT loss to learn correct reasoning paths.

Formally: Standard cross-entropy loss on optimal paths extracted from the search tree.

Purpose: DPO loss to align the model with preferred decisions and executions.

Formally: Standard DPO loss maximizing the margin between preferred (high-reward) and rejected (low-reward) branches.
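For reference, the two objectives can be written in their commonly used textbook forms. The notation here is an assumption and is not copied from the paper: x is the context (question plus history), y an optimal-path continuation, y_w and y_l the preferred and rejected branches, \pi_{\mathrm{ref}} the frozen reference policy (typically the SFT model), \sigma the logistic function, and \beta the DPO temperature.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}_{\mathrm{SFT}}}
    \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right)

\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\mathrm{pref}}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```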

Training Data:

HotpotQA (2,000 samples) and 2WikiMultiHopQA (1,000 samples) training subsets
Search tree expansion with pruning generates process-supervised data
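One plausible way to turn scored branches into DPO preference pairs is sketched below. It assumes that each candidate branch at a node is scored by averaging the final-answer scores of several rollouts, and that the highest- and lowest-scoring candidates form the chosen/rejected pair. The function names and the `min_margin` parameter are hypothetical, and the paper's exact selection rule may differ.

```python
from typing import List, Optional, Tuple


def rollout_reward(outcomes: List[float]) -> float:
    """Average the final-answer scores of several simulated completions."""
    return sum(outcomes) / len(outcomes)


def preference_pair(
    candidates: List[Tuple[str, List[float]]],
    min_margin: float = 0.0,
) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) texts, or None if the rewards do not separate."""
    scored = [(rollout_reward(outcomes), text) for text, outcomes in candidates]
    best = max(scored, key=lambda pair: pair[0])
    worst = min(scored, key=lambda pair: pair[0])
    if best[0] - worst[0] <= min_margin:
        return None  # no usable preference signal at this node
    return best[1], worst[1]


# Illustrative rollout outcomes (1.0 = correct final answer, 0.0 = incorrect).
pair = preference_pair([
    ("retrieve", [1.0, 1.0, 0.0]),
    ("internal", [0.0, 0.0, 1.0]),
])
```

Skipping nodes whose candidates score the same avoids training on pairs that carry no preference information.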

Key Hyperparameters:

max_iteration_limit_Tmax: 4
retrieved_documents_k: 3
temperature: Non-zero for sampling decisions (implied)
threshold: Preset threshold for skipping retrieval (value not explicitly stated)

Compute: Search tree expansion efficiency: Pruning Search is ~6x faster than No Pruning Search. Full Node Search takes >1 hour per question (impractical).

Comparison to Prior Work

vs. Search-R1: Uses process supervision (step-level rewards) via MDP instead of outcome supervision (final reward only)
vs. DeepRAG: Optimizes both decision AND execution (content quality), whereas DeepRAG focuses only on decision-making
vs. ReasonRAG: DecEx-RAG demonstrates better retrieval behavior and less hallucination; ReasonRAG tends to over-rely on internal knowledge

Limitations

Requires ground truth for reward calculation during data construction (cannot easily apply to open-ended tasks without verifiers)
Pruning relies on the assumption that locally optimal sub-steps lead to a globally optimal path (in practice, 85% of optimal reasoning paths were retained)
Training data scale is relatively small (3,000 samples total)

Reproducibility

Code: https://github.com/sdsxdxl/DecEx-RAG

Uses open-source Qwen models and standard datasets (HotpotQA, etc.). Prompt instructions are provided in Appendix A.1.

📊 Experiments & Results

Evaluation Setup

Open-domain QA (single-hop and multi-hop) using Wikipedia as knowledge source

Benchmarks:

HotpotQA: multi-hop QA (in-domain)
2WikiMultiHopQA: multi-hop QA (in-domain)
Bamboogle: multi-hop QA (out-of-domain)
PopQA: single-hop QA (out-of-domain)
NQ: single-hop QA (out-of-domain)
AmbigQA: single-hop QA (out-of-domain)

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper
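The two metrics can be computed as sketched below. This sketch assumes the SQuAD-style answer normalization commonly used for open-domain QA (lowercasing, stripping punctuation and English articles). The summary does not state which normalization the paper applies, so treat that choice as an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, the prediction "Paris France" against the gold answer "Paris" has precision 1/2 and recall 1, giving a non-zero F1 even though exact match is 0.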

Key Results

DecEx-RAG significantly outperforms outcome-supervised and prompt-based baselines across all six datasets.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	54.7	63.3	+8.6
2WikiMultiHopQA	F1	46.2	58.0	+11.8
PopQA	F1	52.7	55.7	+3.0
Average (6 datasets)	F1	not reported	not reported	+6.3
The six-dataset averages are not reported as single numbers in the text; the +6.3 average delta is inferred from the individual deltas.

Ablation studies confirm the necessity of both SFT and DPO stages.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	58.1	63.3	+5.2
HotpotQA	F1	39.6	63.3	+23.7

Experiment Figures

Ablation studies on data selection strategies (Most vs Least retrieval cost) and preference data composition (Decision vs Execution data).

Main Takeaways

Process supervision via decoupled MDP yields significantly higher data efficiency than outcome supervision (Search-R1 needs more data to converge).
Pruning strategy is highly effective: retains 85% of optimal reasoning paths while reducing data construction time by nearly 6x.
Models trained with 'Most Retrieval Cost' strategy (encouraging verification) outperform those trained to minimize retrieval, indicating deliberation improves accuracy.
Combining SFT (learning basic patterns) and DPO (optimizing decisions) is critical; neither works well in isolation.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (MDP formulation)
Retrieval-Augmented Generation (RAG)
Process Supervision vs. Outcome Supervision
Direct Preference Optimization (DPO)

Key Terms

MDP: Markov Decision Process—a mathematical framework for modeling decision-making where outcomes are partly random and partly under the control of a decision maker

Process Supervision: Training models using feedback on intermediate reasoning steps rather than just the final result

Outcome Supervision: Training models based solely on whether the final answer is correct

DPO: Direct Preference Optimization—a method to align language models to preferences without training a separate reward model

SFT: Supervised Fine-Tuning—training the model on high-quality examples before applying RL

Rollout: Simulating the completion of a task from a certain state to estimate the future reward

Pruning: Removing unpromising branches in a search tree to save computation

Search-R1: A strong baseline method using outcome-supervised reinforcement learning for RAG

F1 score: A metric measuring the overlap between the predicted answer and the ground truth

EM: Exact Match—a metric requiring the predicted answer to be identical to the ground truth

HotpotQA: A dataset for multi-hop question answering requiring reasoning over multiple documents