Beyond Outcome Reward: Decoupling Search and Answering Improves LLM Agents

📝 Paper Summary

Agentic RAG pipeline, RL-based tool use

DeSA improves search agents by decoupling training into two stages: first maximizing retrieval recall to learn search skills, then optimizing exact-match answer accuracy.

Core Problem

Training search agents solely with outcome-based rewards (such as final answer accuracy) fails to teach effective intermediate search behaviors, so agents skip searches, repeat queries, or issue invalid tool calls.

Why it matters:

Outcome-only supervision provides sparse, delayed feedback, causing credit assignment challenges where agents don't learn which specific actions led to success
Ineffective search behaviors (e.g., duplicate queries, invalid tool calls) persist even when final answer accuracy improves slightly, capping overall potential
Current single-stage RL methods assume that optimizing final answers implicitly optimizes search; the paper's empirical analysis indicates this assumption does not hold

Concrete Example: A Qwen2.5-3B agent trained only on outcome rewards might skip searching entirely (relying on parametric memory) or issue the same query multiple times, wasting resources. In contrast, DeSA forces the agent to first demonstrate it can find the relevant documents before trying to answer.

Key Novelty

DeSA (Decoupling Search and Answering)

Two-stage RL framework: Stage 1 strictly rewards finding relevant documents (Recall Reward) to establish search competence
Stage 2 switches to outcome-based rewards (Exact Match) to refine answer generation, initializing from the competent searcher developed in Stage 1
Prevents the 'reward hacking' observed in single-stage methods where agents learn to answer without searching or develop degenerate search patterns

Architecture

The two-stage training process of DeSA

Evaluation Highlights

+8.0% relative improvement in average score over the Search-R1 baseline (0.336 → 0.363) using Qwen2.5-3B-Instruct across 7 QA benchmarks
+11.5 absolute point improvement on the Bamboogle benchmark (0.347 vs 0.232) using Qwen2.5-3B-Instruct
Reduces deficient search rate from 23.36% (Search-R1) to 6.96% (DeSA) on Qwen2.5-3B-Instruct
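
The highlights above mix a relative change (the 8.0% figure) with absolute changes (Bamboogle and the deficient search rate). The short check below recomputes each one from the numbers reported in this summary; the helper names are illustrative and not taken from the paper.

```python
def relative_gain_pct(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


def absolute_gain_points(baseline: float, new: float) -> float:
    """Absolute change on a 0-1 score scale, expressed in points (x100)."""
    return 100.0 * (new - baseline)


# Average score over 7 benchmarks, Qwen2.5-3B-Instruct: relative change.
avg_rel = round(relative_gain_pct(0.336, 0.363), 1)

# Bamboogle, Qwen2.5-3B-Instruct: absolute change in points.
bamboogle_abs = round(absolute_gain_points(0.232, 0.347), 1)

# Deficient search rate is already a percentage, so subtract directly.
deficient_drop = round(23.36 - 6.96, 2)

print(avg_rel, bamboogle_abs, deficient_drop)
```

Read this way, the average-score gain is 0.027 absolute (about 8% relative), while the Bamboogle and deficient-search figures are absolute differences.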

Breakthrough Assessment

8/10

Strong empirical evidence debunking the common assumption that outcome rewards suffice for tool-use. Simple, effective two-stage solution with significant gains.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making for Question Answering where an agent interacts with a search engine environment to gather information before answering.

Inputs: User query q

Outputs: Final answer a

Pipeline Flow

Agent receives query q
Loop: Agent generates Action (Search or Answer)
If Search: Environment returns top-k documents; History updated
If Answer: Process terminates
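
The loop above can be sketched as a small rollout function. This is a minimal illustration under stated assumptions, not the authors' implementation: the `("search", ...)` / `("answer", ...)` action encoding, the `max_turns` cap, and the toy policy and search engine are all hypothetical. Only `k = 3` comes from the hyperparameters listed in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical action encoding: ("search", query) or ("answer", text).
Action = Tuple[str, str]


@dataclass
class Trajectory:
    question: str
    queries: List[str] = field(default_factory=list)
    retrieved: List[str] = field(default_factory=list)
    answer: str = ""


def run_episode(
    question: str,
    policy: Callable[[str, List[str]], Action],
    search: Callable[[str, int], List[str]],
    k: int = 3,
    max_turns: int = 4,  # assumed cap; not stated in this summary
) -> Trajectory:
    """Roll out one episode: search until the policy answers or turns run out."""
    traj = Trajectory(question)
    for _ in range(max_turns):
        kind, text = policy(question, traj.retrieved)
        if kind == "answer":
            traj.answer = text
            break
        # Search action: record the query and add the top-k documents to history.
        traj.queries.append(text)
        traj.retrieved.extend(search(text, k))
    return traj


# Toy stand-ins for the LLM policy and the retriever (illustration only).
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Rome is the capital of Italy.",
]


def toy_search(query: str, k: int) -> List[str]:
    words = query.lower().split()
    return [d for d in CORPUS if any(w in d.lower() for w in words)][:k]


def toy_policy(question: str, history: List[str]) -> Action:
    # Search once, then answer from whatever was retrieved.
    return ("search", "France") if not history else ("answer", "Paris")


traj = run_episode("What is the capital of France?", toy_policy, toy_search)
print(traj.queries, traj.answer)
```

Keeping the queries and retrieved documents on the trajectory is what later allows a reward to be computed on search behavior rather than only on the final answer.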

System Modules

LLM Policy

Generates thoughts and actions (search queries or final answers) based on history

Model or implementation: Qwen2.5-3B-Instruct / Qwen2.5-7B-Instruct

Search Engine

Retrieves documents given a query

Model or implementation: E5 retriever + 2018 Wikipedia corpus

Novel Architectural Elements

Sequential two-stage optimization pipeline where the objective function changes entirely between stages (Recall → EM) to enforce skill acquisition order
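
One way to picture the objective switch is a training loop that swaps the reward function at a fixed step. The sketch below is an assumption-laden illustration: `train_two_stage`, `stage1_steps`, and the `update` callback are hypothetical names, and the paper is described here as choosing the transition point from Stage 1 training curves rather than from a preset step count.

```python
from typing import Callable, Dict, List

Rollout = Dict[str, float]
RewardFn = Callable[[Rollout], float]
UpdateFn = Callable[[Dict, List[float]], Dict]


def train_two_stage(
    state: Dict,
    batches: List[List[Rollout]],
    stage1_reward: RewardFn,
    stage2_reward: RewardFn,
    stage1_steps: int,
    update: UpdateFn,
) -> Dict:
    """Use the Stage 1 reward for the first `stage1_steps` steps, then Stage 2.

    The policy state carries over, so Stage 2 starts from the Stage 1 searcher.
    """
    for step, rollouts in enumerate(batches):
        reward_fn = stage1_reward if step < stage1_steps else stage2_reward
        rewards = [reward_fn(r) for r in rollouts]
        state = update(state, rewards)
    return state


def logging_update(state: Dict, rewards: List[float]) -> Dict:
    """Stub optimizer step: only records the mean reward it was given."""
    return {"log": state["log"] + [sum(rewards) / len(rewards)]}


# Each toy rollout carries precomputed recall and EM scores.
batches = [
    [{"recall": 1.0, "em": 0.0}],
    [{"recall": 0.0, "em": 1.0}],
    [{"recall": 1.0, "em": 0.0}],
]
final = train_two_stage(
    {"log": []},
    batches,
    stage1_reward=lambda r: r["recall"],
    stage2_reward=lambda r: r["em"],
    stage1_steps=2,
    update=logging_update,
)
print(final["log"])
```

The third batch has perfect recall but scores 0.0, because after the switch only exact match counts. A weighted-sum reward would instead blend both signals at every step.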

Modeling

Base Model: Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)
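
GRPO scores each rollout relative to the other rollouts sampled for the same question, which is why it needs no separate value network. The sketch below shows only that group-relative advantage step, assuming population standard deviation and a small `eps` for stability; the full objective typically also includes a clipped policy ratio and a KL penalty, which are omitted here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question with binary rewards (recall or EM).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same reward, all advantages are zero.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed, uniform)
```

With binary rewards, a group in which every rollout fails (or every rollout succeeds) yields zero advantage everywhere and so no gradient signal from that question.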

Objective Functions:

Purpose: Stage 1 Search Optimization.

Formally: R_recall = 1 if the ground-truth answer appears in retrieved_info, else 0

Purpose: Stage 2 Outcome Optimization.

Formally: R_EM = 1 if Normalized(answer) is in GroundTruth, else 0
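
Both rewards are binary checks and can be sketched in a few lines. The version below is a hedged illustration, not the paper's code: it assumes SQuAD-style normalization (lowercasing, removing punctuation and articles, collapsing whitespace) and assumes the recall check is a normalized substring match of any gold answer against the retrieved text.

```python
import re
import string
from typing import List

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def recall_reward(retrieved_docs: List[str], gold_answers: List[str]) -> float:
    """Stage 1: 1.0 if any gold answer appears in the retrieved text, else 0.0."""
    haystack = normalize(" ".join(retrieved_docs))
    golds = [normalize(g) for g in gold_answers]
    return 1.0 if any(g and g in haystack for g in golds) else 0.0


def em_reward(prediction: str, gold_answers: List[str]) -> float:
    """Stage 2: 1.0 if the normalized prediction equals a normalized gold answer."""
    return 1.0 if normalize(prediction) in {normalize(g) for g in gold_answers} else 0.0


docs = ["Paris is the capital of France."]
print(recall_reward(docs, ["Paris"]))          # gold answer found in documents
print(em_reward("The Paris.", ["Paris"]))      # matches after normalization
print(em_reward("Paris, France", ["Paris"]))   # extra tokens break exact match
```

Note that `recall_reward` never looks at the model's answer, so during Stage 1 the agent is rewarded only for what its queries retrieve.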

Training Data:

Natural Questions (NQ) training split
HotpotQA training split

Key Hyperparameters:

retrieved_passages_k: 3
knowledge_corpus: 2018 Wikipedia

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: DeSA separates search learning (recall reward) from answer learning (EM reward), whereas Search-R1 optimizes EM directly
vs. IRCoT: DeSA optimizes the agent via RL to learn when/how to search, rather than relying on prompting
vs. Simple Linear Combination (Recall + EM) [not cited in paper]: DeSA uses sequential stages rather than a weighted sum reward, avoiding conflicting gradients

Limitations

Requires ground truth answers for rewards (cannot easily apply to open-ended tasks without verifiers)
Relies on a static Wikipedia corpus (2018), limiting testing on real-time web search scenarios
Analysis focuses on relatively small models (3B/7B); scaling behavior at 70B+ was not tested

Reproducibility

Code: https://github.com/yiding-w/DeSA

Code and artifacts are available at the repository linked above. The work uses standard datasets (NQ, HotpotQA) and a standard metric (EM). Computing infrastructure details (e.g., GPU hours) are not reported.

📊 Experiments & Results

Evaluation Setup

Search-augmented QA on 7 benchmarks using Wikipedia

Benchmarks:

NaturalQuestions (General QA)
TriviaQA (General QA)
PopQA (General QA)
HotpotQA (Multi-Hop QA)
2WikiMultiHopQA (Multi-Hop QA)
Musique (Multi-Hop QA)
Bamboogle (Multi-Hop QA)

Metrics:

Exact Match (EM) Accuracy
Search Recall
Deficient Search Rate

Statistical methodology: Not explicitly reported in the paper
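
The Deficient Search Rate can be illustrated with a small classifier over each trajectory's search calls, using the three categories the authors define (no search, duplicate queries, invalid searches). The details below are assumptions made for the sketch: a call that failed to parse is represented as `None`, duplicates are detected by case-insensitive exact match, and invalid calls take precedence over duplicates when both occur.

```python
from typing import List, Optional

# One trajectory = the list of its search queries; None marks a malformed call.
Queries = List[Optional[str]]


def classify_search(queries: Queries) -> str:
    """Label a trajectory as effective or as one of three deficient types."""
    if not queries:
        return "no_search"
    if any(q is None or not q.strip() for q in queries):
        return "invalid_search"
    normalized = [q.strip().lower() for q in queries]
    if len(set(normalized)) < len(normalized):
        return "duplicate_query"
    return "effective"


def deficient_search_rate(trajectories: List[Queries]) -> float:
    """Percentage of trajectories whose search behavior is not 'effective'."""
    labels = [classify_search(t) for t in trajectories]
    return 100.0 * sum(label != "effective" for label in labels) / len(labels)


sample = [
    [],                                          # answered from parametric memory
    ["capital of France", "Capital of France"],  # same query issued twice
    [None],                                      # malformed tool call
    ["Eiffel Tower architect", "Gustave Eiffel birthplace"],
]
print([classify_search(t) for t in sample], deficient_search_rate(sample))
```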

Key Results

Main comparison showing DeSA improvements over single-stage Search-R1 across model sizes.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	Average Score	0.396	0.418	+0.022
Average (7 datasets, Qwen2.5-3B-Instruct)	Average Score	0.336	0.363	+0.027
Bamboogle (Qwen2.5-3B-Instruct)	Score	0.232	0.347	+0.115

Behavioral analysis showing reduction in problematic search patterns.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets, Qwen2.5-3B-Instruct)	Deficient Search Rate (%)	23.36	6.96	-16.40
Average (7 datasets)	Search Recall (%)	59.50	64.50	+5.00

Ablation studying different Stage 1 reward signals.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	Score	0.363	0.354	-0.009

Experiment Figures

Comparison of Search Recall and Answer Accuracy between trajectories with 'Deficient' vs 'Effective' search behaviors

Training curves showing Recall and EM scores during Stage 1 training to determine transition point

Main Takeaways

Outcome-only rewards (EM) lead to 'deficient search behaviors' (skipping search, duplicates) because the feedback is too sparse and delayed
Decoupling training into Recall Optimization (Stage 1) followed by Answer Optimization (Stage 2) yields superior results compared to single-stage or mixed-reward training
The 'Recall' reward is the most robust signal for Stage 1; adding complex penalties or using fine-grained accuracy signals often degrades final performance
DeSA improves accuracy while also raising search recall (59.50% → 64.50%) and lowering the deficient search rate (23.36% → 6.96%)

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) basics
Retrieval-Augmented Generation (RAG) concepts
Understanding of Policy Optimization

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies using group-based relative advantages without a separate value network

Exact Match (EM): A metric and reward signal checking if the generated answer string exactly matches the ground truth (after normalization)

Recall Reward: A reward signal based on whether the retrieved documents contain the necessary information/answer

Deficient Search: Problematic behaviors defined by the authors: No Search (skipping retrieval), Duplicate Queries, or Invalid Searches (malformed syntax)

Credit Assignment: The problem in RL of determining which past action is responsible for a current reward

DeSA: Decoupling Search and Answering—the proposed two-stage training framework

SFT: Supervised Fine-Tuning—training on labeled data, often used as a starting point before RL

E5: A dense retrieval model used to fetch relevant passages based on semantic similarity