Improving Search Agent with One Line of Code

📝 Paper Summary

Agentic RAG pipeline, Reinforcement Learning for Agents

SAPO stabilizes search agent training by adding a conditional KL penalty that selectively targets positive tokens with low importance sampling ratios, preventing the policy from drifting too far from the old policy during updates.

Core Problem

Standard Group Relative Policy Optimization (GRPO) suffers from Importance Sampling Distribution Drift (ISDD), where the current policy suppresses correct intermediate steps that had high probability in the old policy.

Why it matters:

Catastrophic model collapse occurs even with hard clipping because gradients vanish when importance sampling ratios drop near zero
Search agents require long reasoning chains where intermediate positive actions are sparse and easily suppressed by negative advantages in early training
Existing methods like PPO clipping fail to correct distribution shifts when the policy assigns negligible probability to valid actions found by the old policy

Concrete Example: In GRPO, a response with a correct final answer may contain intermediate steps that the current policy treats as 'incorrect'. If the current policy is updated to suppress these steps (driving their probability toward zero), the importance sampling ratio vanishes, which kills the gradient signal and prevents the model from learning the valid path.
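The vanishing gradient can be checked numerically. The sketch below is illustrative rather than taken from the paper: the function name and the clip range `eps=0.2` are assumptions. It computes the gradient of the clipped surrogate min(r*A, clip(r)*A) with respect to the token's log-probability, using the fact that r = exp(log π − log π_old) implies dr/dlog π = r.

```python
def clipped_surrogate_grad(ratio, advantage, eps=0.2):
    """Gradient of min(r*A, clip(r, 1-eps, 1+eps)*A) w.r.t. log pi(token).

    An unclipped token contributes A * r (because dr/dlog_pi = r);
    a token on the clipped branch contributes 0.
    eps=0.2 is an assumed clip range, not a value from the paper.
    """
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * advantage
    # The min selects the unclipped branch whenever it is the smaller value.
    if unclipped <= clipped:
        return advantage * ratio
    return 0.0

# A positive-advantage token is never clipped from below, so its gradient
# is A * r and shrinks in proportion to the ratio itself.
for r in (1.0, 0.5, 0.1, 0.001):
    print(f"r={r:<6} grad={clipped_surrogate_grad(r, advantage=1.0):.4f}")
```

For a positive-advantage token the lower clip bound never activates, so the only thing scaling the update is r itself: once r is near zero, the token receives almost no push back toward the probability it had under the old policy.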

Key Novelty

Search Agent Policy Optimization (SAPO)

Introduces a conditional KL penalty that activates only for tokens with positive advantages (good outcomes) that have drifted excessively (low importance sampling ratio)
Acts as a soft trust region constraint specifically for 'positive tokens,' preventing the model from forgetting successful exploration paths found by the old policy
Implementation requires only a single line of code modification to the standard GRPO loss function

Architecture

Training dynamics of Search-R1 showing the collapse problem: IS ratios dropping to zero, clip ratios spiking, and reward deteriorating.

Evaluation Highlights

+10.6 points absolute EM improvement (+31.5% relative) over the Search-R1 baseline, averaged across seven QA benchmarks
+14.7 points absolute EM improvement on multi-hop QA benchmarks compared to Search-R1
Scales effectively from 1.5B to 14B parameters, achieving 0.495 average EM accuracy with Qwen2.5-14B
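As a quick consistency check, the relative figure can be recovered from the seven-dataset averages listed under Key Results (0.336 for Search-R1, 0.442 for SAPO), assuming the headline numbers refer to that same Qwen2.5-3B comparison:

```latex
\[
\Delta_{\text{abs}} = 0.442 - 0.336 = 0.106,
\qquad
\Delta_{\text{rel}} = \frac{0.106}{0.336} \approx 0.315 \;(\approx 31.5\%)
\]
```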

Breakthrough Assessment

8/10

Simple, highly effective fix for a fundamental instability in agent training (ISDD). Large empirical gains (+31.5% relative) and immediate deployability make it highly significant for the field.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn question answering where an agent generates reasoning steps, search queries, and final answers interleaved with retrieval

Inputs: User question q

Outputs: Final answer a, generated after a sequence of reasoning steps z and retrieved knowledge k

Pipeline Flow

Input Question → Search Agent (LLM) → [Reasoning Step + Search Query]
Search Query → Retriever → [Documents]
Agent conditions on [Question + Previous Context + Documents] → Next Step
Repeat until Final Answer
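The flow above can be sketched as a simple loop. This is a toy illustration under stated assumptions: `generate_step` and `retrieve` are hypothetical callables standing in for the LLM and the retriever, and the `SEARCH:` / `ANSWER:` / `DOCS:` prefixes are invented markers, not the prompt format used in the paper. The default turn budget of 5 mirrors the `max_search_turns` hyperparameter.

```python
from typing import Callable, List

SEARCH_PREFIX = "SEARCH:"   # hypothetical marker for a search query
ANSWER_PREFIX = "ANSWER:"   # hypothetical marker for the final answer

def run_search_agent(
    question: str,
    generate_step: Callable[[str], str],   # hypothetical LLM call
    retrieve: Callable[[str], List[str]],  # hypothetical retriever call
    max_search_turns: int = 5,
) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_search_turns):
        step = generate_step(context)
        context += step + "\n"
        if step.startswith(ANSWER_PREFIX):
            return step[len(ANSWER_PREFIX):].strip()
        if step.startswith(SEARCH_PREFIX):
            query = step[len(SEARCH_PREFIX):].strip()
            # Retrieved documents are appended so the next step conditions on them.
            context += "DOCS: " + " | ".join(retrieve(query)) + "\n"
    # Turn budget exhausted: ask once more for an answer from what was gathered.
    final = generate_step(context + "Give the final answer now.\n")
    if final.startswith(ANSWER_PREFIX):
        final = final[len(ANSWER_PREFIX):]
    return final.strip()

# Toy stand-ins so the loop can be run end to end.
def toy_llm(context: str) -> str:
    if "DOCS:" not in context:
        return "SEARCH: capital of France"
    return "ANSWER: Paris"

def toy_retriever(query: str) -> List[str]:
    return ["Paris is the capital of France."]

print(run_search_agent("What is the capital of France?", toy_llm, toy_retriever))
```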

System Modules

Search Agent

Generate reasoning thoughts, search queries, and final answers

Model or implementation: Qwen2.5 (1.5B to 14B) or LLaMA-3.2 (3B)

Retriever

Fetch relevant documents for generated queries

Model or implementation: E5-base-v2

Modeling

Base Model: Qwen2.5-Instruct series (1.5B, 3B, 7B, 14B) and LLaMA-3.2-3B (Base/Instruct)

Training Method: Search Agent Policy Optimization (SAPO), a modification of GRPO

Objective Functions:

Purpose: Optimize policy to maximize reward while staying close to old policy.

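Formally: Maximize E[min(r*A, clip(r)*A) - β * Conditional_KL], where r is the importance sampling ratio between the current and old policy, A is the advantage, and clip(r) restricts r to [1-ε, 1+ε]

Purpose: Conditional KL Penalty.

Formally: β * I(A > 0) * I(r < τ) * (r * log(r) - r + 1), where I(·) is the indicator function; this penalizes divergence only for positive-advantage tokens with low importance ratios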
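A minimal per-token sketch of the objective as written above, in plain Python. This is not the authors' code: the function names are hypothetical, `eps=0.2` is an assumed clip range, and the default `beta=0.1` assumes the coefficient β in the formula is the same quantity as the KL penalty coefficient listed under Key Hyperparameters. The `mask` argument reflects the note in Key Terms that retrieved tokens are excluded from training.

```python
import math

def sapo_token_objective(logp_new, logp_old, advantage,
                         eps=0.2, beta=0.1, tau=1.0):
    """Clipped surrogate minus the conditional penalty, for one token."""
    r = math.exp(logp_new - logp_old)            # importance sampling ratio
    clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
    surrogate = min(r * advantage, clipped_r * advantage)
    # The penalty is gated: positive advantage AND ratio below the threshold.
    gate = 1.0 if (advantage > 0 and r < tau) else 0.0
    penalty = gate * (r * math.log(r) - r + 1.0)
    return surrogate - beta * penalty

def sapo_loss(logps_new, logps_old, advantages, mask, **kwargs):
    """Negative mean objective over unmasked tokens (mask=0 for retrieved text)."""
    total, count = 0.0, 0
    for ln, lo, adv, keep in zip(logps_new, logps_old, advantages, mask):
        if keep:
            total += sapo_token_objective(ln, lo, adv, **kwargs)
            count += 1
    return -total / max(count, 1)

# A positive token whose ratio fell to 0.5 is penalized; a negative one is not.
print(sapo_token_objective(math.log(0.5), 0.0, advantage=1.0))
print(sapo_token_objective(math.log(0.5), 0.0, advantage=-1.0))
```

Setting `beta=0.0` reduces this to the plain clipped surrogate, so the penalty term is the only difference between the two objectives in this sketch.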

Training Data:

Composite dataset of Natural Questions (NQ) and HotpotQA for training

Key Hyperparameters:

KL_penalty_coefficient_gamma: 0.1
IS_ratio_threshold_tau: 1.0
max_search_turns: 5
rollout_responses: 5

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: Adds conditional KL penalty to GRPO loss to prevent model collapse
vs. PPO_clip [not cited in paper]: SAPO targets specific low-probability positive tokens rather than applying a symmetric clip everywhere
vs. TRPO [not cited in paper]: Uses a soft penalty term rather than a hard constraint optimization, making it computationally cheaper

Limitations

Relies on outcome-based rewards (F1 score), which may be sparse or noisy for complex reasoning
Only evaluated on QA benchmarks; applicability to other agent tasks (coding, web browsing) not tested
Threshold sensitivity: performance degrades if IS threshold deviates significantly from 1.0

Reproducibility

The text does not state whether code is available. Key hyperparameters (gamma=0.1, tau=1.0) are specified. The implementation is described as a 'one-line code modification' to the GRPO loss.

📊 Experiments & Results

Evaluation Setup

Single-hop and Multi-hop Question Answering with retrieval

Benchmarks:

Natural Questions (NQ) (Single-hop QA)
TriviaQA (Single-hop QA)
PopQA (Single-hop QA)
HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
Musique (Multi-hop QA)
Bamboogle (Multi-hop QA)

Metrics:

Exact Match (EM) accuracy
Statistical methodology: Not explicitly reported in the paper
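For reference, EM and the token-level F1 used as the reward can be sketched as follows. The normalization (lowercasing, stripping punctuation and English articles) follows a common open-domain QA convention and is an assumption here; the summary does not specify the exact procedure, and the function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower.", "eiffel tower"))
print(f1_score("Paris France", "Paris"))
```

F1 gives partial credit for overlapping answers, which makes it a denser training signal than EM, while EM remains the stricter evaluation metric.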

Key Results

Main comparison showing SAPO outperforms the Search-R1 baseline across all benchmarks using Qwen2.5-3B.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	EM	0.336	0.442	+0.106
HotpotQA	EM	0.415	0.457	+0.042
Bamboogle	EM	0.368	0.432	+0.064

Ablation studies validating the specific design of the conditional KL penalty.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	EM	0.388	0.429	+0.041
Average (7 datasets)	EM	0.398	0.429	+0.031

Experiment Figures

Scalability analysis (EM/F1 vs Model Size) and Sensitivity analysis (EM vs Threshold τ)

Main Takeaways

SAPO achieves consistent gains across model scales from 1.5B to 14B parameters.
The method is model-agnostic, improving both Qwen and LLaMA (Base and Instruct) families.
Multi-hop tasks benefit the most, likely because longer trajectories are more prone to ISDD, which SAPO corrects.
Threshold parameter τ=1.0 is critical; performance degrades if set significantly higher or lower.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO, GRPO)
Importance Sampling
KL Divergence
Agentic RAG / Tool use

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards within a group of outputs for the same input, eliminating the need for a critic model

ISDD: Importance Sampling Distribution Drift—a phenomenon where the current policy deviates so far from the old policy that importance sampling weights vanish, zeroing out gradients

SAPO: Search Agent Policy Optimization—the proposed method adding a conditional KL penalty to GRPO to fix ISDD

positive tokens: Tokens that belong to a trajectory with a positive advantage value (i.e., better than the group average)

hard clipping: The standard PPO mechanism that clips the importance ratio to [1-ε, 1+ε] to limit update size, which the authors argue is insufficient for ISDD

External retrieval tokens: Tokens representing the content returned by the search tool, which are masked during training so the agent isn't penalized for tool outputs

Exact Match (EM): Evaluation metric checking if the generated answer string exactly matches the ground truth

F1 score: Reward metric measuring overlap between prediction and ground truth, used here as the outcome-based reward signal
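To make the GRPO and "positive tokens" entries concrete, here is a small sketch of group-relative advantage estimation. The function name is hypothetical, and the use of the population standard deviation with a small `eps` is an assumption; implementations differ on these details. The group of five rewards mirrors the `rollout_responses: 5` setting.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward by the mean and std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]

# Five rollouts for one question; only the first earned a reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])
```

Every token in the first response inherits its positive advantage and so counts as a positive token, while tokens in the other four responses carry a negative advantage. No critic model is involved at any point.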