Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty

📝 Paper Summary

Agentic RAG pipeline
Reinforcement Learning for RAG

Beta-GRPO improves Agentic RAG efficiency by modifying the reinforcement learning reward to only incentivize search actions when the model is confident in its query generation, thereby aligning search behavior with knowledge boundaries.

Core Problem

Current Agentic RAG systems suffer from sub-optimal behaviors: 'over-search' (retrieving information the model already knows) and 'under-search' (hallucinating instead of retrieving necessary information).

Why it matters:

Inefficient searching wastes computational resources and increases latency without improving answer quality
Failing to search when necessary leads to factual errors and hallucinations, degrading system reliability
Existing RL methods reward final correctness but do not explicitly penalize the model for being unsure about its own knowledge boundaries

Concrete Example: For the simple question 'Who was the first president?', a baseline model unnecessarily initiates a search (over-search). Conversely, for an obscure question like 'In what Country is Sul America Esporte Clube in?', the baseline hallucinates an answer (under-search), whereas the proposed model correctly identifies the knowledge gap and searches.

Key Novelty

Beta-GRPO (Confidence-Aware Group Relative Policy Optimization)

Uses the minimum token probability within a generated search query as a proxy for the model's 'confidence' in that search action
Modifies the RL reward function to grant rewards only if the model is both correct AND its search confidence exceeds a threshold (beta)
Forces the agent to learn to search only when it can formulate a high-certainty query, effectively calibrating its knowledge boundaries

Architecture

Analysis pipeline for identifying over-search and under-search behaviors in agentic trajectories

Evaluation Highlights

Achieves 50.55% Average Exact Match (EM) across 7 QA benchmarks, outperforming the base Search-R1-GRPO (48.62%) and R1-Searcher (45.31%)
Reduces under-search rate by ~7.3 percentage points (from 42.04% to 34.71%) compared to the Search-R1-GRPO baseline
Reduces over-search rate by ~1.2 percentage points (from 21.10% to 19.89%), indicating somewhat more efficient use of internal knowledge

Breakthrough Assessment

7/10

Identifies and quantifies a critical efficiency problem in Agentic RAG (over/under-search) and provides a simple, effective RL-based solution. The gains are consistent, though the method is a modification of existing GRPO rather than a new architecture.

⚙️ Technical Details

Problem Definition

Setting: Multi-step Question Answering where an agent can choose to reason internally or retrieve external information

Inputs: Natural language question q

Outputs: Final answer a_f

Pipeline Flow

Policy Model Inference -> Action Selection -> Environment Interaction

System Modules

Policy Model

Generate reasoning steps and decide whether to search or answer

Model or implementation: Qwen2.5-3B (initialized from Search-R1)

Retriever

Fetch documents given a search query

Model or implementation: E5 (dense retriever)
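The interaction between the two modules can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `<search>`, `<information>`, and `<answer>` tag names, the `run_agent` function, the `max_turns` limit, and the toy policy and retriever are all assumptions introduced here for demonstration.

```python
import re
from typing import Callable, List, Optional

# Hypothetical tag conventions for search actions and final answers.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_agent(
    question: str,
    policy: Callable[[str], str],
    retriever: Callable[[str], List[str]],
    max_turns: int = 4,
) -> Optional[str]:
    """Alternate between policy inference and retrieval until an answer appears."""
    context = question
    for _ in range(max_turns):
        step = policy(context)  # policy model inference
        answer = ANSWER_RE.search(step)
        if answer:  # action: answer
            return answer.group(1).strip()
        query = SEARCH_RE.search(step)
        if query is None:  # action: reason internally, no retrieval
            context += "\n" + step
            continue
        docs = retriever(query.group(1).strip())  # environment interaction
        context += "\n" + step + "\n<information>" + " ".join(docs) + "</information>"
    return None  # no answer within the turn budget


# Toy stand-ins for the policy model and the dense retriever.
def toy_policy(context: str) -> str:
    if "<information>" in context:
        return "<answer>stub answer</answer>"
    return "<search>stub query</search>"


def toy_retriever(query: str) -> List[str]:
    return [f"stub passage for: {query}"]


print(run_agent("stub question", toy_policy, toy_retriever))
```

In a real system the policy would be the language model and the retriever a dense index lookup; the stubs only show the control flow.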

Novel Architectural Elements

Reward function modification: Rewards are binary (0 or 1) but conditioned on search token confidence exceeding threshold beta (beta-GRPO)

Modeling

Base Model: Qwen2.5-3B

Training Method: beta-GRPO (Group Relative Policy Optimization with confidence threshold)

Objective Functions:

Purpose: Optimize policy to maximize expected reward based on correctness and search confidence.

Formally: Reward R(T) = 1 if (Answer is Correct AND Confidence C(T) > beta), else 0.

Purpose: Measure confidence of a search trajectory.

Formally: C(T) = min(P(w)) where w are tokens in the search query q_t.
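The two definitions above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the input is assumed to be per-token log-probabilities for a single search query, and the handling of trajectories with no search or with several searches is not specified in this summary and is not covered here.

```python
import math
from typing import Sequence


def search_confidence(token_logprobs: Sequence[float]) -> float:
    """C(T): minimum token probability over the tokens of one search query."""
    if not token_logprobs:
        raise ValueError("search query has no tokens")
    return min(math.exp(lp) for lp in token_logprobs)


def beta_reward(is_correct: bool, confidence: float, beta: float = 0.4) -> float:
    """R(T) = 1 only if the answer is correct AND C(T) > beta, else 0."""
    return 1.0 if (is_correct and confidence > beta) else 0.0


# Example: a query whose least likely token has probability 0.5.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
conf = search_confidence(logprobs)
print(beta_reward(True, conf))   # correct and confident
print(beta_reward(True, 0.3))    # correct but below the threshold
```

Taking the minimum rather than the mean makes the score sensitive to a single uncertain token, so one low-probability word in the query is enough to withhold the reward.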

Training Data:

Mixture of NQ (Natural Questions) and HotpotQA training sets

Key Hyperparameters:

beta: 0.4
learning_rate: 1e-6
batch_size: 512
steps: 200
group_size: 5
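To illustrate how the group size enters training, the sketch below computes group-relative advantages from the binary rewards of one group of sampled trajectories. It follows the usual GRPO formulation (reward minus group mean, divided by group standard deviation); the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made here, and the authors' implementation may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 5 trajectories for the same question (group_size = 5).
# Under the confidence-gated reward, a correct but low-confidence
# trajectory receives 0, just like an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))
```

If every trajectory in a group receives the same reward, all advantages are zero and that group contributes no learning signal.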

Compute: Two A100 GPUs

Comparison to Prior Work

vs. Search-R1: Search-R1 rewards any correct trajectory; beta-GRPO only rewards correct trajectories where the model was 'confident' in its search query formulation
vs. RAG (Standard): beta-GRPO is agentic (dynamic steps) rather than fixed retrieval
vs. Self-RAG [not cited in paper]: Self-RAG uses special tokens for reflection trained via supervised learning; beta-GRPO uses RL with token probabilities as implicit confidence signals

Limitations

Sub-optimal search behaviors (over/under-search) are reduced but not eliminated (still ~20% and ~35% respectively)
Experiments limited to relatively small models (3B parameters) due to compute constraints
Relies on token probability as a proxy for semantic confidence, which may not always be calibrated
Analysis focused on Wikipedia-based QA; application to open-ended 'deep research' tasks remains future work

Reproducibility

Code: https://github.com/mianzhang/Search-R1

Code is publicly available on GitHub. Training uses open datasets (NQ, HotpotQA) and open models (Qwen2.5). The retriever configuration (E5, Wikipedia dump) is standard but must be built separately before training or evaluation.

📊 Experiments & Results

Evaluation Setup

Multi-hop and General Question Answering with agentic retrieval

Benchmarks:

HotpotQA (Multi-hop QA)
2WikiMultiHopQA (Multi-hop QA)
Bamboogle (Multi-hop QA)
MuSiQue (Multi-hop QA)
NQ (General QA)
TriviaQA (General QA)
PopQA (General QA)

Metrics:

Exact Match (EM)
Over-search Rate
Under-search Rate
Statistical methodology: Not explicitly reported in the paper
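As a reference point for the Exact Match metric, the sketch below uses the normalization commonly applied in open-domain QA evaluation (lowercasing, stripping punctuation and English articles, collapsing whitespace). Whether the paper uses exactly this normalization is an assumption; the function names are hypothetical.

```python
import re
import string
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def em_score(predictions: Sequence[str], references: Sequence[List[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return 100.0 * hits / len(predictions)


print(em_score(["Paris", "Rome"], [["paris"], ["Milan"]]))
```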

Key Results

Main comparison shows beta-GRPO outperforming standard GRPO and other baselines across average EM scores.
Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	Exact Match (EM)	48.62	50.55	+1.93
HotpotQA	Exact Match (EM)	51.35	54.12	+2.77
Bamboogle	Exact Match (EM)	46.12	49.80	+3.68

Efficiency analysis measuring reductions in sub-optimal search behaviors.
Benchmark	Metric	Baseline	This Paper	Δ
Multi-hop Datasets	Over-search Rate	21.10	19.89	-1.21
Multi-hop Datasets	Under-search Rate	42.04	34.71	-7.33
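The Δ column reports absolute differences in percentage points. The short script below recomputes those differences from the reported values and also derives the corresponding relative changes, which the summary does not state; the helper name is hypothetical and the relative figures are simple arithmetic on the table, not results from the paper.

```python
from typing import Tuple

# (baseline, this paper) values copied from the table above.
results = {
    "Average EM": (48.62, 50.55),
    "Over-search rate": (21.10, 19.89),
    "Under-search rate": (42.04, 34.71),
}


def point_and_relative_change(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    delta = ours - baseline
    return delta, 100.0 * delta / baseline


for name, (baseline, ours) in results.items():
    delta, rel = point_and_relative_change(baseline, ours)
    print(f"{name}: {delta:+.2f} points, {rel:+.1f}% relative")
```

Read this way, the under-search reduction is roughly 17% relative to the baseline rate, while the over-search reduction is under 6%.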

Experiment Figures

Training reward curves for Search-R1-GRPO vs. Search-R1-beta-GRPO

Main Takeaways

Incorporating confidence thresholds into RL rewards improves overall accuracy by aligning search decisions with model certainty
Reducing 'under-search' (hallucination) contributes more significantly to performance gains than reducing 'over-search'
A beta threshold of 0.4 was empirically found to be optimal, balancing caution and exploration
Stable reward curves during training suggest that confidence-based rewards provide a cleaner signal than standard correctness-only rewards

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically Policy Gradient methods)
Retrieval-Augmented Generation (RAG)
Language Model probability/logits

Key Terms

Agentic RAG: A system where an LLM autonomously decides when and what to retrieve using tools, rather than following a fixed retrieve-then-generate pipeline

GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that normalizes rewards within a group of outputs sampled for the same input, using the group statistics as a baseline to reduce variance

Over-search: When an agent retrieves information for a query it could have answered correctly using only its internal parametric knowledge

Under-search: When an agent fails to retrieve information for a query and subsequently answers incorrectly (hallucinates)

beta-GRPO: The authors' proposed variant of GRPO that incorporates a confidence threshold beta into the reward function

Exact Match (EM): A metric that measures the percentage of predictions that match the ground truth answer exactly
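The over-search and under-search definitions above can be turned into rates once each step is labeled. The sketch below is one possible reading, with loudly stated assumptions: the `Step` record, its fields, and especially the choice of denominators (search steps for over-search, non-search steps for under-search) are hypothetical and may not match how the paper computes its reported rates.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Step:
    searched: bool                   # did the agent issue a search at this step?
    answerable_without_search: bool  # hypothetical label: internal knowledge suffices
    correct: bool                    # was the resulting answer correct?


def over_search_rate(steps: Sequence[Step]) -> float:
    """Assumed definition: share of search steps that were unnecessary."""
    searched = [s for s in steps if s.searched]
    if not searched:
        return 0.0
    return 100.0 * sum(s.answerable_without_search for s in searched) / len(searched)


def under_search_rate(steps: Sequence[Step]) -> float:
    """Assumed definition: share of non-search steps that ended in a wrong answer."""
    skipped = [s for s in steps if not s.searched]
    if not skipped:
        return 0.0
    return 100.0 * sum(not s.correct for s in skipped) / len(skipped)


steps = [
    Step(searched=True, answerable_without_search=True, correct=True),
    Step(searched=True, answerable_without_search=False, correct=True),
    Step(searched=False, answerable_without_search=False, correct=False),
    Step(searched=False, answerable_without_search=True, correct=True),
]
print(over_search_rate(steps), under_search_rate(steps))
```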