ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

📝 Paper Summary

Agentic RAG pipeline, RL-based search agents, Self-correction mechanisms

ReSeek trains search agents to dynamically self-correct during reasoning using a specialized JUDGE action and dense process rewards, validated on a new contamination-free benchmark called FictionalHot.

Core Problem

Existing search agents often commit early to erroneous reasoning paths because rewards are sparse, and they lack mechanisms to recover from unproductive search steps.

Why it matters:

Current RL-based agents rely on final answer correctness, which provides poor feedback for intermediate reasoning steps
Without self-correction, a single misleading query can cause a cascade of errors that the agent cannot reverse
High performance on public benchmarks may reflect data contamination (memorization) rather than genuine reasoning ability

Concrete Example: When asked for the birth date of the creator of 'Saddle Rash', standard RAG and Search-R1 retrieve documents about the show but fail to find the creator's name, and they cannot pivot. ReSeek retrieves the show info, JUDGEs it as useful but insufficient, then formulates a new query, 'Loren Bouchard birth date', and succeeds.

Key Novelty

ReSeek (Self-Correcting Search Agent)

Introduces a <judge> action that allows the agent to pause after retrieval, assess if the info is useful, and decide whether to keep it or discard it before the next step
Uses a dense process reward that decomposes into correctness (factuality) and utility (relevance to query), guiding the agent's step-by-step decision making
Creates FictionalHot, a benchmark of synthetic questions about fictional entities inserted into a closed-world corpus to strictly test reasoning without memorization

Architecture

Comparison of RAG, Search-R1, and ReSeek workflows. Highlights ReSeek's cyclic self-correction process via the Judge action.

Evaluation Highlights

Outperforms the SOTA baseline ZeroSearch by 3.1 points of average EM (0.377 vs 0.346, an absolute gain of 0.031) on 8 QA benchmarks using Qwen-2.5-7B-Instruct
Achieves significant gains on multi-hop tasks like HotpotQA and Bamboogle compared to baselines
Shows a minimal performance gap on FictionalHot between the 3B and 7B models (0.059 vs 0.061), suggesting that the benchmark isolates reasoning from parametric knowledge

Breakthrough Assessment

8/10

Strong contribution in both method (dynamic self-correction with dense rewards) and evaluation (addressing data contamination with FictionalHot). Effectively targets the brittleness of current search agents.

⚙️ Technical Details

Problem Definition

Setting: Multi-step Question Answering with external tool use (Search Engine)

Inputs: Natural language question x from dataset D

Outputs: Trajectory y containing actions, observations, judgments, and final answer

Pipeline Flow

Agent Policy (generates search query)
Tool Execution (Search Engine retrieves documents)
JUDGE Mechanism (evaluates retrieved info)
Context Manager (filters history based on judgment)
Agent Policy (re-plans or answers)
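
The flow above can be sketched as a small control loop. This is a minimal illustration, not the paper's implementation: the `Turn` record, the `run_episode` signature, and the toy policy, search, and judge functions are all hypothetical stand-ins for the LLM and retriever, and only `max_turns = 4` comes from the reported hyperparameters.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    query: str
    observation: str
    useful: bool  # outcome of the JUDGE step for this observation


Action = Tuple[str, str]  # ("search", query) or ("answer", text)


def run_episode(
    question: str,
    policy: Callable[[str, List[str], List[str]], Action],
    search: Callable[[str], str],
    judge: Callable[[str, str], bool],
    max_turns: int = 4,
) -> Tuple[Optional[str], List[Turn]]:
    """Search, judge, filter the context, then re-plan or answer."""
    turns: List[Turn] = []
    for _ in range(max_turns):
        # Context manager: only observations judged useful are shown again.
        kept = [t.observation for t in turns if t.useful]
        tried = [t.query for t in turns]
        kind, text = policy(question, kept, tried)
        if kind == "answer":
            return text, turns
        observation = search(text)
        turns.append(Turn(text, observation, judge(question, observation)))
    return None, turns  # turn budget exhausted without an answer


# Toy stand-ins (hypothetical) so the loop can be run end to end.
TOY_INDEX = {
    "show X": "Show X aired for one season.",
    "show X creator": "Show X was created by Person Y.",
}


def toy_search(query: str) -> str:
    return TOY_INDEX.get(query, "")


def toy_judge(question: str, observation: str) -> bool:
    return "created by" in observation


def toy_policy(question: str, kept: List[str], tried: List[str]) -> Action:
    if kept:
        return ("answer", kept[-1].split("created by ")[1].rstrip("."))
    untried = [q for q in TOY_INDEX if q not in tried]
    return ("search", untried[0])


answer, turns = run_episode("Who created show X?", toy_policy, toy_search, toy_judge)
```

In this toy run the first observation is judged unhelpful and dropped from the context, so the policy issues a second query before answering.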

System Modules

Agent Policy

Generates actions (search queries, judge calls, or final answers) based on current context

Model or implementation: Qwen-2.5-3B-Instruct or Qwen-2.5-7B-Instruct

Search Tool

Retrieves external information based on generated queries

Model or implementation: E5 Embeddings (retriever)

Judge Mechanism

Evaluates the utility of observation o_t

Model or implementation: Same LLM as Agent Policy (via <judge> token)

Novel Architectural Elements

Selective Context Assembly: The next action is conditioned on a dynamically assembled context where uninformative observations (judged negative) are filtered out, rather than the full history
Mandatory <judge> action cycle: Enforces an explicit self-assessment checkpoint after every information retrieval step via structured prompting
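
As a rough sketch of selective context assembly, the function below rebuilds the prompt from the step history and omits the text of observations judged negative. The `<judge>` tag is named in the summary; the `<search>` and `<information>` tags, the `Step` record, and the choice to keep the query of a rejected step are assumptions made for illustration.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    query: str
    observation: str
    judged_useful: bool


def assemble_context(question: str, steps: List[Step]) -> str:
    """Build the next-turn context, dropping observations judged negative."""
    parts = [f"Question: {question}"]
    for step in steps:
        # Assumption: the query is kept so the agent can avoid repeating it.
        parts.append(f"<search>{step.query}</search>")
        if step.judged_useful:
            parts.append(f"<information>{step.observation}</information>")
            parts.append("<judge>Yes</judge>")
        else:
            # The uninformative observation text itself is filtered out.
            parts.append("<judge>No</judge>")
    return "\n".join(parts)


history = [
    Step("show X", "Show X aired for one season.", False),
    Step("show X creator", "Show X was created by Person Y.", True),
]
context = assemble_context("Who created show X?", history)
```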

Modeling

Base Model: Qwen-2.5-7B-Instruct and Qwen-2.5-3B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)
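
For readers unfamiliar with GRPO, its core step is to score each sampled trajectory relative to the other trajectories sampled for the same question. The sketch below shows only that group-normalized advantage, using the standard mean and standard deviation normalization; it omits the clipped policy-ratio update and the KL term, and the function name and `eps` value are illustrative choices rather than details from the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled trajectories for one question, two of which were rewarded.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every trajectory in a group receives the same reward, all advantages are zero and that group contributes no learning signal.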

Objective Functions:

Purpose: Optimize policy against a reference using cumulative rewards.

Formally: Maximize E[R(x, y) - beta * KL(pi_theta || pi_ref)]

Purpose: Guide the self-correction process.

Formally: R_judge(t) = +0.5 if judgment matches ideal, -0.5 otherwise. Ideal judgment is determined by reranker score threshold (>0.7).
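
The judge reward above can be written as a short function. This is a minimal sketch under stated assumptions: the ±0.5 values and the 0.7 threshold come from the summary, while the function names and the boolean encoding of the judgment are hypothetical, and the reranker score is taken as a given number rather than computed.

```python
def ideal_judgment(reranker_score: float, threshold: float = 0.7) -> bool:
    """An observation counts as ideally 'useful' when the score exceeds the threshold."""
    return reranker_score > threshold


def judge_reward(agent_says_useful: bool, reranker_score: float,
                 threshold: float = 0.7) -> float:
    """+0.5 if the agent's judgment matches the ideal judgment, -0.5 otherwise."""
    matches = agent_says_useful == ideal_judgment(reranker_score, threshold)
    return 0.5 if matches else -0.5
```

Because the comparison is strict, a score of exactly 0.7 is treated as not useful in this sketch.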

Training Data:

Unified training set merging NQ and HotpotQA training splits

Key Hyperparameters:

beta: Not explicitly reported in the paper
retrieval_top_k: 3
max_turns: 4

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: ReSeek adds dynamic intra-episode self-correction (Judge) and dense process rewards vs. sparse outcome rewards
vs. RAG: ReSeek enables multi-step sequential decision making and error recovery vs. single-step static retrieval
vs. Backtracking Correction [not cited in paper]: ReSeek performs online/intra-episode correction via filtering history, whereas Backtracking often refines post-hoc or via complex state rollback

Limitations

Reliance on a specific reranker (BGE) for reward signals; performance drops when the reward is computed from a weaker signal (regex matching)
Computational overhead of the additional Judge step in the inference loop
Experiments limited to Qwen models; generalization to other LLM families not tested

Reproducibility

No code release is reported. Benchmark construction (FictionalHot) is described in detail (GPT-5 paraphrasing of seed questions, insertion into Wiki-18). The experimental setup uses standard public datasets (NQ, HotpotQA, etc.) and open models (Qwen).
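
To make the idea behind FictionalHot concrete, the sketch below swaps real entity names for fictional ones in a question and its supporting document, then appends the synthetic document to the retrieval corpus. It is only an illustration of the entity-replacement idea: plain string substitution stands in for the LLM paraphrasing step, and every name, function, and data item here is made up for the example.

```python
from typing import Dict, List, Tuple


def fictionalize(text: str, mapping: Dict[str, str]) -> str:
    """Replace each real entity name with its fictional counterpart."""
    # Longer names first, so a short name never clobbers part of a longer one.
    for real in sorted(mapping, key=len, reverse=True):
        text = text.replace(real, mapping[real])
    return text


def build_item(question: str, answer: str, support_docs: List[str],
               mapping: Dict[str, str], corpus: List[str]) -> Tuple[str, str]:
    """Rewrite one QA pair and insert its synthetic evidence into the corpus."""
    corpus.extend(fictionalize(doc, mapping) for doc in support_docs)
    return fictionalize(question, mapping), fictionalize(answer, mapping)


corpus: List[str] = ["An unrelated background document."]
mapping = {"Alice Real": "Zorvan Quill", "Springfield": "Vellmoor"}
new_q, new_a = build_item(
    "Where was Alice Real born?",
    "Springfield",
    ["Alice Real was born in Springfield."],
    mapping,
    corpus,
)
```

Since the fictional names do not appear in any pretraining data, a correct answer has to come from retrieving the inserted document.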

📊 Experiments & Results

Evaluation Setup

Open-domain QA with search over the 2018 Wikipedia corpus

Benchmarks:

FictionalHot (Multi-hop reasoning with synthetic entities) [New]
HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
Musique (Multi-hop QA)
Bamboogle (Multi-hop QA)
NQ (Single-hop QA)
TriviaQA (Single-hop QA)
PopQA (Single-hop QA)

Metrics:

Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
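
Exact Match is usually computed after a light normalization of both strings. The sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, collapsing whitespace); whether the paper uses exactly these normalization steps is an assumption, and the function names are illustrative.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))
```

Under this definition a prediction such as "Paris, France" scores 0 against the gold answer "Paris", which is why EM is a strict metric.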

Key Results

ReSeek consistently outperforms baselines across diverse QA benchmarks, with particularly strong results on multi-hop tasks.
Benchmark	Metric	Baseline	This Paper	Δ
Average (8 datasets)	EM	0.346	0.377	+0.031
Average (8 datasets)	EM	0.281	0.312	+0.031
FictionalHot	EM	0.408	0.061	-0.347

Ablation studies confirm the importance of the neural reranker for the reward signal and the benefits of increasing interaction turns.
Benchmark	Metric	Baseline	This Paper	Δ
Average	EM	0.301	0.312	+0.011

Experiment Figures

Performance trends across interaction turns (1-4) and sensitivity to different retrieval embeddings.

Breakdown of Judge action impact (Positive, Negative, Normal) across 12 benchmark settings.

Main Takeaways

ReSeek improves monotonically with more interaction turns (1 to 4), whereas baselines saturate after 2 turns, which supports the effectiveness of the self-correction loop.
The Judge mechanism has a high 'Positive' impact rate (40-50%), where it correctly filters irrelevant info or confirms useful info, vs. <25% 'Negative' impact.
FictionalHot reveals massive data contamination in standard benchmarks; models scoring ~40% on TriviaQA drop to ~0% on FictionalHot (Direct Inference), while ReSeek maintains non-trivial performance via reasoning.
Instruction-tuned backbones consistently outperform base models (+1.8 to +2.3 points) due to better adherence to structured prompts and tool-use conventions.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) for Language Models
Retrieval-Augmented Generation (RAG)
Process Reward Models (PRM)

Key Terms

GRPO: Group Relative Policy Optimization, an RL algorithm that scores each sampled response relative to the other responses sampled for the same prompt and optimizes the policy on those group-wise relative rewards

FictionalHot: A new benchmark created by replacing real entities in multi-hop questions with fictional ones and inserting synthetic documents into the corpus to test reasoning without data contamination

JUDGE action: A special agent action introduced in ReSeek that triggers a self-evaluation step to assess the utility of retrieved information

Process Reward: A dense reward signal given at intermediate steps of reasoning, rather than just at the final outcome

Reranker: A model component that scores the relevance of retrieved documents to the query; used here to calculate the utility reward

Closed-world evaluation: Testing where the agent can only use provided external knowledge sources (e.g., a fixed Wikipedia dump) and not its internal pre-trained knowledge

Exact Match (EM): A metric that counts a prediction as correct only if it matches the ground truth answer string exactly after normalization

SFT: Supervised Fine-Tuning, i.e. training the model on labeled demonstrations before applying RL