ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

📝 Paper Summary

Agentic RAG pipeline, Reasoning-based Retrieval

ReSearch trains LLMs to autonomously interleave reasoning and search actions using reinforcement learning without any supervised labels for the reasoning or search steps.

Core Problem

Current multi-step RAG approaches rely on labor-intensive, unscalable manual prompts or heuristics, and collecting supervised labels for complex reasoning-search chains is impractical.

Why it matters:

Real-world questions are often complex and require multiple retrieval steps, which static RAG pipelines fail to address effectively.
Manually designing prompts for every complex scenario is not scalable.
Existing RL approaches focus on internal reasoning (like DeepSeek-R1) but have not fully explored integrating external knowledge retrieval into the reinforcement loop.

Concrete Example: When asked a multi-hop question, a standard model might search once and hallucinate the missing connections. In the paper's case study, ReSearch issues a search, notes 'I made a mistake' when the results turn out to be irrelevant, reflects, and generates a corrected query that leads to the answer.

Key Novelty

Reinforcement Learning for Interleaved Reasoning and Search (ReSearch)

Treats search operations as integral actions within a chain-of-thought reasoning process, optimized purely via RL (GRPO) based on final answer correctness.
Eliminates the need for supervised 'gold' reasoning chains or search queries; the model self-learns when to search and how to use results through trial and error.
Introduces a specialized rollout format where the model generates <search> tags, pauses for external retrieval, and resumes reasoning with <result> context.
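To make the rollout format concrete, here is a minimal sketch that splits a finished rollout into plain reasoning text, search queries, and retrieved results. It assumes only the <search> and <result> tags named above; the function name split_rollout and the segment labels are hypothetical, and the paper's actual template may contain further tags.

```python
import re
from typing import List, Tuple

# Matches <search>...</search> or <result>...</result> blocks (tags from the summary).
TAG_RE = re.compile(r"<(search|result)>(.*?)</\1>", re.DOTALL)

def split_rollout(text: str) -> List[Tuple[str, str]]:
    """Split a rollout into (kind, content) pairs; kind is 'text', 'search', or 'result'."""
    segments = []
    pos = 0
    for m in TAG_RE.finditer(text):
        plain = text[pos:m.start()].strip()
        if plain:
            segments.append(("text", plain))       # reasoning between tagged blocks
        segments.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))            # reasoning after the last block
    return segments

example = (
    "I should look this up. <search> capital of France </search> "
    "<result> Paris is the capital. </result> So the answer is Paris."
)
segments = split_rollout(example)
```

A splitter like this is mainly useful for inspecting rollouts, for example counting how many searches a trajectory used.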

Architecture

Illustration of the ReSearch framework and the GRPO training process.

Evaluation Highlights

Outperforms best baselines by +15.81% (Exact Match) on average across benchmarks using Qwen2.5-7B.
Achieves strong generalization: trained only on MuSiQue but shows consistent gains on HotpotQA, 2WikiMultiHopQA, and Bamboogle.
Surpasses prompt-based methods like IRCoT and Iter-RetGen by margins ranging from 8.9% to 22.4%.

Breakthrough Assessment

8/10

Strong evidence that RL alone can induce sophisticated search behaviors (including self-correction) without supervised process data. Bridges the gap between DeepSeek-R1 style reasoning and RAG.

⚙️ Technical Details

Problem Definition

Setting: Multi-hop Question Answering with external retrieval environment

Inputs: Natural language question x

Outputs: Answer y containing interleaved reasoning steps, search queries, and final answer

Pipeline Flow

Input Question -> Policy Model Generation (Reasoning/Query) -> Search Environment (Retrieval) -> Policy Model Generation (Reasoning/Answer)
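The flow above can be sketched as a loop that alternates between the policy and the search environment. This is an illustrative sketch, not the authors' implementation: generate and retrieve are stand-in callables, max_searches is a hypothetical safety cap, and the scripted toy policy exists only so the example runs end to end.

```python
from typing import Callable, List

def rollout(question: str,
            generate: Callable[[str], str],
            retrieve: Callable[[str], List[str]],
            max_searches: int = 4) -> str:
    """Alternate generation and retrieval until the policy stops asking to search."""
    context = question
    searches = 0
    while True:
        chunk = generate(context)              # policy continues from the current context
        context += chunk
        wants_search = chunk.rstrip().endswith("</search>")
        if not wants_search or searches >= max_searches:
            return context
        # Extract the query from the most recent <search>...</search> block.
        query = chunk.rsplit("<search>", 1)[-1].rsplit("</search>", 1)[0].strip()
        passages = retrieve(query)
        # Results are concatenated to the context before generation resumes.
        context += "\n<result>\n" + "\n".join(passages) + "\n</result>\n"
        searches += 1

def toy_generate(context: str) -> str:
    # Scripted stand-in for the policy model: search once, then answer.
    if "<result>" not in context:
        return "\nI need to look this up. <search> capital of France </search>"
    return "Final answer: Paris"

def toy_retrieve(query: str) -> List[str]:
    # Stand-in for the retrieval environment.
    return ["Paris is the capital of France."]

trace = rollout("What is the capital of France?", toy_generate, toy_retrieve)
```

In a real setup, generation would stop at the closing search tag via a stop sequence, and the loop would run batched across many rollouts.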

System Modules

Policy Model

Generates text-based thinking, decides when to search, generates search queries, and produces the final answer.

Model or implementation: Qwen2.5-7B/32B (Base and Instruct variants)

Retrieval Environment

Executes search queries generated by the policy model and returns text results.

Model or implementation: E5-base-v2 (Retriever) + Wikipedia Corpus
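As a rough illustration of what a dense retrieval environment does, the sketch below ranks passages by cosine similarity to a query vector and returns the top k. The vectors are hand-made toy values rather than E5-base-v2 embeddings, and the helper names cosine and top_k are hypothetical; a real system would embed text with the retriever model and use an approximate nearest-neighbor index.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float],
          passages: List[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[str]:
    """Return the texts of the k passages most similar to the query vector."""
    ranked = sorted(passages, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy corpus: (passage text, made-up 3-dimensional embedding).
corpus = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The Nile is a river in Africa.", [0.0, 0.2, 0.9]),
    ("France borders Spain.", [0.7, 0.6, 0.1]),
]
hits = top_k([1.0, 0.2, 0.0], corpus, k=2)
```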

Novel Architectural Elements

Integration of search actions (<search>) directly into the RL rollout chain, where the model halts generation, calls an external search, and resumes with results concatenated to the context.

Modeling

Base Model: Qwen2.5-7B and Qwen2.5-32B (both Base and Instruct versions)

Training Method: Group Relative Policy Optimization (GRPO)
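A minimal sketch of the two ingredients that distinguish GRPO may help here: advantages normalized within a group of rollouts for the same question, and a PPO-style clipped surrogate term. The use of the population standard deviation, the stabilizer eps, and the clipping range clip_eps = 0.2 are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                      # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Four rollouts for one question; two answered correctly.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate critic model is needed; rollouts that beat their siblings get positive advantages and the rest get negative ones.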

Objective Functions:

Purpose: Optimize policy to maximize reward while staying close to reference model.

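Formally: Maximize E[min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)] - beta * KL(pi || pi_ref), where ratio is the new-to-old policy probability ratio, A is the group-normalized advantage, eps is the clipping range, and beta weights the KL penalty.

Purpose: Reward function.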

Formally: Reward = Answer_Reward (F1 score with ground truth) + Format_Reward (adherence to tag structure)
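The reward described above can be sketched as a token-level F1 plus a small format bonus. This follows the additive form stated in this summary; the whitespace tokenization, the tag-balance check in format_ok, and the bonus value 0.1 are assumptions for illustration, and the paper's exact rules may differ.

```python
from collections import Counter

def token_f1(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted answer and the ground truth."""
    pred = prediction.lower().split()
    gold = truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def format_ok(rollout: str) -> bool:
    """Hypothetical format check: every <search>/<result> tag has a matching close."""
    return all(rollout.count(f"<{tag}>") == rollout.count(f"</{tag}>")
               for tag in ("search", "result"))

def reward(rollout: str, prediction: str, truth: str, format_bonus: float = 0.1) -> float:
    """Answer reward (F1) plus a format reward; format_bonus is an assumed value."""
    return token_f1(prediction, truth) + (format_bonus if format_ok(rollout) else 0.0)

good = reward("<search> q </search><result> r </result>", "Paris", "Paris")
bad = reward("<search> q", "London", "Paris")
```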

Adaptation: Full model update (implied, as method is GRPO on base model)

Training Data:

Training set of MuSiQue dataset (19,938 samples)
No supervised reasoning chains used; only question-answer pairs.

Key Hyperparameters:

epochs: 2
retrieval_top_k: 5

Compute: Not reported in the paper

Comparison to Prior Work

vs. DeepSeek-R1: ReSearch integrates external actions into the RL chain, whereas DeepSeek-R1 focuses on pure internal reasoning.
vs. Iter-RetGen: ReSearch learns its reasoning-and-search policy from RL reward feedback, while Iter-RetGen relies on fixed prompt heuristics.
vs. IRCoT: ReSearch uses RL optimization rather than in-context learning/prompting.

Limitations

Training relies solely on the MuSiQue dataset, though generalization is shown.
Reward signal is sparse (final answer only), which might be inefficient for extremely long chains.
Computational cost of RL training with search rollouts is likely high (implied by method, though not quantified).

Reproducibility

Prompt templates provided in paper. Built on 'verl' and 'FlashRAG' libraries. Data uses standard benchmarks (MuSiQue, HotpotQA, etc.). Code URL not provided.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using Wikipedia (Dec 2018 dump) as the knowledge base.

Benchmarks:

HotpotQA (Multi-hop QA)
2WikiMultiHopQA (Multi-hop QA)
MuSiQue (Multi-hop QA)
Bamboogle (2-hop QA, search-engine resistant)

Metrics:

Exact Match (EM)
LLM-as-a-judge (LJ) score (using GPT-4o-mini)
Statistical methodology: Not explicitly reported in the paper
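For reference, the Exact Match metric listed above is commonly computed after light normalization of both strings. The sketch below uses the widely used SQuAD-style normalization (lowercasing, stripping punctuation and English articles); whether the paper applies exactly this normalization is an assumption.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, truth: str) -> bool:
    """True when the normalized prediction equals the normalized ground truth."""
    return normalize(prediction) == normalize(truth)

def em_score(predictions, truths) -> float:
    """Percentage of predictions that exactly match their ground truth."""
    hits = sum(exact_match(p, t) for p, t in zip(predictions, truths))
    return 100.0 * hits / len(truths)

score = em_score(["The Eiffel Tower.", "Paris, France"], ["eiffel tower", "Paris"])
```

EM is strict: "Paris, France" does not match "Paris", which is one reason the evaluation also reports an LLM-as-a-judge score.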

Key Results

Main comparison results for Qwen2.5-7B models across multiple benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Exact Match (EM)	41.69	57.77	+16.08
2WikiMultiHopQA	Exact Match (EM)	39.69	56.40	+16.71
MuSiQue	Exact Match (EM)	26.31	48.66	+22.35
Bamboogle	Exact Match (EM)	49.60	63.20	+13.60

Main comparison results for Qwen2.5-32B models.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	Exact Match (EM)	54.60	69.10	+14.50

Experiment Figures

Trends of response length and number of search operations during training.

Training and Validation reward curves.

Main Takeaways

ReSearch significantly outperforms naive RAG, Iter-RetGen, and IRCoT across all tested benchmarks and model sizes.
The model demonstrates strong generalization; despite being trained only on MuSiQue, it achieves state-of-the-art results on HotpotQA, 2WikiMultiHopQA, and Bamboogle.
Instruction-tuned models are consistently better starting points for RL training than base models.
The model naturally evolves to use more search operations and longer reasoning chains during training without explicit supervision on these behaviors.
Qualitative analysis reveals emergent self-correction behaviors, where the model recognizes poor search results and reformulates queries autonomously.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically GRPO)
Retrieval-Augmented Generation (RAG)
Chain-of-Thought (CoT) reasoning

Key Terms

RAG: Retrieval-Augmented Generation—systems that retrieve external documents to ground LLM answers.

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs for the same input, removing the need for a separate critic model.

ReSearch: The proposed framework: Learning to Reason with Search.

Rollout: A single complete generation sequence produced by the model during RL training, including thinking, search queries, and results.

Exact Match: A metric checking if the generated answer string exactly matches the ground truth string.

LLM-as-a-judge: Using a strong LLM (like GPT-4) to evaluate the correctness of an answer, often used when answers are open-ended.

MuSiQue: A multi-hop QA dataset requiring complex reasoning chains to answer.

HotpotQA: A dataset with questions requiring reasoning over multiple supporting documents.

2WikiMultiHopQA: A multi-hop QA dataset constructed from Wikipedia.

Bamboogle: A manually constructed dataset of 2-hop questions designed to be difficult for search engines.

IRCoT: Interleaving Retrieval and Chain-of-Thought—a prompt-based baseline method.