Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs

📝 Paper Summary

Agentic RAG pipeline

AutoRefine trains LLMs via reinforcement learning to autonomously search, explicitly refine noisy documents into key facts, and reason over that refined knowledge to answer complex questions.

Core Problem

Existing retrieval-augmented reasoning methods often reason directly over raw, noisy documents, getting distracted by irrelevant details, and lack direct supervision for improving the retrieval process itself.

Why it matters:

Distractions in early reasoning steps can derail the entire chain in multi-hop scenarios (e.g., confusing two people with similar names).
Outcome-only rewards (final answer correctness) provide insufficient signal for the model to learn *how* to search effectively or filter information.
LLMs struggle with out-of-scope questions requiring precise, up-to-date factual details without explicit refinement.

Concrete Example: When asked 'Who is the subject of the painting The Umbrellas?', a standard model might retrieve a long document about the painting's style and get distracted. AutoRefine instead extracts 'The Umbrellas... depicts... Pierre-Auguste Renoir's' in an explicit refinement step, discarding the noise before it answers correctly.

Key Novelty

Search-and-Refine-during-Think Paradigm

Introduces an explicit '<refine>' step between search and reasoning, forcing the model to distill key facts from noisy documents before using them.
Uses Group Relative Policy Optimization (GRPO) with a dual reward system: one for the final answer and a specific 'retrieval reward' that validates the quality of the refined text.
Allows the model to autonomously determine when to search, refine, and answer, rather than following a fixed chain.

Architecture

The AutoRefine framework training and inference process.

Evaluation Highlights

6.9 percentage points higher average accuracy than the leading baselines (Search-R1, Search-o1) across seven QA benchmarks.
8.3 percentage points higher accuracy on the 2WikiMultihopQA benchmark than the strongest baseline.
Search quality (retrieving documents containing the answer) surpasses baselines by 10-15% on multi-hop tasks.

Breakthrough Assessment

8/10

Significant performance gains on complex multi-hop tasks by successfully integrating an explicit refinement step into the RL reasoning loop. Effectively addresses the noise problem in RAG.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Reasoning where an agent interacts with a search engine to generate a reasoning trajectory

Inputs: Question q and access to external search engine E

Outputs: Reasoning trajectory o ending in final answer o_ans

Pipeline Flow

Policy Generation (Think/Search/Refine)
Environment Interaction (Search Engine)
Reward Computation (Answer + Refine Quality)
Optimization (GRPO)
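
The first two stages of this flow can be illustrated with a minimal rollout loop. This is a sketch under stated assumptions, not the authors' implementation: the `generate` and `search` callables, the `max_turns` cap, and every tag other than `<refine>` (that is, `<think>`, `<search>`, `<documents>`, `<answer>`) are hypothetical names chosen for illustration, and the toy policy and toy document are made up.

```python
import re
from typing import Callable, List


def rollout(question: str,
            generate: Callable[[str], str],
            search: Callable[[str], List[str]],
            max_turns: int = 4) -> str:
    """Alternate policy generation and search until an answer is emitted."""
    trajectory = f"Question: {question}\n"
    for _ in range(max_turns):
        step = generate(trajectory)      # policy emits think/search/refine/answer text
        trajectory += step
        if "<answer>" in step:           # the policy decided to stop
            break
        query = re.search(r"<search>(.*?)</search>", step, re.DOTALL)
        if query:                        # environment interaction: call the search engine
            docs = search(query.group(1).strip())
            trajectory += "<documents>" + "\n".join(docs) + "</documents>\n"
    return trajectory


# Toy stand-ins for the actor LLM and the retriever (hypothetical).
def toy_generate(context: str) -> str:
    if "<documents>" not in context:
        return "<think>I need to look this up.</think><search>The Umbrellas painting</search>\n"
    return ("<refine>The Umbrellas is a painting by Pierre-Auguste Renoir.</refine>"
            "<answer>Pierre-Auguste Renoir</answer>\n")


def toy_search(query: str) -> List[str]:
    return ["The Umbrellas is an oil painting by Pierre-Auguste Renoir. It shows a busy street in the rain."]


trajectory = rollout("Who painted The Umbrellas?", toy_generate, toy_search)
print(trajectory)
```

Because the loop only reacts to what the policy emits, the number of searches is decided by the policy rather than by a fixed chain, up to the assumed `max_turns` limit.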

System Modules

Actor LLM

Generates the trajectory including thoughts, search queries, refinement blocks, and final answers

Model or implementation: Qwen2.5-3B-Base or Qwen2.5-3B-Instruct

Search Engine

Returns external documents based on model queries

Model or implementation: E5-base-v2 (Retriever) on Wikipedia dump

Reward Model

Calculates rewards for RL updates

Model or implementation: Rule-based functions

Novel Architectural Elements

Explicit '<refine>' action token/state in the reasoning Markov Decision Process (MDP) specifically for distilling retrieval results.
Hybrid reward structure providing dense supervision on the intermediate '<refine>' block content in addition to sparse final answer rewards.
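
To supervise the refinement content, a rule-based reward first has to locate the refinement blocks in a trajectory. A minimal sketch of that extraction follows; the closing tag `</refine>` and the helper name `extract_refinements` are assumptions made for illustration, since the summary only names the opening `<refine>` marker.

```python
import re
from typing import List

# Assumed block syntax: <refine> ... </refine>
REFINE_PATTERN = re.compile(r"<refine>(.*?)</refine>", re.DOTALL)


def extract_refinements(trajectory: str) -> List[str]:
    """Return the text of every refinement block, in order of appearance."""
    return [block.strip() for block in REFINE_PATTERN.findall(trajectory)]


example = (
    "<think>Search first.</think>"
    "<refine> The Umbrellas is a painting by Pierre-Auguste Renoir. </refine>"
    "<think>One more check.</think>"
    "<refine>Renoir was a French painter.</refine>"
)
refinements = extract_refinements(example)
```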

Modeling

Base Model: Qwen2.5-3B (Base and Instruct variants) and Qwen2.5-7B

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Maximize expected reward relative to group average.

Formally: E[min(ratio * A, clip(ratio, 1 - epsilon, 1 + epsilon) * A) - beta * D_KL]

Purpose: Provide feedback on final answer correctness.

Formally: R_Ans = F1(o_ans, a)

Purpose: Provide feedback on intermediate information extraction.

Formally: R_Ret = 1 if ground_truth in o_refine else 0

Purpose: Combine rewards.

Formally: R_overall = R_Ans + 0.1 * R_Ret * (1 - R_Ans)
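
The three reward formulas can be written as small rule-based functions. The sketch below follows the formulas as stated in this summary; the SQuAD-style text normalization (lowercasing, stripping punctuation and articles) and all function names are assumptions, since the summary does not say how strings are compared.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def answer_reward(prediction: str, gold: str) -> float:
    """R_Ans: token-level F1 between the final answer and the ground truth."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def retrieval_reward(refinements: List[str], gold: str) -> float:
    """R_Ret: 1 if the ground truth appears in any refinement block, else 0."""
    target = normalize(gold)
    if not target:  # guard: an empty target would match everything
        return 0.0
    return 1.0 if any(target in normalize(r) for r in refinements) else 0.0


def overall_reward(r_ans: float, r_ret: float) -> float:
    """R_overall = R_Ans + 0.1 * R_Ret * (1 - R_Ans), as given above."""
    return r_ans + 0.1 * r_ret * (1.0 - r_ans)
```

Under this combination, a trajectory with a wrong answer but a refinement that contains the ground truth receives 0.1, while a fully correct answer receives 1.0 whether or not the retrieval reward fires.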

Adaptation: Full model update (implied, no LoRA mentioned)

Trainable Parameters: Full parameters of Qwen2.5-3B/7B

Training Data:

Combined training set from Natural Questions (NQ) and HotpotQA

Key Hyperparameters:

group_size: Not reported in the paper
clipping_ratio_epsilon: Not reported in the paper
kl_coefficient_beta: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1: AutoRefine adds explicit refinement steps and retrieval-specific rewards, whereas Search-R1 relies only on outcome rewards.
vs. Search-o1: AutoRefine uses RL post-training to learn search/refine policies, rather than just prompting or inference-time logic.
vs. RAT [not cited in paper]: RAT (Retrieval Augmented Thoughts) uses CoT with retrieval but typically lacks the specific RL optimization on the refinement step itself.

Limitations

Requires ground truth answers to compute the retrieval reward (checking if answer is in refinement).
Training relies on a combined set of NQ and HotpotQA; generalization to other domains not fully explored.
Computational cost of RL training is likely higher than SFT due to trajectory sampling.

Reproducibility

Code: https://github.com/syr-cn/AutoRefine

The knowledge source is the December 2018 Wikipedia dump, with E5-base-v2 as the retriever. Hyperparameters such as learning rate and batch size are not explicitly listed in the main text.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using Wikipedia as external knowledge.

Benchmarks:

Natural Questions (NQ) (Single-hop QA)
TriviaQA (Single-hop QA)
PopQA (Single-hop QA)
HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
Musique (Multi-hop QA)
Bamboogle (Multi-hop QA)

Metrics:

Exact Match (EM)
F1 score
Cover Exact Match (CEM)

Statistical methodology: Not explicitly reported in the paper

Key Results

AutoRefine consistently outperforms baselines on overall average accuracy across 7 benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	Accuracy	0.455	0.524	+0.069
Average (7 datasets)	Accuracy	0.548	0.608	+0.060

Performance gains are most significant on multi-hop benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
2WikiMultihopQA	Accuracy	0.395	0.478	+0.083
Musique	Accuracy	0.169	0.214	+0.045

Ablation studies confirm the necessity of both retrieval rewards and the refinement step.

Benchmark	Metric	Baseline	This Paper	Δ
Average (7 datasets)	Accuracy	0.505	0.524	+0.019
Average (7 datasets)	Accuracy	0.455	0.524	+0.069
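
The Δ column reports absolute differences in accuracy, not relative gains. As a quick arithmetic check, the sketch below recomputes each distinct row from the numbers listed above and also derives the relative improvement; the helper name `deltas` and the `ROWS` layout are illustrative only.

```python
from typing import List, Tuple

# (benchmark, baseline, this paper, reported delta), copied from the rows above
ROWS: List[Tuple[str, float, float, float]] = [
    ("Average (7 datasets)", 0.455, 0.524, 0.069),
    ("Average (7 datasets)", 0.548, 0.608, 0.060),
    ("2WikiMultihopQA", 0.395, 0.478, 0.083),
    ("Musique", 0.169, 0.214, 0.045),
    ("Average (7 datasets)", 0.505, 0.524, 0.019),
]


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute difference rounded to 3 places, relative improvement)."""
    absolute = round(ours - baseline, 3)
    relative = (ours - baseline) / baseline
    return absolute, relative


for name, baseline, ours, reported in ROWS:
    absolute, relative = deltas(baseline, ours)
    print(f"{name}: absolute {absolute:+.3f} (reported {reported:+.3f}), relative {relative:+.1%}")
```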

Experiment Figures

Analysis of search frequency and search quality (success rate) over training steps.

Success rates of different actions and robustness to retrieval depth k.

Main Takeaways

AutoRefine adapts search frequency dynamically: ~1.2 searches for single-hop vs ~2.5 for multi-hop questions.
The refinement step significantly compresses context (100-200 tokens vs >600 tokens for raw docs) while retaining answer-relevant information.
Robustness to retrieval noise: AutoRefine maintains performance gains even as the number of retrieved documents increases (k=1 to k=7), unlike baselines that saturate or degrade.
External summarizers (like BART) cannot replace the RL-trained refinement step; AutoRefine learns to plan and identify missing info, not just summarize.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (RL) for LLMs
Retrieval-Augmented Generation (RAG)
Proximal Policy Optimization (PPO) variants

Key Terms

GRPO: Group Relative Policy Optimization, an RL algorithm that normalizes advantages within a group of sampled outputs for the same input, removing the need for a separate value-network critic.
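
The group normalization can be sketched in a few lines: each sampled output's reward is centered on the group mean and scaled by the group standard deviation. This is a minimal illustration, assuming the population standard deviation and a small `eps` for numerical stability; neither detail is stated in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards of outputs sampled for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question, scored by the overall reward.
advantages = group_relative_advantages([1.0, 0.0, 0.1, 0.0])
```

Outputs that score above the group average receive a positive advantage and the rest a negative one, so no learned value function is needed to provide a baseline. When every output in a group gets the same reward, all advantages are zero and the group contributes no gradient signal.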

Refinement Step: A specific action where the model extracts and summarizes relevant information from retrieved documents into a concise format before reasoning.

Retrieval-Specific Reward: A reward signal given if the ground-truth answer strings appear within the model's refinement block, encouraging accurate information extraction.

Outcome-Based Reward: A reward signal based on the F1 score overlap between the final generated answer and the ground truth.

Search-during-think: A paradigm where models generate 'thought' tokens that can include calls to external search tools.

Multi-hop QA: Question answering tasks that require finding and connecting multiple pieces of evidence (e.g., distinct facts) to derive the final answer.

SFT: Supervised Fine-Tuning, which trains the model on labeled examples, typically before RL is applied.

Cover Exact Match: A metric measuring whether the generated text (document, refinement, or answer) contains the ground truth answer string.
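
The difference between Cover Exact Match and plain Exact Match is easiest to see side by side. The sketch below is an illustration only: the normalization rules and both function names are assumptions, not the paper's evaluation code.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """EM: the normalized prediction equals one of the gold answers."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}


def cover_exact_match(text: str, gold_answers: Iterable[str]) -> bool:
    """CEM: some gold answer appears as a substring of the normalized text."""
    haystack = normalize(text)
    needles = [normalize(g) for g in gold_answers]
    return any(n and n in haystack for n in needles)


gold = ["Pierre-Auguste Renoir"]
long_text = "The painting is by Pierre-Auguste Renoir, I think."
```

Here `cover_exact_match(long_text, gold)` holds while `exact_match(long_text, gold)` does not, which is why CEM can be applied to long spans such as retrieved documents or refinement blocks and not only to short final answers.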