O²-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

📝 Paper Summary

Agentic RAG pipeline, RL-based

O2-Searcher is a reinforcement learning-based agent that decouples reasoning from knowledge by interacting with a local search environment, optimizing for both open-ended exploration and closed-ended accuracy.

Core Problem

LLMs rely on static parametric knowledge that quickly becomes obsolete, and they handle open-ended questions poorly: such questions require extensive, multi-turn exploration and admit more than one valid answer.

Why it matters:

Current search agents primarily address closed-ended problems with clear objectives, neglecting open-ended tasks that require comprehensive, multi-aspect responses.
Relying on static model weights leads to hallucinations and factual inaccuracies when dealing with dynamic, real-time, or specialized information.
Using live web search for training is slow and costly, hindering the large-scale reinforcement learning needed to teach agents complex search behaviors.

Concrete Example: For an open-ended question like 'What are the impacts of the metaverse on education?', a standard model might give a generic, outdated summary. O2-Searcher actively queries 'metaverse applications in education', 'virtual classrooms pros cons', and 'future trends', synthesizing diverse findings into a structured report.

Key Novelty

RL-based Search Agent with Dual-Mode Optimization (O2-Searcher)

Decouples internal reasoning from external knowledge by training the agent to master search interaction skills (finding, understanding, applying) rather than memorizing facts.
Uses a unified training mechanism with distinct reward functions for closed-ended (accuracy) and open-ended (diversity, format, factuality) questions to handle both types flexibly.
Introduces a locally simulated search environment that caches web pages and Wikipedia, enabling rapid, low-cost reinforcement learning interactions compared to live API calls.

Architecture

The inference workflow of O2-Searcher involves iterative thought, search action generation, local retrieval, and answer synthesis.

Evaluation Highlights

Significantly outperforms SOTA agents (Search-R1, Perplexity-Pro) on the newly constructed O2-QA benchmark for open-ended questions using only a 3B model.
Achieves SOTA results on closed-ended benchmarks (NQ, HotpotQA) among similarly-sized models, matching performance of larger 7B models.
Demonstrates effective generalization by maintaining high performance on both deterministic fact-seeking and exploratory open-ended tasks.

Breakthrough Assessment

8/10

Strong contribution in addressing the underexplored area of open-ended search agents via RL. The efficient local environment and specialized reward design for open-endedness are significant practical advancements.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering handling both open-ended and closed-ended queries via sequential interaction with a search environment.

Inputs: Natural language query q_0

Outputs: Final predicted answer a_pred (either exact answer or structured key findings)

Pipeline Flow

Input Query -> [Internal Knowledge Assessment] -> [Action Decision (Search vs. Answer)]
If Search: [Query Generation] -> [Local Environment Retrieval] -> [Information Condensation] -> [Update Knowledge State] -> Loop
If Answer: [Final Answer Generation]
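The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `CACHE` contents, the `KnowledgeState` class, the `scripted_policy` function, and the first-sentence `condense` heuristic are all hypothetical stand-ins (the summary describes a learned 3B policy, a MeiliSearch or dense-retriever backend, and an LLM-based condensation step).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical toy cache standing in for the locally cached pages (made-up text).
CACHE: Dict[str, str] = {
    "metaverse applications in education": "Virtual labs let students run simulated experiments. More text.",
    "virtual classrooms pros cons": "Immersion helps engagement; hardware cost is a barrier. More text.",
}

@dataclass
class KnowledgeState:
    question: str
    searches: List[str] = field(default_factory=list)   # queries issued so far
    learnings: List[str] = field(default_factory=list)  # condensed findings so far

Action = Tuple[str, str]  # ("search", query) or ("answer", text)

def local_retrieve(query: str) -> List[str]:
    """Exact-key lookup in the toy cache; a real backend would rank documents."""
    passage = CACHE.get(query)
    return [passage] if passage is not None else []

def condense(passages: List[str]) -> List[str]:
    """Stand-in for the condensation module: keep each passage's first sentence."""
    return [p.split(".")[0].strip() + "." for p in passages]

def run_episode(question: str,
                policy: Callable[[KnowledgeState], Action],
                max_turns: int = 4) -> str:
    state = KnowledgeState(question)
    for _ in range(max_turns):
        kind, payload = policy(state)
        if kind == "answer":
            return payload
        # Search branch: retrieve, condense, update the knowledge state, loop.
        state.searches.append(payload)
        state.learnings.extend(condense(local_retrieve(payload)))
    # Turn budget exhausted: answer from whatever has been gathered.
    return " ".join(state.learnings)

def scripted_policy(state: KnowledgeState) -> Action:
    """Hypothetical fixed policy: issue each cached query once, then answer."""
    planned = list(CACHE)
    if len(state.searches) < len(planned):
        return "search", planned[len(state.searches)]
    return "answer", " ".join(state.learnings)

report = run_episode("What are the impacts of the metaverse on education?", scripted_policy)
print(report)
```

In the system described here, the scripted policy would be replaced by the trained agent, which decides at every turn whether its current knowledge state is sufficient to answer.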

System Modules

Agent (Actor)

Decides whether to search or answer, generates queries, and synthesizes findings.

Model or implementation: 3B-base LLM (initialized via Cold Start)

Local Search Environment

Simulates web search by retrieving relevant cached pages.

Model or implementation: MeiliSearch (for open-ended) / E5-based Dense Retriever (for closed-ended)

Condensation Module

Compresses retrieved raw content into structured learnings to fit context window.

Model or implementation: Commercial LLM (prompt-based)

Novel Architectural Elements

Locally simulated search environment combining heterogeneous sources (Web cache for open-ended, Wikipedia for closed-ended) to enable efficient RL exploration.
Unified agent policy handling distinct answer formats (exact vs. key findings) through type-aware reward functions.

Modeling

Base Model: 3B parameter model (specific architecture not named in this summary)

Training Method: Group Relative Policy Optimization (GRPO)
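As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes each sampled output's reward against the mean and standard deviation of its own group, so no critic model is needed. The function name, the use of the population standard deviation, and the `eps` constant are assumptions of this sketch; the summary does not report the exact normalization or the group size used.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each output relative to its sampling group.

    rewards: scalar rewards for G outputs sampled for the same query.
    Outputs above the group mean get positive advantages, the rest negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one query: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every output in a group receives the same reward, all advantages are zero and that group contributes no policy gradient signal.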

Objective Functions:

Purpose: Optimize policy relative to a group of outputs without a critic model.
Formally: Maximize the advantage of outputs with higher rewards within a group sampled from the old policy.

Purpose: Encourage diverse query generation.
Formally: Diversity reward r_{o,div} based on cosine dissimilarity of query embeddings.

Purpose: Ensure correct formatting.
Formally: Format reward r_{o,fm} penalizing incorrect tags or list structures.

Purpose: Ensure factual accuracy for open-ended answers.
Formally: Factual reward r_{o,f1} using Hungarian matching between generated and ground-truth items based on embedding similarity.
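One plausible reading of the diversity reward is the mean pairwise cosine dissimilarity between the embeddings of the queries issued in a trajectory. The sketch below implements that reading only; the averaging scheme, the function names, and the toy 2-D vectors are assumptions, and the paper's r_{o,div} may include further terms.

```python
import math
from itertools import combinations
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def diversity_reward(query_embeddings: List[Sequence[float]]) -> float:
    """Mean pairwise (1 - cosine similarity) over all query pairs.

    Returns 0.0 when fewer than two queries were issued.
    """
    pairs = list(combinations(query_embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Toy 2-D embeddings: repeated queries score 0, orthogonal queries score 1.
print(diversity_reward([[1.0, 0.0], [1.0, 0.0]]))
print(diversity_reward([[1.0, 0.0], [0.0, 1.0]]))
```

Under this reading, an agent that reissues near-identical queries earns little diversity reward, which pushes it toward covering different aspects of an open-ended question.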

Adaptation: Full fine-tuning (implied via RL on base model)

Trainable Parameters: Full model parameters (implied)

Training Data:

Cold start data: Curated CoT trajectories (open-ended via prompt-engineered agent, closed-ended via Search-R1 on NQ/HotpotQA subsets).
RL training data: Mixed open-ended and closed-ended queries.

Key Hyperparameters:

group_size_G: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
beta_KL: Not explicitly reported in the paper

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Search-R1: O2-Searcher explicitly targets and optimizes for *open-ended* questions with diverse/long-form answers, whereas Search-R1 focuses on deterministic answers.
vs. Perplexity-Pro: O2-Searcher uses a significantly smaller model (3B) and local RL training to achieve comparable/superior performance on specific benchmarks.
vs. WebGPT [not cited in paper]: WebGPT uses human feedback (RLHF) for browsing; O2-Searcher uses automated reward functions based on ground-truth matching and formatting constraints.

Limitations

Dependency on the quality of the local cache; the agent cannot access the live web outside the pre-cached corpus during training.
Reward functions for open-ended generation rely on embedding similarity to 'ground truth' findings, which may not capture all valid novel insights.
Evaluation on open-ended tasks is inherently difficult; O2-QA is manually curated but limited in size (300 questions).

Reproducibility

Code: https://github.com/Acade-Mate/O2-Searcher

The O2-QA benchmark (300 questions + 30k cached pages) is also released. Specific hyperparameters such as learning rate and batch size are not detailed in the text provided.

📊 Experiments & Results

Evaluation Setup

Mixed evaluation on open-ended (O2-QA) and closed-ended (NQ, HotpotQA) benchmarks using a local search environment.

Benchmarks:

O2-QA (Open-domain Open-ended QA) [New]
Natural Questions (NQ) (Closed-ended QA)
HotpotQA (Multi-hop QA)

Metrics:

Accuracy (for closed-ended)
F1 score (for open-ended key findings match)
Format compliance
Diversity of findings

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
O2-QA	Performance score (qualitative claim)	Not reported in the paper	Not reported in the paper	Not reported in the paper
Closed-ended QA benchmarks	SOTA status	Not reported in the paper	Not reported in the paper	Not reported in the paper

Row notes:
O2-QA: O2-Searcher performance on the newly constructed O2-QA benchmark against SOTA agents.
Closed-ended QA benchmarks: Performance on standard closed-ended benchmarks, showing competitiveness with larger models.

Main Takeaways

O2-Searcher effectively bridges the gap between closed-ended and open-ended search tasks using a unified RL framework.
The 3B model, when trained with specialized rewards and a local environment, can outperform or match larger models on specific retrieval-intensive tasks.
Decoupling knowledge (via search) from reasoning allows the model to handle dynamic open-domain information better than static parameters alone.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically PPO/GRPO)
Retrieval-Augmented Generation (RAG)
Language Model Prompting (Chain-of-Thought)

Key Terms

GRPO: Group Relative Policy Optimization—an RL algorithm that optimizes policies by comparing a group of outputs generated from the same input, reducing the need for a separate value function.

O2-QA: A benchmark constructed by the authors containing 300 manually curated open-ended questions with associated web page caches.

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.

Cold Start: An initial supervised fine-tuning phase using high-quality demonstration data to stabilize the model before reinforcement learning begins.

MeiliSearch: An open-source search engine used here to index and retrieve cached web content efficiently.

SOTA: State-of-the-Art—the current best performance achievable by existing methods.

Dense Retriever: A retrieval system that uses vector embeddings (rather than keyword matching) to find relevant documents.

Hungarian Algorithm: An optimization algorithm used here to match generated findings to ground truth findings for calculating factual rewards in open-ended tasks.
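To make the matching-based factual reward concrete, the sketch below pairs generated findings with ground-truth findings one-to-one so that total embedding similarity is maximal, then computes F1 over the pairs that clear a similarity threshold. It finds the optimal assignment by brute force over permutations, which is only practical for short lists; the Hungarian algorithm yields the same assignment in polynomial time. The `threshold` value, the function names, and the toy vectors are assumptions, not details from the paper.

```python
import math
from itertools import permutations
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def matched_f1(pred: List[Sequence[float]],
               gold: List[Sequence[float]],
               threshold: float = 0.8) -> float:
    """F1 over an optimal one-to-one matching of predicted and gold findings."""
    if not pred or not gold:
        return 0.0
    # Assign every item of the shorter list to a distinct item of the longer one.
    small, large = (pred, gold) if len(pred) <= len(gold) else (gold, pred)
    sim = [[cosine_similarity(s, l) for l in large] for s in small]
    best_total, best_hits = -math.inf, 0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(sim[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total = total
            # Only sufficiently similar pairs count as true positives.
            best_hits = sum(1 for i, j in enumerate(perm) if sim[i][j] >= threshold)
    if best_hits == 0:
        return 0.0
    precision = best_hits / len(pred)
    recall = best_hits / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two predicted findings match two of three gold findings exactly.
pred = [[1.0, 0.0], [0.0, 1.0]]
gold = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(matched_f1(pred, gold))
```

In this toy case precision is 1 and recall is 2/3, so the reward rises as the agent covers more of the reference findings without adding unmatched ones.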