Open Deep Search: Democratizing Search with Open-source Reasoning Agents

📝 Paper Summary

Agentic RAG pipeline Search AI

Open Deep Search (ODS) augments open-source LLMs with a multi-stage search tool and reasoning agents (ReAct or CodeAct) to outperform proprietary search products such as Perplexity Sonar Reasoning Pro and GPT-4o Search Preview.

Core Problem

State-of-the-art Search AI solutions (Perplexity, GPT-4o Search) are closed-source, while existing open-source alternatives primarily pass raw search results to LLMs without sufficient reasoning or processing.

Why it matters:

Closed-source solutions limit transparency, innovation, and community development in Search AI
Proprietary models dominate benchmarks, creating a gap between accessible open-source tools and commercial performance
Simple retrieval-augmented generation often fails on complex queries requiring multi-step reasoning or precise calculations

Concrete Example: On a FRAMES benchmark question whose answer requires converting 112 inches to mm, Perplexity's Sonar Reasoning Pro fails (answering 2,858 mm). ODS correctly identifies 112 inches and uses the Wolfram Alpha tool to calculate the conversion: 112 × 25.4 = 2,844.8 mm, reported as 2,845 mm.
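The arithmetic in this example is easy to verify. Below is a minimal sketch that relies only on the definition 1 inch = 25.4 mm. The function name is illustrative and does not come from the paper, and ODS itself delegates this step to Wolfram Alpha rather than local code.

```python
MM_PER_INCH = 25.4  # exact by definition of the international inch


def inches_to_mm(inches: float) -> float:
    """Convert a length in inches to millimetres."""
    return inches * MM_PER_INCH


if __name__ == "__main__":
    # 112 in is about 2844.8 mm, which rounds to the 2,845 mm quoted above
    print(f"{inches_to_mm(112):.1f} mm")
```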

Key Novelty

Open Deep Search (ODS) Framework

Combines an 'Open Search Tool' (which rephrases queries, scrapes, and reranks content) with 'Open Reasoning Agents' (ReAct or CodeAct) that orchestrate tool usage
Integrates Chain-of-Thought Self-Consistency and dynamic few-shot prompting to enhance the reasoning reliability of open-source base models like DeepSeek-R1
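Self-consistency can be illustrated with a simple majority vote over sampled answers. This is a sketch under stated assumptions: `sample` is a hypothetical callable standing in for one stochastic LLM call, and exact-match voting after light normalization is one common way to pick the most consistent answer, not necessarily the aggregation rule ODS uses.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(
    sample: Callable[[str], str], question: str, n: int = 5
) -> str:
    """Sample n answers and return the most frequent one.

    `sample` is a hypothetical stand-in for a temperature > 0 LLM call
    that returns only the final answer of one reasoning path.
    """
    # Normalize lightly so "Paris" and "paris " count as the same vote
    answers = [sample(question).strip().lower() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In practice the normalization step matters: free-form answers rarely match character for character, so a real system would likely need a more tolerant equivalence check.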

Evaluation Highlights

+9.7 percentage points of accuracy on the FRAMES benchmark with ODS-v2+DeepSeek-R1 (75.3%) over GPT-4o Search Preview (65.6%)
88.3% accuracy on SimpleQA with ODS-v2+DeepSeek-R1, surpassing Perplexity Sonar Reasoning Pro (82.2%)
ODS-v1+DeepSeek-R1 achieves 69.8% on FRAMES, outperforming Perplexity Sonar Reasoning Pro (64.5%)

Breakthrough Assessment

8/10

Demonstrates that open-source agents using open models can surpass leading proprietary search products (GPT-4o, Perplexity) on difficult benchmarks, democratizing high-end search AI capabilities.

⚙️ Technical Details

Problem Definition

Setting: Search Engine-augmented Large Language Models (Search AI) for complex query answering

Inputs: User query q

Outputs: Final answer a, derived from real-time web information and reasoning

Pipeline Flow

Input: User Query
Agent: Open Reasoning Agent (ReAct or CodeAct) interprets query
Tool: Open Search Tool (Rephrasing -> Retrieval -> Augmentation)
Tool: Wolfram Alpha (optional for math)
Output: Final Answer
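The search-tool portion of this flow can be sketched as a composition of three stages. Everything named here (`Passage`, `open_search`, and the three injected callables) is hypothetical and chosen for illustration. The real ODS code may structure these steps differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    url: str
    text: str


def open_search(
    query: str,
    rephrase: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Passage]],
    rerank: Callable[[str, List[Passage]], List[Passage]],
    top_k: int = 3,
) -> str:
    """Rephrase -> retrieve -> augment, returning context text for the LLM."""
    queries = [query] + rephrase(query)
    seen, passages = set(), []
    for q in queries:
        for p in retrieve(q):
            if p.url not in seen:  # de-duplicate across rephrased queries
                seen.add(p.url)
                passages.append(p)
    # Rank against the ORIGINAL query so rephrasings cannot drift the intent
    best = rerank(query, passages)[:top_k]
    return "\n\n".join(f"[{p.url}]\n{p.text}" for p in best)
```

Passing the stages in as callables keeps the sketch testable without network access and mirrors the plug-and-play framing of the summary.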

System Modules

Query Rephraser (Open Search Tool)

Generates k rephrased queries to bridge the gap between broad user intent and search engine keywords

Model or implementation: Base LLM (e.g., Llama3.1-70B or DeepSeek-R1)
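One way to picture the rephraser is as a single prompt whose output is parsed into k lines. This is only a sketch: the prompt wording, the `llm` callable, and the parsing rules are assumptions, not the prompts used in ODS.

```python
from typing import Callable, List


def rephrase_query(llm: Callable[[str], str], query: str, k: int = 3) -> List[str]:
    """Ask an LLM for k alternative search queries and parse one per line.

    `llm` is a hypothetical text-in, text-out callable wrapping the base model.
    """
    prompt = (
        f"Rewrite the search query below in {k} different ways, one per line, "
        f"using keywords a search engine would match.\n\nQuery: {query}"
    )
    unique: List[str] = []
    for line in llm(prompt).splitlines():
        candidate = line.strip(" -*\t")  # drop bullet markers and padding
        # Skip blanks, echoes of the original query, and duplicates
        if candidate and candidate.lower() != query.lower() and candidate not in unique:
            unique.append(candidate)
    return unique[:k]
```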

Retriever (Open Search Tool)

Fetches search results using external APIs

Model or implementation: Serper.dev API

Augmenter (Open Search Tool)

Scrapes webpages, embeds chunks, and reranks passages to filter relevance

Model or implementation: Not explicitly specified (likely an embedding model + reranker)
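Since the embedding model and reranker are not specified, the sketch below substitutes a toy bag-of-words vector and cosine similarity to show the chunk, embed, rerank shape of the step. The function names and the word-count "embedding" are assumptions made for illustration. A real system would use a learned embedding model.

```python
import math
import re
from collections import Counter
from typing import List


def chunk(text: str, size: int = 40) -> List[str]:
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rerank(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]
```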

Open Reasoning Agent

Orchestrates reasoning, selects tools (Search, Wolfram Alpha, Continue Thinking), and generates answers

Model or implementation: Base LLM (e.g., DeepSeek-R1)
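The agent's control flow can be sketched as a generic ReAct-style loop. The `policy` callable is a hypothetical stand-in for the base LLM, and the tool names, trace format, and step budget are assumptions rather than details from the paper.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]


def react_loop(
    policy: Callable[[str, List[str]], Tuple[str, str, str]],
    tools: Dict[str, Tool],
    query: str,
    max_steps: int = 5,
) -> str:
    """Interleave thought, action, and observation until the policy finishes.

    `policy` receives the query and the trace so far and returns
    (thought, action_name, action_input). The action "finish" ends the loop
    and its input is returned as the final answer.
    """
    trace: List[str] = []
    for _ in range(max_steps):
        thought, action, arg = policy(query, trace)
        trace.append(f"Thought: {thought}")
        if action == "finish":
            return arg
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        trace.append(f"Action: {action}[{arg}]")
        trace.append(f"Observation: {observation}")
    # Placeholder for the no-answer case; a real system needs a fallback here
    return "no answer within step budget"
```

The final return value is only a placeholder. As noted under Limitations, ODS falls back to CoT-SC when the ReAct agent cannot produce an answer.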

Novel Architectural Elements

Open Search Tool pipeline: Explicit multistage process (Rephrase -> Retrieve -> Scrape/Chunk/Rerank) rather than raw SERP injection
Integration of dynamic few-shot learning for ReAct prompts using vector similarity matching
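Dynamic few-shot selection amounts to retrieving the stored demonstrations most similar to the incoming query. The sketch below uses a toy word-count vector in place of a real embedding model. All names and the prompt layout are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, str]  # (question, worked ReAct demonstration)


def _vec(text: str) -> Counter:
    # Toy vector: word counts (stand-in for a learned embedding)
    return Counter(text.lower().split())


def _cos(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_few_shot(query: str, examples: List[Example], k: int = 2) -> List[Example]:
    """Return the k examples whose questions are most similar to the query."""
    q = _vec(query)
    return sorted(examples, key=lambda ex: _cos(q, _vec(ex[0])), reverse=True)[:k]


def build_prompt(query: str, examples: List[Example], k: int = 2) -> str:
    """Prepend the selected demonstrations to the new question."""
    shots = [f"Question: {q}\n{demo}" for q, demo in select_few_shot(query, examples, k)]
    return "\n\n".join(shots + [f"Question: {query}"])
```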

Modeling

Base Model: Llama3.1-70B or DeepSeek-R1

Comparison to Prior Work

vs. OpenPerplex/Perplexica: ODS employs an 'Open Search Tool' that rephrases, scrapes, chunks, and reranks content rather than just using raw SERP snippets
vs. Perplexity Sonar: ODS allows plug-and-play with any open LLM and exposes the reasoning/tool usage process transparently
vs. Self-Ask [not cited in paper]: ODS uses a full ReAct/CodeAct agent loop rather than a rigid self-ask decomposition structure

Limitations

Dependency on external APIs (Serper.dev, Wolfram Alpha) for core functionality
ReAct agent may fail to provide an answer if insufficient information is available (defaults to CoT-SC backup)
Latency implications of the multi-stage search (rephrasing + scraping + reranking) are not explicitly analyzed

Reproducibility

Code: https://github.com/sentient-agi/OpenDeepSearch

The code is publicly available at the repository above. The paper details the use of Serper.dev for search and Wolfram Alpha for math. Prompt templates (200 ReAct prompts) were designed via a community campaign, and the paper points to its Appendix B for examples.

📊 Experiments & Results

Evaluation Setup

Evaluation on standard QA and reasoning benchmarks using open-source models augmented with ODS

Benchmarks:

SimpleQA (Factuality and short-answer QA)
FRAMES (Multi-hop reasoning and information retrieval)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

ODS variants (v1/v2) combined with DeepSeek-R1 consistently outperform or match proprietary state-of-the-art models.

Benchmark	Metric	Baseline	This Paper	Δ
FRAMES	Accuracy	65.6	75.3	+9.7
SimpleQA	Accuracy	82.2	88.3	+6.1
SimpleQA	Accuracy	82.4	88.3	+5.9
FRAMES	Accuracy	30.1	75.3	+45.2
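The Δ column is an absolute difference in accuracy points, not a relative percentage. The short sketch below recomputes it from the Baseline and This Paper columns and also shows the relative gain for comparison. The row values are copied from the table, and the function names are illustrative.

```python
from typing import List, Tuple

# (benchmark, baseline accuracy, ODS accuracy), copied from the table above
ROWS: List[Tuple[str, float, float]] = [
    ("FRAMES", 65.6, 75.3),
    ("SimpleQA", 82.2, 88.3),
    ("SimpleQA", 82.4, 88.3),
    ("FRAMES", 30.1, 75.3),
]


def absolute_delta(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * (ours - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, base, ours in ROWS:
        print(name, absolute_delta(base, ours), relative_gain(base, ours))
```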

Main Takeaways

ODS-v2 (CodeAct) outperforms ODS-v1 (ReAct), achieving higher accuracy on both SimpleQA and FRAMES.
Combining ODS with strong reasoning models like DeepSeek-R1 yields performance exceeding current proprietary leaders (GPT-4o Search, Perplexity Sonar).
The sophisticated Open Search Tool (rephrasing/scraping) provides significant context quality improvements over raw SERP injection used in prior open-source tools.

📚 Prerequisite Knowledge

Prerequisites

Retrieval Augmented Generation (RAG)
Agentic frameworks (ReAct, CodeAct)
Chain-of-Thought (CoT) prompting

Key Terms

ReAct: Reasoning and Action—a framework where LLMs generate reasoning traces and task-specific actions in an interleaved manner

CodeAct: An agent framework that uses executable code (e.g., Python) as the action space instead of JSON or text-based actions

SERP: Search Engine Result Page—the raw list of results returned by a search engine API

CoT-SC: Chain-of-Thought Self-Consistency—sampling multiple reasoning paths and selecting the most consistent answer

FreshPrompt: A prompting technique that includes metadata (title, URL, date) in the search result context to help the LLM judge relevance

SimpleQA: A benchmark dataset designed to evaluate the factuality of language models

FRAMES: A benchmark dataset designed to evaluate multi-hop reasoning and retrieval capabilities
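To make the FreshPrompt entry above concrete, the sketch below formats search results with their title, URL, and date before they are placed in the LLM context. The `SearchResult` fields and the layout are assumptions for illustration. They do not reproduce the exact template used by FreshPrompt or ODS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    title: str
    url: str
    date: str  # publication date as text; empty string if unknown
    snippet: str


def format_context(results: List[SearchResult]) -> str:
    """Render results with metadata so the model can weigh source and recency."""
    blocks = []
    for i, r in enumerate(results, start=1):
        blocks.append(
            f"[{i}] Title: {r.title}\n"
            f"URL: {r.url}\n"
            f"Date: {r.date or 'unknown'}\n"
            f"Snippet: {r.snippet}"
        )
    return "\n\n".join(blocks)
```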