Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools

📝 Paper Summary

Agentic RAG pipeline, Memory organization, Tool-use post-training

Agentic Reasoning enhances LLM problem-solving by dynamically integrating web search, code execution, and a structured Mind-Map memory into the reasoning chain to handle complex, knowledge-intensive tasks.

Core Problem

Current reasoning models excel in structured domains like math/code but struggle with open-ended, knowledge-intensive tasks requiring extensive research and maintaining coherence over long reasoning chains.

Why it matters:

Applying math/code-style reasoning to social sciences or experiential fields often produces flawed or overly rigid results
Open-source models lag behind proprietary systems (like OpenAI Deep Research) in deep research capabilities due to lack of effective external tool integration
LLMs frequently lose track of context or hallucinate when attempting long reasoning sequences without structured memory

Concrete Example: When asked a riddle about family relationships involving a surgeon ('The surgeon... says I can't operate on this child, he's my son!'), DeepSeek-R1 fails after 17 seconds due to bias. Agentic Reasoning uses a Mind-Map to explicitly graph the entities [surgeon], [boy], and [father], correctly identifying the relationship.

Key Novelty

Agentic Reasoning Framework with Mind-Map Memory

Integrates three specific agents (Web-Search, Code, Mind-Map) directly into the reasoning loop via special tokens, allowing the model to pause, query tools, and reintegrate results
Introduces a 'Mind-Map' agent that constructs a dynamic knowledge graph from the reasoning context, allowing the model to query its own past thoughts and maintain coherence over long chains
Optimizes the Web-Search agent by combining query breakdown, reranking, and Mind-Map context, finding this superior to standard RAG or knowledge refinement alone

Architecture

The Agentic Reasoning workflow, in which the LLM halts generation to invoke external tools (Search, Code, Mind-Map) and then reintegrates their results.

Evaluation Highlights

Achieves 23.8% accuracy on Humanity's Last Exam, a 14.4-percentage-point improvement over the raw model, narrowing the gap with OpenAI Deep Research to 2.8 percentage points
Surpasses o3-mini-high on GPQA Diamond benchmark with 66.8% accuracy (vs 64.1%)
Establishes a new state-of-the-art on GAIA benchmark among public methods, outperforming OpenAI Deep Research on Level 1 and Level 2 tasks

Breakthrough Assessment

9/10

Significantly narrows the gap between open-source and proprietary 'Deep Research' models. The Mind-Map concept for maintaining reasoning coherence is a strong architectural contribution.

⚙️ Technical Details

Problem Definition

Setting: Complex problem solving and deep research requiring multi-step reasoning and external information

Inputs: Natural language query (potentially expert-level or open-ended)

Outputs: Final reasoned answer or comprehensive research report

Pipeline Flow

Reasoning LLM (DeepSeek-R1) generates thought chain
Token Detection (pauses if tool token found)
Tool Execution (Web-Search, Code, or Mind-Map)
Result Integration (updates context)
Resume Reasoning
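
The five steps above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the tag names (`<search>`, `<code>`, `<mindmap>`), the `<result>` wrapper, the `generate` callable, and the `max_tool_calls` cap are all hypothetical stand-ins for whatever special tokens and model interface the framework actually uses.

```python
import re
from typing import Callable, Dict

# Hypothetical tool-call markup; the framework's real special tokens may differ.
TOOL_CALL = re.compile(r"<(search|code|mindmap)>(.*?)</\1>", re.DOTALL)


def run_agentic_loop(
    query: str,
    generate: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_tool_calls: int = 8,
) -> str:
    """Alternate between model generation and tool execution.

    `generate` stands in for the reasoning LLM: given the context so far, it
    returns the next chunk of text and is assumed to stop right after
    emitting a tool call, if it emits one.
    """
    context = query
    for _ in range(max_tool_calls):
        chunk = generate(context)            # 1. reasoning LLM extends the chain
        context += chunk
        match = TOOL_CALL.search(chunk)      # 2. token detection
        if match is None:
            return context                   # no tool requested: reasoning is done
        tool_name, tool_input = match.group(1), match.group(2).strip()
        result = tools[tool_name](tool_input)            # 3. tool execution
        context += f"\n<result>{result}</result>\n"      # 4. result integration
        # 5. the next loop iteration resumes reasoning with the updated context
    return context
```

In this sketch the loop ends either when the model produces a chunk with no tool call or when the (assumed) call budget is exhausted, which keeps a misbehaving model from looping forever.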

System Modules

Reasoning Engine

Main reasoning agent that decides when to call tools via special tokens

Model or implementation: DeepSeek-R1

Web-Search Agent (Tooling)

Retrieves and processes web information

Model or implementation: DeepSeek-V3 (for query breakdown/RAG) + Bing Search + Cohere Rerank 3.5
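
A rough sketch of how query breakdown and reranking might fit together is shown below. The `break_down`, `search`, and `score` callables are hypothetical placeholders for the DeepSeek-V3 query splitter, Bing Search, and Cohere Rerank respectively; none of them are real client APIs, and treating `rerank_threshold` as a minimum relevance score is an assumption based on the parameter's name.

```python
from typing import Callable, Dict, List, Tuple


def search_with_rerank(
    question: str,
    break_down: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    rerank_threshold: float = 0.7,
    top_n: int = 5,
) -> List[Tuple[float, str]]:
    """Split a question into sub-queries, retrieve passages, keep the relevant ones."""
    best: Dict[str, float] = {}
    for sub_query in break_down(question):
        for passage in search(sub_query):
            relevance = score(question, passage)
            # Drop low-relevance passages; deduplicate passages that are
            # returned by more than one sub-query, keeping the best score.
            if relevance >= rerank_threshold and relevance > best.get(passage, -1.0):
                best[passage] = relevance
    ranked = sorted(((s, p) for p, s in best.items()), reverse=True)
    return ranked[:top_n]
```

Scoring each passage against the original question rather than the sub-query is one possible design choice; it favors passages that serve the overall task over ones that only match a narrow sub-query.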

Code Agent (Tooling)

Executes computational tasks

Model or implementation: claude-3.5-sonnet (generation) + Python 3.11 (execution)

Mind-Map Agent

Constructs and queries a knowledge graph of the reasoning process

Model or implementation: DeepSeek-V3 (graph construction/retrieval)

Novel Architectural Elements

Mind-Map Agent: A dynamic memory module that builds a knowledge graph *of the reasoning process itself* (not just static data) to allow the model to query its own previous logic
Integration of Mind-Map context into Web-Search query refinement (using the reasoning graph to disambiguate search queries)
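
To make the idea of a graph "of the reasoning process itself" concrete, here is a toy sketch of such a memory. It assumes reasoning steps have already been reduced to (subject, relation, object) triples; in the framework that extraction is done by an LLM, whereas here triples are added by hand. The `MindMap` class and its method names are illustrative only.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

Triple = Tuple[str, str, str]


class MindMap:
    """Toy reasoning memory that indexes triples by the entities they mention."""

    def __init__(self) -> None:
        self._by_entity: DefaultDict[str, List[Triple]] = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Record one fact extracted from the reasoning chain."""
        triple = (subject, relation, obj)
        self._by_entity[subject].append(triple)
        if obj != subject:
            self._by_entity[obj].append(triple)

    def query(self, entity: str) -> List[Triple]:
        """Return every stored fact that mentions `entity`, in insertion order."""
        return list(self._by_entity.get(entity, []))

    def as_context(self, entity: str) -> str:
        """Render the facts about `entity` as text to feed back to the model."""
        return "\n".join(f"{s} --{r}--> {o}" for s, r, o in self.query(entity))
```

A call such as `as_context("Bob")` could also be prepended to an ambiguous search query, which is one way to picture the second bullet above.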

Modeling

Base Model: DeepSeek-R1

Training Method: Inference-time framework (no training reported)

Key Hyperparameters:

max_tokens: 32,768
temperature: 0.7
top_p: 0.8
top_k: 20
repetition_penalty: 1.05
rerank_threshold: 0.7
max_search_iterations: 3
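
For reference, the listed values could be collected in a single configuration object, as in the sketch below. The class name and the split between sampling and retrieval settings are assumptions inferred from the parameter names; the summary does not say which component consumes each value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Hyperparameters as listed above; the grouping is an assumption."""

    # Presumably decoding settings for the reasoning LLM.
    max_tokens: int = 32_768
    temperature: float = 0.7
    top_p: float = 0.8
    top_k: int = 20
    repetition_penalty: float = 1.05
    # Presumably settings for the Web-Search agent.
    rerank_threshold: float = 0.7
    max_search_iterations: int = 3

    def sampling_kwargs(self) -> dict:
        """Return only the decoding-related settings as a plain dict."""
        keys = ("max_tokens", "temperature", "top_p", "top_k", "repetition_penalty")
        return {key: getattr(self, key) for key in keys}
```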

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-O1: Agentic Reasoning includes a structured Mind-Map memory and specialized coding agent, whereas Search-O1 focuses primarily on search integration
vs. OpenAI Deep Research: Agentic Reasoning is an open framework achieving comparable results on expert tasks using open-weights models (DeepSeek-R1) rather than a closed system
vs. Standard RAG: Uses active agentic tools and dynamic knowledge graphs rather than static retrieval [not cited in paper]

Limitations

Dependency on powerful underlying reasoning models (DeepSeek-R1) and external APIs (Claude, Bing)
Latency concerns due to iterative retrieval, graph construction, and multiple LLM calls
Mind-Map construction adds computational overhead compared to simple context windows

Reproducibility

Code: https://github.com/theworldofagents/Agentic-Reasoning

Some components rely on proprietary or external APIs (Bing Search, Cohere Rerank, Claude-3.5-Sonnet for coding), so full replication depends on access to those services.

📊 Experiments & Results

Evaluation Setup

Expert-level problem solving and deep research tasks

Benchmarks:

Humanity's Last Exam (HLE): expert-level QA across broad subjects
GPQA Diamond: PhD-level multiple-choice science QA
GAIA: General AI Assistants benchmark (reasoning, browsing, tool use)
FreshWiki: deep research and article generation
Open Deep Research Tasks: real-world expert questions in finance, medicine, and law [New]

Metrics:

Accuracy
Win Rate (for Werewolf game)
ROUGE scores
Entity Recall
Statistical methodology: Not explicitly reported in the paper
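
The summary does not define Entity Recall. One common reading, used here as an assumption, is the fraction of entities from a reference article that also appear in the generated article. The sketch below uses plain case-insensitive substring matching, which is cruder than a real named-entity pipeline; the function name and inputs are hypothetical.

```python
from typing import Iterable


def entity_recall(reference_entities: Iterable[str], generated_text: str) -> float:
    """Fraction of reference entities mentioned in the generated text."""
    references = {e.strip().lower() for e in reference_entities if e.strip()}
    if not references:
        return 0.0  # nothing to recall; avoid dividing by zero
    text = generated_text.lower()
    hits = sum(1 for entity in references if entity in text)
    return hits / len(references)
```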

Key Results

Performance on expert-level benchmarks shows Agentic Reasoning achieving SOTA among open methods.

Benchmark	Metric	Baseline	This Paper	Δ
Humanity's Last Exam (HLE)	Accuracy	9.4	23.8	+14.4
GPQA Diamond	Accuracy	59.1	66.8	+7.7
GAIA Level 3	Accuracy	12.86	17.14	+4.28
GAIA Level 1	Accuracy	46.10	52.94	+6.84

Ablation studies demonstrate the critical role of the Mind-Map and tool combinations.

Benchmark	Metric	Baseline	This Paper	Δ
Werewolf Game	Win Rate	36	72	+36
GPQA	Accuracy	61.3	66.8	+5.5
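
The Δ column is an absolute difference between the two scores, not a relative gain. The short sketch below recomputes it from the Baseline and This Paper values in the table and also derives the relative change for comparison; the `deltas` helper is illustrative and not part of the paper.

```python
from typing import List, Tuple

# (benchmark, baseline, this paper) as reported in the table above.
RESULTS: List[Tuple[str, float, float]] = [
    ("Humanity's Last Exam (HLE)", 9.4, 23.8),
    ("GPQA Diamond", 59.1, 66.8),
    ("GAIA Level 3", 12.86, 17.14),
    ("GAIA Level 1", 46.10, 52.94),
    ("Werewolf Game", 36, 72),
    ("GPQA (ablation)", 61.3, 66.8),
]


def deltas(baseline: float, ours: float) -> Tuple[float, float]:
    """Return (absolute difference, relative change in percent)."""
    absolute = round(ours - baseline, 2)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative


for name, baseline, ours in RESULTS:
    absolute, relative = deltas(baseline, ours)
    print(f"{name}: {absolute:+} absolute, {relative:+}% relative to baseline")
```

The distinction matters most for low baselines: a gain that looks modest in absolute terms can be a large relative change, and the reverse holds for high baselines.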

Experiment Figures

Performance analysis relative to reasoning length (number of tool calls).

Main Takeaways

Tool quality matters more than quantity; increasing tool count (e.g., via LangChain) often degraded performance compared to a focused set of three agents.
Mind-Map is essential for long reasoning chains; it significantly boosts performance on questions requiring many steps and helps resolve complex logical relationships (e.g., riddles, Werewolf).
Web-search strategies benefit most from Query Breakdown and Reranking; standard Knowledge Refinement was found ineffective when combined with these.
The framework effectively bridges the gap between open-source models and proprietary 'Deep Research' systems, particularly in structured reasoning tasks.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with Chain-of-Thought (CoT) reasoning
Basic knowledge of Knowledge Graphs
Concepts of LLM tool use (function calling)

Key Terms

Mind-Map: A structured knowledge graph agent that stores reasoning context, clusters it, and allows the model to query past reasoning steps to maintain coherence

DeepSeek-R1: The base large reasoning model used as the primary LLM in this framework

GraphRAG: A method using knowledge graphs to structure and retrieve information, used here to build the Mind-Map

GPQA: A PhD-level multiple-choice science QA benchmark used to evaluate expert-level reasoning

GAIA: A benchmark for AI agents assessing reasoning, web browsing, and tool-use proficiency

Humanity's Last Exam: A difficult benchmark assessing AI performance across a broad range of expert subjects

Cohere Rerank: A commercial reranking model used to filter and order search results based on relevance

ROUGE: A set of metrics used to evaluate automatic summarization and machine translation by comparing to human references

SOTA: State of the art, the best performance currently achieved by any known method