DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning

📝 Paper Summary

Agentic RAG pipeline, Reinforcement Learning for Agents

DynaSearcher enhances search agents by integrating dynamic knowledge graph retrieval to guide reasoning paths and using multi-reward reinforcement learning to balance accuracy, efficiency, and response quality.

Core Problem

Current search agents rely on unstructured text and coarse outcome-based rewards, leading to noisy intermediate queries, inefficient search trajectories, and hallucinations during complex reasoning.

Why it matters:

Reliance on static parametric knowledge causes hallucinations in LLMs, while standard RAG struggles with complex multi-hop questions
Existing prompt-based agents (like ReAct or CoT) are sensitive to prompt formulation and fail to fully exploit agentic potential
Current RL-based search agents use coarse global rewards that fail to provide fine-grained guidance for intermediate steps, leading to redundant computations

Concrete Example: In a multi-hop question about two entities, a standard search agent might retrieve irrelevant documents due to keyword matching noise, diverting the reasoning path. DynaSearcher uses a knowledge graph to explicitly model the relationship between the entities, ensuring the intermediate query targets the correct connecting fact.

Key Novelty

Dynamic Knowledge Graph Augmented Multi-Reward RL

Integrates structured Knowledge Graphs (Wikidata) alongside document search to explicitly model entity relationships, guiding the agent away from noisy text and towards factually consistent queries
Deploys a multi-reward RL framework (accuracy + information gain + penalty) that specifically rewards high-quality intermediate queries while penalizing redundant or excessive search steps

Architecture

Overview of the DynaSearcher framework, detailing the iterative loop of reasoning, planning, and dual-retrieval (Document + Knowledge Graph).

Evaluation Highlights

+4.0 F1 improvement on HotpotQA (multi-hop) compared to Search-R1-v0.3 baseline using Qwen2.5-7B
Outperforms GPT-4.1 on HotpotQA (66.1 F1 vs 60.6 F1) using a much smaller Qwen2.5-7B base model
Achieves state-of-the-art results across six datasets (including 2Wiki, Musique, Bamboogle) compared to strong baselines like DeepSeek-R1 and ReSearch

Breakthrough Assessment

8/10

Significant performance jumps on complex reasoning tasks using small models (7B) by effectively combining structured knowledge (KG) with fine-grained RL incentives, outperforming much larger frontier models.

⚙️ Technical Details

Problem Definition

Setting: Open-domain multi-hop question answering where an agent must autonomously plan, retrieve information, and generate answers

Inputs: Natural language question x

Outputs: Answer y produced after a sequence of interleaved reasoning <think> and retrieval <search> steps

Pipeline Flow

Input Processing: Question → Planning
Action Generation: LLM generates <think> trace and <search> extraction (entities/relations)
Retrieval & Selection: Doc Search (Vector/Web) + KG Search (Wikidata) → Filter → Context
Generation: LLM integrates context → Answer
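
The pipeline flow above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the function names (run_agent, generate, doc_search, kg_search, kg_filter), the max_steps budget, and the convention that an output without a <search> tag is the final answer are all assumptions made for the example. The toy stubs stand in for the policy LLM, the retrievers, and the LLM-based KG filter.

```python
import re
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]
SEARCH_TAG = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def run_agent(
    question: str,
    generate: Callable[[str, List[str]], str],
    doc_search: Callable[[str], List[str]],
    kg_search: Callable[[str], List[Triple]],
    kg_filter: Callable[[str, List[Triple]], List[Triple]],
    max_steps: int = 4,  # hypothetical step budget
) -> str:
    """Interleave generation with dual retrieval until no <search> is emitted."""
    context: List[str] = []
    for _ in range(max_steps):
        output = generate(question, context)
        match = SEARCH_TAG.search(output)
        if match is None:
            # Assumption: an output without a search request is the answer.
            return output.strip()
        query = match.group(1).strip()
        docs = doc_search(query)                          # unstructured evidence
        triples = kg_filter(question, kg_search(query))   # structured evidence
        context.extend(docs)
        context.extend(" | ".join(t) for t in triples)
    # Budget exhausted: one last call; a real system would force an answer here.
    return generate(question, context).strip()


# Toy stand-ins so the loop can be run end to end.
def toy_generate(question: str, context: List[str]) -> str:
    if not context:
        return "<think>I need the birthplace.</think><search>Ada Lovelace</search>"
    return "London"


def toy_doc_search(query: str) -> List[str]:
    return ["Ada Lovelace was born in London in 1815."]


def toy_kg_search(query: str) -> List[Triple]:
    return [
        ("Ada Lovelace", "place of birth", "London"),
        ("Ada Lovelace", "occupation", "mathematician"),
    ]


def toy_kg_filter(question: str, triples: List[Triple]) -> List[Triple]:
    # Keyword stand-in for the LLM-based filter described in the summary.
    return [t for t in triples if "birth" in t[1]]
```

Calling run_agent("Where was Ada Lovelace born?", toy_generate, toy_doc_search, toy_kg_search, toy_kg_filter) performs one search round and then returns the stub's answer.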

System Modules

Policy Model

Generate reasoning traces, search queries, and final answers

Model or implementation: Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct

Doc Search Tool (Retrieval & Selection)

Retrieve unstructured text evidence

Model or implementation: multilingual-e5-base (local) or Tavily (web)

KG Search Tool (Retrieval & Selection)

Retrieve structured entity relationships to guide reasoning

Model or implementation: Fuzzy matching on Wikidata5M + KG Filter Module (LLM-based)
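
As a rough illustration of fuzzy entity matching over a triple store, the sketch below uses Python's standard difflib on a tiny in-memory list. The triples, the kg_search name, and the 0.7 similarity cutoff are assumptions for the example; the summary does not say which fuzzy-matching method or threshold the paper uses on Wikidata5M.

```python
import difflib
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Tiny hypothetical triple store standing in for Wikidata5M.
TRIPLES: List[Triple] = [
    ("Marie Curie", "spouse", "Pierre Curie"),
    ("Marie Curie", "place of birth", "Warsaw"),
    ("Pierre Curie", "place of birth", "Paris"),
]


def kg_search(entity: str, triples: List[Triple] = TRIPLES,
              cutoff: float = 0.7) -> List[Triple]:
    """Return triples whose head entity best fuzzy-matches `entity`."""
    # Map lowercased names back to their canonical spelling.
    lowered = {head.lower(): head for head, _, _ in triples}
    matches = difflib.get_close_matches(
        entity.lower(), list(lowered), n=1, cutoff=cutoff
    )
    if not matches:
        return []  # nothing similar enough: contribute no KG context
    head = lowered[matches[0]]
    return [t for t in triples if t[0] == head]
```

A misspelled mention such as "Marie Curry" still resolves to the Marie Curie triples, while an unrelated name returns an empty list, which is the case where the agent would rely on document search alone.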

Novel Architectural Elements

Integration of a dynamic Knowledge Graph retrieval loop explicitly interleaved with standard document search within an RL-trained agent
KG Filter Module: A specific sub-step where an LLM filters noisy KG triples before adding them to the context

Modeling

Base Model: Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)
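
To make the "group relative" part concrete, the sketch below computes advantages the way GRPO is commonly formulated: each rollout's reward is normalized by the mean and standard deviation of its group. Whether the paper uses exactly this normalization is not stated in this summary, so treat it as an assumption; the eps term and the function name are illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the other rollouts for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts per input, matching rollouts_per_input in the hyperparameters.
example_rewards = [1.2, 0.4, 0.0, 0.9, 0.4, 1.2, 0.1, 0.6]
example_advantages = group_relative_advantages(example_rewards)
```

Rollouts that beat their group's average get a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.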

Objective Functions:

Purpose: Ensure the final answer matches ground truth.
Formally: r_acc = r_format + r_ans (where r_ans combines F1 and Cover Exact Match scores)

Purpose: Encourage retrieval of relevant documents.
Formally: r_recall = TP / (TP + FN)

Purpose: Penalize excessive search steps to improve efficiency.
Formally: r_penalty = γ^(t-i) * β (applied if current steps t > ground truth hops i)

Purpose: Combine all signals.
Formally: r_total = r_acc + α * (r_recall + r_penalty)
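
The reward formulas above can be checked with a short sketch. The default constants come from the hyperparameter list in this summary; the function names, the zero penalty when t <= i, and the zero recall for an empty gold set are assumptions. Because the summary does not say how r_ans combines F1 and CEM, r_format and r_ans are taken as inputs rather than computed.

```python
from typing import Iterable

ALPHA = 0.5   # reward balance
GAMMA = 0.9   # penalty decay
BETA = -0.2   # penalty lower bound


def recall_reward(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """r_recall = TP / (TP + FN) over gold supporting documents."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0  # assumption: no gold documents means no recall reward
    true_positives = len(gold_set & set(retrieved))
    return true_positives / len(gold_set)  # TP + FN equals |gold|


def step_penalty(steps: int, hops: int,
                 gamma: float = GAMMA, beta: float = BETA) -> float:
    """r_penalty = gamma^(t - i) * beta, applied only when t > i."""
    if steps <= hops:
        return 0.0  # assumption: no penalty within the hop budget
    return gamma ** (steps - hops) * beta


def total_reward(r_format: float, r_ans: float, r_recall: float,
                 r_penalty: float, alpha: float = ALPHA) -> float:
    """r_total = r_acc + alpha * (r_recall + r_penalty), r_acc = r_format + r_ans."""
    r_acc = r_format + r_ans
    return r_acc + alpha * (r_recall + r_penalty)
```

For example, a rollout with r_format = 0, r_ans = 1.0, half of the gold documents retrieved, and one step more than the ground-truth hop count gets r_penalty = 0.9 * (-0.2) = -0.18 and r_total = 1.0 + 0.5 * (0.5 - 0.18) = 1.16.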

Training Data:

Stage-2 data from Song et al. (2025)
8,000 randomly sampled instances from Musique

Key Hyperparameters:

learning_rate: 1e-6
train_batch_size: 16
epochs: 1
kl_coefficient: 1e-3
alpha (reward balance): 0.5
gamma (penalty decay): 0.9
beta (penalty lower bound): -0.2
rollouts_per_input: 8

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-o1: DynaSearcher uses RL training rather than just prompting, and integrates structured KG retrieval.
vs. DeepSeek-R1: DynaSearcher is explicitly optimized for tool use (search) with fine-grained rewards, whereas R1 focuses on general reasoning via outcome rewards.
vs. StepSearch [not cited in paper]: Both use fine-grained rewards, but DynaSearcher uniquely integrates Knowledge Graphs to reduce intermediate query noise.
vs. ReSearch: DynaSearcher adds KG augmentation and a specific multi-reward structure (gain/penalty) rather than just outcome-based RL.

Limitations

Relies on the coverage and quality of the external Knowledge Graph (Wikidata5M); incomplete KGs may limit effectiveness.
KG entity linking and filtering adds computational overhead compared to pure text search.
Experiments focus on Qwen2.5 models; transferability to other architectures is not explicitly detailed.

Reproducibility

Code: https://modelscope.cn/collections/DynaSearcher-a00139d1ef2542

Code and models are released in the ModelScope collection linked above. The paper specifies the base models (Qwen2.5), training datasets (Song et al. stage-2, Musique), and key hyperparameters for RL. Wikidata5M is used for the KG.

📊 Experiments & Results

Evaluation Setup

Multi-hop Question Answering across 6 datasets (3 in-domain, 3 out-of-domain/generalization)

Benchmarks:

HotpotQA: Multi-hop QA (in-domain)
2WikiMultiHopQA: Multi-hop QA (in-domain)
Musique: Multi-hop QA (in-domain)
Bamboogle: Multi-hop QA (out-of-domain)
MoreHopQA: Multi-hop QA (out-of-domain)
Frames: Multi-hop QA (out-of-domain)

Metrics:

F1 score
Cover Exact Match (CEM)
Exact Match (EM)

Statistical methodology: Not explicitly reported in the paper
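
The three metrics listed above are usually computed on normalized strings. The sketch below follows the common SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the exact normalization used in the paper is not given in this summary, so that part is an assumption, as are the function names.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """EM: normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def cover_exact_match(prediction: str, gold: str) -> float:
    """CEM: normalized gold answer appears inside the normalized prediction."""
    return float(normalize(gold) in normalize(prediction))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

CEM is the most lenient of the three: a verbose prediction such as "It is in Paris, France." scores 1.0 on CEM against the gold answer "Paris" but 0.0 on EM.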

Key Results

Main results comparing DynaSearcher (7B) against baselines on standard multi-hop QA datasets.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	61.8	66.1	+4.3
2WikiMultiHopQA	F1	67.1	72.0	+4.9
HotpotQA	F1	60.6	66.1	+5.5

Ablation studies validating the contributions of the Knowledge Graph (KG) and Multi-Reward (MR) components.
Benchmark	Metric	Baseline	This Paper	Δ
HotpotQA	F1	61.8	66.1	+4.3
HotpotQA	F1	63.5	66.1	+2.6

Experiment Figures

Performance comparison (F1 score) of DynaSearcher vs. baselines (Search-R1, ReSearch) under different context length limits on HotpotQA.

Main Takeaways

DynaSearcher consistently outperforms standard RAG, prompt-based agents, and other RL-based agents across multiple datasets.
The 7B model version rivals or beats much larger closed-source models (GPT-4.1, Gemini-2.5-Pro) on specific multi-hop benchmarks.
Ablation studies confirm that both the Knowledge Graph augmentation and the Multi-Reward mechanism independently contribute to performance gains.
The method shows strong generalization to out-of-domain datasets (Bamboogle, MoreHopQA, Frames), suggesting it learns robust reasoning patterns rather than just memorizing training data.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (specifically GRPO)
Retrieval-Augmented Generation (RAG)
Knowledge Graphs (Wikidata)

Key Terms

GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for the same input and scores each one relative to the group's rewards, so no separate value model is needed

Knowledge Graph (KG): A structured representation of knowledge using entities (nodes) and relationships (edges), used here to reduce reasoning noise

CEM: Cover Exact Match, a metric that checks whether the ground-truth answer is contained in the generated prediction

Multi-reward RL: A reinforcement learning approach that uses multiple distinct reward signals (here accuracy, information gain, and a search-step penalty) rather than a single outcome reward

Wikidata5M: A large-scale knowledge graph dataset used as the source for structured entity-relationship retrieval

Tavily: A web search API used to retrieve up-to-date unstructured text information from the internet