SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

📝 Paper Summary

Research Agents, Reinforcement Learning with Verifiable Rewards (RLVR)

SynPlanResearch-R1 improves research agents by initializing them with synthetic, plan-guided trajectories that enforce diverse tool usage patterns before applying reinforcement learning, preventing premature convergence to shallow search behaviors.

Core Problem

RLVR-trained research agents often fail to discover effective tool-use strategies because they initialize from weak policies, leading to premature termination and biased, repetitive tool usage (e.g., over-relying on search, under-using crawling).

Why it matters:

Current agents stagnate in local optima and produce shallow answers to complex queries because they take too few search steps
On-policy RL (like RLVR) bootstraps from the agent's own rollouts; if the starting policy is poor, the agent rarely sees high-reward deep exploration trajectories to learn from
Agents exhibit strong bias toward familiar tools (web_search) while neglecting others (crawl_webpage), limiting evidence gathering

Concrete Example: For a complex query, a standard agent might issue one search and guess the answer immediately (premature termination). In contrast, SynPlanResearch-R1 forces the model to follow a plan like 'search -> crawl -> search -> crawl', discovering deep evidence it would otherwise miss.

Key Novelty

Plan-Guided Data Synthesis for Cold-Start SFT

Generates randomized 'tool plans' (sequences of required tool actions) to force the model to explore diverse, long-horizon research paths during data generation
Injects tool-dependent cues into the 'thought' process to softly guide the model to follow the plan without breaking natural reasoning flow
Uses a strong rewriter model (Claude) to paraphrase these cues into natural language, producing the synthetic dataset used for cold-start supervised fine-tuning

Evaluation Highlights

Achieves accuracy gains of up to +5.1 points on multi-hop QA benchmarks and +8.7 points on advanced QA benchmarks (GPQA, GAIA) with Qwen3-8B, compared to SOTA baselines
Gains hold at the smaller scale: +5.2 and +6.0 points on the respective benchmark groups with the Qwen3-4B backbone
Maintains higher policy entropy during early RL training, indicating the agent explores more diverse strategies rather than collapsing into a narrow solution path

Breakthrough Assessment

7/10

Strong empirical results and a clever, practical solution to the 'exploration problem' in RLVR by fixing the initialization. While the components (SFT + RL) are standard, the plan-guided synthesis is a novel and effective patch for agent myopia.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn question answering using external tools (ReAct framework)

Inputs: User query q, tool descriptions, and instructions

Outputs: A trajectory of Thought-Action-Observation steps culminating in a final answer
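
The Thought-Action-Observation structure can be pictured as a small record type plus a rollout loop. The sketch below is illustrative only: Step, run_episode, the stub policy, and the stub tools are hypothetical names invented for this example, not the paper's implementation. Only the tool names web_search and crawl_webpage come from the summary above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: str                        # tool name, or "answer" for the final step
    argument: str                      # tool input, or the final answer text
    observation: Optional[str] = None  # tool output; None on the final step


# Hypothetical policy signature: (query, history) -> (thought, action, argument).
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_episode(query: str, policy: Policy,
                tools: Dict[str, Callable[[str], str]],
                max_steps: int = 8) -> List[Step]:
    """Roll out one ReAct-style trajectory until the policy answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, argument = policy(query, history)
        if action == "answer":
            history.append(Step(thought, action, argument))
            break
        observation = tools[action](argument)
        history.append(Step(thought, action, argument, observation))
    return history


# Scripted stand-in for a model, purely for demonstration.
def stub_policy(query: str, history: List[Step]) -> Tuple[str, str, str]:
    script = [
        ("I should search first.", "web_search", query),
        ("I should read the top page.", "crawl_webpage", "https://example.com"),
        ("I have enough evidence.", "answer", "42"),
    ]
    return script[len(history)]


stub_tools = {
    "web_search": lambda q: f"results for {q}",
    "crawl_webpage": lambda url: f"content of {url}",
}

trajectory = run_episode("example question", stub_policy, stub_tools)
print([step.action for step in trajectory])
```

A real agent would replace stub_policy with model generation and stub_tools with actual search and crawl backends.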

Pipeline Flow

Tool Plan Generator (creates random sequence of tool actions)
Cue Injection (inserts soft guidance into 'thought' blocks)
Trajectory Synthesis (LRM generates full reasoning paths)
Filtration & Rewriting (removes invalid paths, paraphrases cues)
Cold-Start SFT (trains initial policy on synthetic data)
RLVR Optimization (refines policy using outcome rewards)

System Modules

Tool Plan Generator (Data Synthesis)

Creates a randomized sequence of tool actions (e.g., search, crawl, search) to enforce diversity

Model or implementation: Rule-based script
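
A rule-based plan generator of this kind could look roughly like the following. This is a minimal sketch under stated assumptions: each step is drawn independently, the plan length is uniform between l_min and l_max, and the function name sample_tool_plan and the example bounds (2 and 6) are hypothetical. The paper may impose further constraints on valid plans.

```python
import random
from typing import List, Optional

TOOLS = ("web_search", "crawl_webpage")


def sample_tool_plan(l_min: int, l_max: int, p_search: float = 0.5,
                     rng: Optional[random.Random] = None) -> List[str]:
    """Sample a random sequence of required tool actions.

    Assumption: steps are independent, with probability p_search of
    choosing web_search and 1 - p_search of choosing crawl_webpage.
    """
    rng = rng or random.Random()
    length = rng.randint(l_min, l_max)  # inclusive on both ends
    return [TOOLS[0] if rng.random() < p_search else TOOLS[1]
            for _ in range(length)]


# Example bounds are illustrative, not the paper's values.
plan = sample_tool_plan(2, 6, rng=random.Random(0))
print(plan)
```

Passing a seeded random.Random instance keeps plan sampling reproducible across synthesis runs.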

Thought Rewriter (Data Synthesis)

Paraphrases the cue-injected thoughts to sound natural and remove artifacts

Model or implementation: Claude (as cited in paper)

Research Agent

Executes multi-turn research to answer user queries

Model or implementation: Qwen3-8B / Qwen3-4B

Novel Architectural Elements

Plan-guided cue injection mechanism for synthetic data generation: purposefully constraining the LRM during data creation to force deep exploration trajectories that are then distilled into the SFT policy
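
One way to picture cue injection is as a lookup from the next planned tool to a short sentence that is placed at the start of the model's thought, so that generation continues from it. The sketch below is hypothetical throughout: the cue wording, the "Thought:" prefix, and the function names are placeholders. The paper's actual templates are in its Appendix A.3.

```python
from typing import List

# Placeholder cue texts; not the templates used in the paper.
CUE_TEMPLATES = {
    "web_search": "I should run another web search to find more sources.",
    "crawl_webpage": "I should open one of the pages found so far and read it in full.",
}
FINISH_CUE = "I have followed my research plan and can now write the final answer."


def cue_for_step(plan: List[str], step_index: int) -> str:
    """Return the cue for the planned tool, or a finishing cue once the plan is done."""
    if step_index >= len(plan):
        return FINISH_CUE
    return CUE_TEMPLATES[plan[step_index]]


def thought_prefix(plan: List[str], step_index: int) -> str:
    """Build the text that would seed the thought block at this step."""
    return "Thought: " + cue_for_step(plan, step_index) + " "


plan = ["web_search", "crawl_webpage", "web_search", "crawl_webpage"]
for i in range(len(plan) + 1):
    print(thought_prefix(plan, i))
```

Because the cue only seeds the thought rather than dictating the action, the model remains free to phrase its reasoning naturally, which is the "soft" guidance described above.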

Modeling

Base Model: Qwen3-8B and Qwen3-4B

Training Method: GRPO (Group Relative Policy Optimization)

Objective Functions:

Purpose: Optimize policy to maximize expected reward.

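Formally: GRPO objective maximizing 1/G * sum(min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)) - beta * KL, where ratio is the new-to-old policy probability ratio, A is the group-relative advantage, and KL is taken against a reference policy

Purpose: Shape reward based on correctness and format.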

Formally: r(q, y) = s_ans (F1 score) + f * alpha (format bonus) if valid, else penalties
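
The reward expression above can be read as a small function. The sketch below is one interpretation, not the paper's code: it assumes f is a 0/1 format indicator, alpha is the bonus weight, and an invalid rollout receives a single flat penalty. The default values alpha = 0.1 and invalid_penalty = -1.0 are made up for illustration, since the summary does not state them.

```python
def shaped_reward(s_ans: float, format_ok: bool, is_valid: bool,
                  alpha: float = 0.1, invalid_penalty: float = -1.0) -> float:
    """r(q, y) = s_ans + f * alpha if the rollout is valid, else a penalty.

    s_ans: answer score (F1 against the ground truth), in [0, 1].
    format_ok: whether the output follows the required format (f = 1 or 0).
    alpha, invalid_penalty: hypothetical values, not taken from the paper.
    """
    if not is_valid:
        return invalid_penalty
    f = 1.0 if format_ok else 0.0
    return s_ans + f * alpha


print(shaped_reward(0.75, format_ok=True, is_valid=True))
print(shaped_reward(0.75, format_ok=False, is_valid=True))
print(shaped_reward(1.0, format_ok=True, is_valid=False))
```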

Adaptation: Full parameter tuning implied (standard for 4B/8B models)

Training Data:

Synthesis using randomized tool plans (lengths L_min to L_max)
Filtering: Retain only trajectories with correct answers and valid ReAct format
Rewriting: Use Claude to paraphrase cues
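
The filtering step can be sketched as two checks applied to each synthesized trajectory. Everything here is an assumption made for illustration: trajectories are represented as lists of dicts, the format check only verifies that each tool step has a thought, an allowed action, and an observation, and correctness is a case-insensitive exact match. The paper's actual criteria may differ.

```python
from typing import Dict, List

ALLOWED_TOOLS = {"web_search", "crawl_webpage"}
Trajectory = List[Dict[str, str]]


def has_valid_format(steps: Trajectory) -> bool:
    """Check for a well-formed trajectory that ends in an answer step."""
    if not steps or steps[-1].get("action") != "answer":
        return False
    for step in steps[:-1]:
        if step.get("action") not in ALLOWED_TOOLS:
            return False
        if not step.get("thought") or step.get("observation") is None:
            return False
    return bool(steps[-1].get("thought"))


def is_correct(prediction: str, gold: str) -> bool:
    """Simplified correctness check (assumption: normalized exact match)."""
    return prediction.strip().lower() == gold.strip().lower()


def filter_trajectories(candidates: List[Trajectory], gold: str) -> List[Trajectory]:
    """Keep only trajectories that are well formed and reach the gold answer."""
    return [steps for steps in candidates
            if has_valid_format(steps)
            and is_correct(steps[-1].get("argument", ""), gold)]


good = [
    {"thought": "Search first.", "action": "web_search",
     "argument": "capital of France", "observation": "Paris is the capital."},
    {"thought": "Done.", "action": "answer", "argument": "Paris"},
]
wrong_answer = [{"thought": "Guess.", "action": "answer", "argument": "Lyon"}]
bad_format = [{"thought": "Search.", "action": "unknown_tool",
               "argument": "x", "observation": "y"}]

kept = filter_trajectories([good, wrong_answer, bad_format], gold="paris")
print(len(kept))
```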

Key Hyperparameters:

reward_threshold_tau: 0.8 (for F1 accuracy)
tool_plan_action_probability: 0.5 (equal chance for search/crawl)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Simple-RLVR: SynPlanResearch-R1 explicitly shapes the initial distribution to include deep tool-use chains, whereas Simple-RLVR relies on random exploration which often fails
vs. Iterative-SFT: SynPlanResearch-R1 uses RLVR for the final stage rather than just SFT, allowing on-policy improvement
vs. DeepSeek-R1 [not cited in paper]: DeepSeek-R1 uses RLVR for reasoning chains (CoT); this paper adapts the concept specifically for external tool-use trajectories

Limitations

Relies on a stronger teacher model (LRM) and Claude for data synthesis, increasing cost
Synthesis process rejects questions where the LRM fails to find the answer, potentially creating a bias in the training set
Reward signal depends on having ground truth answers, limiting applicability to open-ended research where truth is unknown

Reproducibility

Code: https://github.com/HansiZeng/syn-plan-research

The paper details the prompt structure, cue templates (Appendix A.3), and rewriting strategy. Data synthesis relies on a 'large reasoning model' (zero-prompted LRM) and Claude, which are closed-source dependencies.

📊 Experiments & Results

Evaluation Setup

Evaluated on 7 benchmarks covering multi-hop QA and open-web research

Benchmarks:

HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
MuSiQue (Multi-hop QA)
Bamboogle (Multi-hop QA)
GPQA (Advanced Domain QA)
WebWalkerQA (Web Agent QA)
GAIA (General Assistant QA)

Metrics:

Accuracy (Exact Match or F1 > 0.8)
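
The accuracy criterion can be sketched as follows. This is a minimal version that assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and token-level F1. The summary does not specify the normalization, so those details are assumptions, as are the function names.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def is_accurate(prediction: str, gold: str, tau: float = 0.8) -> bool:
    """Count a prediction as correct on exact match or F1 above tau."""
    return normalize(prediction) == normalize(gold) or token_f1(prediction, gold) > tau


print(is_accurate("The Eiffel Tower", "Eiffel Tower"))
print(is_accurate("Paris, France", "Paris"))
```

The second call has F1 of about 0.67 (precision 0.5, recall 1.0), which falls below the 0.8 threshold.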
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance comparisons using Qwen3-8B and Qwen3-4B backbones show SynPlanResearch-R1 outperforming baselines on aggregated benchmark sets.
Benchmark	Backbone	Metric	Baseline	This Paper	Δ
Multi-hop QA (Average)	Qwen3-8B	Accuracy	59.2	64.3	+5.1
Advanced QA (Average)	Qwen3-8B	Accuracy	41.5	50.2	+8.7
Multi-hop QA (Average)	Qwen3-4B	Accuracy	52.8	58.0	+5.2
Advanced QA (Average)	Qwen3-4B	Accuracy	35.5	41.5	+6.0

Main Takeaways

SynPlanResearch-R1 consistently outperforms vanilla RLVR and other baselines across model sizes (4B and 8B), which supports the value of better SFT initialization.
The method is particularly effective on 'Advanced QA' (GPQA, GAIA) where deep reasoning and diverse tool use are critical.
Analysis reveals that SynPlanResearch-R1 agents produce longer trajectories and more diverse tool usage patterns than baseline agents, which tend to terminate early.
Training dynamics show higher policy entropy, suggesting the plan-guided initialization successfully prevents early mode collapse during RL.
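
Trajectory length and tool diversity, as discussed in the takeaways above, can be measured with simple counts. The sketch below is illustrative: the function name is hypothetical and the two toy trajectory sets are made-up inputs, not results from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_tool_use(trajectories: List[List[str]]) -> Tuple[float, Dict[str, float]]:
    """Return mean number of tool calls and each tool's share of all calls."""
    total_calls = sum(len(t) for t in trajectories)
    mean_length = total_calls / len(trajectories) if trajectories else 0.0
    counts = Counter(tool for t in trajectories for tool in t)
    shares = {tool: n / total_calls for tool, n in counts.items()} if total_calls else {}
    return mean_length, shares


# Toy inputs for demonstration only.
shallow = [["web_search"], ["web_search", "web_search"]]
deep = [["web_search", "crawl_webpage", "web_search", "crawl_webpage"],
        ["web_search", "crawl_webpage"]]

print(summarize_tool_use(shallow))
print(summarize_tool_use(deep))
```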

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning with Verifiable Rewards (RLVR)
ReAct prompting (Reasoning + Acting)
Supervised Fine-Tuning (SFT)
Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO)

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—optimizing a model using only the correctness of the final answer (e.g., math or code execution) without human preference labels

Cold-start SFT: The initial supervised fine-tuning phase used to teach a model basic instruction-following and formatting before reinforcement learning begins

ReAct: Reasoning and Acting—a prompting framework where models generate a 'Thought' (reasoning trace) before emitting an 'Action' (tool call)

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same input, eliminating the need for a separate value network
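
The group-relative advantage can be sketched in a few lines: sample several outputs for the same query, score each one, and standardize the rewards within the group. This follows the commonly described GRPO formulation. The use of population standard deviation and the eps value are assumptions, and this is not the paper's training code.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for the same query.

    No value network is needed: the group mean acts as the baseline.
    eps avoids division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

When every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason a weak starting policy that rarely reaches high-reward trajectories learns slowly.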

Trajectory: The sequence of thoughts, tool actions, and observations generated by the agent to solve a single problem

WebWalkerQA: A benchmark for evaluating web agents that navigate and extract information from websites

GPQA: A challenging QA dataset written by domain experts (biology, physics, chemistry) that is difficult for non-experts to answer

GAIA: A benchmark for General AI Assistants evaluating reasoning, tool use, and multi-modality

Policy Entropy: A measure of the randomness in the agent's actions; higher entropy implies more exploration and less certainty/collapse on a single behavior
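
Policy entropy can be computed directly from the policy's probability distribution at each step. The sketch below uses Shannon entropy in nats and averages over steps. The function names and the example distributions are hypothetical, chosen only to show the contrast between a spread-out and a collapsed policy.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions: List[Sequence[float]]) -> float:
    """Average entropy over the steps of one or more rollouts."""
    if not step_distributions:
        return 0.0
    return sum(entropy(d) for d in step_distributions) / len(step_distributions)


exploratory = [[0.5, 0.5], [0.6, 0.4]]    # spread-out choices
collapsed = [[0.99, 0.01], [1.0, 0.0]]    # nearly deterministic choices

print(mean_entropy(exploratory))
print(mean_entropy(collapsed))
```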

On-policy: RL methods where the data used for training is generated by the current version of the policy itself