SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan, Tao Yang, Fengran Mo, Jiacheng Lin, Xian Li, Jingbo Shang
University of California, San Diego
arXiv (2026)
Agent RL Reasoning QA

📝 Paper Summary

Research Agents | Reinforcement Learning with Verifiable Rewards (RLVR)
SynPlanResearch-R1 improves research agents by initializing them with synthetic, plan-guided trajectories that enforce diverse tool usage patterns before applying reinforcement learning, preventing premature convergence to shallow search behaviors.
Core Problem
RLVR-trained research agents often fail to discover effective tool-use strategies because they initialize from weak policies, leading to premature termination and biased, repetitive tool usage (e.g., over-relying on search, under-using crawling).
Why it matters:
  • Current agents stagnate in local optima and give shallow answers to complex queries because they stop after too few search steps
  • On-policy RL (like RLVR) bootstraps from the agent's own rollouts; if the starting policy is poor, the agent rarely sees high-reward deep exploration trajectories to learn from
  • Agents exhibit strong bias toward familiar tools (web_search) while neglecting others (crawl_webpage), limiting evidence gathering
Concrete Example: For a complex query, a standard agent might issue one search and guess the answer immediately (premature termination). In contrast, SynPlanResearch-R1 forces the model to follow a plan like 'search -> crawl -> search -> crawl', discovering deep evidence it would otherwise miss.
Key Novelty
Plan-Guided Data Synthesis for Cold-Start SFT
  • Generates randomized 'tool plans' (sequences of required tool actions) to force the model to explore diverse, long-horizon research paths during data generation
  • Injects tool-dependent cues into the 'thought' process to softly guide the model to follow the plan without breaking natural reasoning flow
  • Uses a high-quality rewriter to paraphrase these cues into natural language, producing the synthetic dataset used for cold-start supervised fine-tuning (SFT)
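The plan-guided synthesis steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the tool names come from this summary, while the cue wording, the plan-length bounds, and the helper names `sample_tool_plan` and `cue_for_step` are assumptions made for illustration. The paraphrasing step performed by the rewriter is omitted.

```python
import random

# Tool names as they appear in the summary.
TOOLS = ("web_search", "crawl_webpage")

# Hypothetical cue templates; the paper's actual cue wording is not given here.
CUES = {
    "web_search": "I should run another search to find more sources.",
    "crawl_webpage": "I should open one of the result pages and read it in full.",
}
FINAL_CUE = "I now have enough evidence to answer."


def sample_tool_plan(rng, min_steps=2, max_steps=6):
    """Sample a random sequence of required tool actions (a 'tool plan').

    The length bounds are assumed values, not taken from the paper.
    """
    n_steps = rng.randint(min_steps, max_steps)
    return [rng.choice(TOOLS) for _ in range(n_steps)]


def cue_for_step(plan, step):
    """Return the cue to inject into the model's thought at a given step.

    Once the plan is exhausted, the cue nudges the model to answer.
    """
    if step >= len(plan):
        return FINAL_CUE
    return CUES[plan[step]]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
plan = sample_tool_plan(rng)
cues = [cue_for_step(plan, i) for i in range(len(plan) + 1)]

for step, cue in enumerate(cues):
    print(step, cue)
```

In a full pipeline each cue would be prepended to the model's thought before it chooses the next action, and the resulting trajectory would be kept for SFT only if it meets whatever filtering criteria the authors apply.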
Evaluation Highlights
  • Achieves up to +5.1% accuracy gain on multi-hop QA benchmarks and +8.7% on advanced QA benchmarks (GPQA, GAIA) using Qwen3-8B compared to SOTA baselines
  • Consistent improvements across model scales: +5.2% and +6.0% gains on the multi-hop and advanced QA benchmarks, respectively, with a Qwen3-4B backbone
  • Maintains higher policy entropy during early RL training, indicating the agent explores more diverse strategies rather than collapsing into a narrow solution path
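To make the policy-entropy point concrete, the following sketch computes the Shannon entropy $H(p) = -\sum_i p_i \log p_i$ of a categorical action distribution. The two example distributions are invented for illustration and are not measurements from the paper. A collapsed policy puts nearly all probability on one action and has low entropy, while a policy that is still exploring spreads probability more evenly.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution.

    Zero-probability entries contribute nothing and are skipped.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative distributions over four possible next actions (made-up numbers).
collapsed = [0.97, 0.01, 0.01, 0.01]  # almost always picks the same action
diverse = [0.25, 0.25, 0.25, 0.25]    # uniform: the maximum-entropy case

print("collapsed:", entropy(collapsed))
print("diverse:  ", entropy(diverse))
```

For language-model policies this quantity is usually averaged over the token-level distributions in sampled rollouts. How the paper aggregates it is not specified in this summary.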
Breakthrough Assessment
7/10
Strong empirical results and a clever, practical solution to the 'exploration problem' in RLVR by fixing the initialization. While the components (SFT + RL) are standard, the plan-guided synthesis is a novel and effective patch for agent myopia.