
RAPO: Expanding Exploration for LLM Agents via Retrieval-Augmented Policy Optimization

Siwei Zhang, Yun Xiong, Xi Chen, Zi'an Jia, Renhong Huang, Jiarong Xu, Jiawei Zhang
Fudan University, Zhejiang University, University of California, Davis
arXiv (2026)
Agent RAG RL Reasoning

📝 Paper Summary

Agentic Reinforcement Learning, Exploration in RL, Retrieval-Augmented Generation (RAG)
RAPO improves LLM agent training by dynamically injecting retrieved off-policy reasoning steps into on-policy rollouts and stabilizing updates with entropy-based retrieval rewards.
Core Problem
Existing agentic RL methods rely on on-policy exploration, which restricts the agent to its own self-generated behaviors. Current off-policy methods use external data only for static, trajectory-level estimation and so miss fine-grained step-level dynamics.
Why it matters:
  • Pure on-policy paradigms constrain the exploration space to the agent's pre-existing capabilities, preventing the discovery of novel reasoning perspectives.
  • Simply adding off-policy trajectories to the training set (trajectory-level) fails to actively expand the agent's 'reasoning receptive field' during the rollout process itself.
  • Effective exploration is critical for agents to solve complex, multi-step tasks requiring tool use and diverse reasoning paths.
Concrete Example: In a standard setup, an agent struggling with a math problem might repeatedly try the same flawed reasoning path (on-policy). Even if a better path exists in an external buffer, the agent never sees it *during* its own reasoning process to pivot. RAPO injects that better step directly into the agent's current thought process.
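The injection described in this example can be sketched as a toy rollout loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `hybrid_rollout`, `generate_step`, `retrieve_step`, and `retrieval_prob` are hypothetical names, and steps are plain strings rather than model outputs.

```python
import random
from typing import Callable, Dict, List, Optional


def hybrid_rollout(
    generate_step: Callable[[List[str]], str],
    retrieve_step: Callable[[List[str]], Optional[str]],
    max_steps: int,
    retrieval_prob: float,
    rng: random.Random,
) -> List[Dict[str, str]]:
    """Toy hybrid rollout: each step is either retrieved or self-generated.

    All names here are hypothetical. `generate_step` stands in for the
    on-policy agent, `retrieve_step` for a lookup into a buffer of
    off-policy steps (it may return None when nothing suitable is found).
    """
    context: List[str] = []
    trajectory: List[Dict[str, str]] = []
    for _ in range(max_steps):
        retrieved = None
        # With probability `retrieval_prob`, try to pull an external step.
        if rng.random() < retrieval_prob:
            retrieved = retrieve_step(context)
        if retrieved is not None:
            step, source = retrieved, "retrieved"
        else:
            step, source = generate_step(context), "on_policy"
        # Later steps are conditioned on everything in `context`,
        # including any injected external step.
        context.append(step)
        trajectory.append({"step": step, "source": source})
    return trajectory
```

Recording the `source` of each step is a design choice of this sketch: it lets a later training stage treat retrieved and self-generated steps differently.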
Key Novelty
Retrieval-Augmented Policy Optimization (RAPO)
  • **Hybrid-policy Rollout:** Instead of generating every step itself, the agent probabilistically retrieves a 'step' (thought/action) from a buffer of high-quality off-policy traces and reasons conditioned on that external step.
  • **Retrieval-aware Optimization:** Uses an entropy-based reward to quantify whether a retrieved step reduced the agent's uncertainty (i.e., was helpful), and an importance shaping mechanism to upweight gradients for these 'hybrid' trajectories.
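One way to read the entropy-based reward is as the drop in the agent's predictive entropy after a retrieved step is added to its context. The sketch below assumes exactly that form; the summary does not give the paper's formula, so `shannon_entropy` and `retrieval_reward` are hypothetical helpers, and the probability vectors stand in for the agent's next-token distributions before and after retrieval.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def retrieval_reward(
    probs_before: Sequence[float],
    probs_after: Sequence[float],
) -> float:
    """Assumed reward: entropy before retrieval minus entropy after.

    Positive when the retrieved step made the agent more certain,
    negative when it made the agent less certain.
    """
    return shannon_entropy(probs_before) - shannon_entropy(probs_after)


# Hypothetical distributions over four candidate next tokens.
uncertain = [0.25, 0.25, 0.25, 0.25]
confident = [0.97, 0.01, 0.01, 0.01]
reward = retrieval_reward(uncertain, confident)  # positive: uncertainty fell
```

Under this assumed definition, a retrieved step that leaves the distribution unchanged earns zero reward, which matches the intuition that an unhelpful retrieval should not be reinforced.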
Architecture
Figure 1(c) and Figure 2
Comparison of exploration paradigms and the RAPO workflow. Fig 1(c) shows RAPO's hybrid rollout expanding the exploration space. Fig 2 likely shows the Step-Trace Buffer and retrieval process.
Evaluation Highlights
  • Achieves an average gain of +5.0% across fourteen datasets on three agentic reasoning tasks compared to baselines.
  • Delivers 1.2x higher training efficiency by generating fewer on-policy tokens and optimizing gradient-bearing tokens more effectively.
Breakthrough Assessment
7/10
Novel integration of RAG directly into the RL exploration/rollout loop (step-level) rather than just context augmentation or static off-policy training. Addresses a core limitation of on-policy RL.