Fudan University,
Zhejiang University,
University of California, Davis
arXiv
(2026)
Agent, RAG, RL, Reasoning
📝 Paper Summary
Agentic Reinforcement Learning, Exploration in RL, Retrieval-Augmented Generation (RAG)
RAPO improves LLM agent training by dynamically injecting retrieved off-policy reasoning steps into on-policy rollouts and stabilizing updates with entropy-based retrieval rewards.
Core Problem
Existing Agentic RL methods rely on on-policy exploration, which restricts the agent to its own self-generated behaviors, while current off-policy methods only use external data for static trajectory-level estimation, missing fine-grained step-level dynamics.
Why it matters:
Pure on-policy paradigms constrain the exploration space to the agent's pre-existing capabilities, preventing the discovery of novel reasoning perspectives.
Simply adding off-policy trajectories to the training set (trajectory-level) fails to actively expand the agent's 'reasoning receptive field' during the rollout process itself.
Effective exploration is critical for agents to solve complex, multi-step tasks requiring tool use and diverse reasoning paths.
Concrete Example: In a standard setup, an agent struggling with a math problem might repeatedly try the same flawed reasoning path (on-policy). Even if a better path exists in an external buffer, the agent never sees it *during* its own reasoning, so it has no chance to pivot. RAPO injects that better step directly into the agent's current thought process.
Key Novelty
Retrieval-Augmented Policy Optimization (RAPO)
**Hybrid-policy Rollout:** Instead of generating every step itself, the agent probabilistically retrieves a 'step' (thought/action) from a buffer of high-quality off-policy traces and reasons conditioned on that external step.
**Retrieval-aware Optimization:** Uses an entropy-based reward to quantify if a retrieved step reduced uncertainty (was helpful) and an importance shaping mechanism to upweight gradients for these 'hybrid' trajectories.
Architecture
Comparison of exploration paradigms and the RAPO workflow. Fig 1(c) shows RAPO's hybrid rollout expanding the exploration space. Fig 2 likely shows the Step-Trace Buffer and retrieval process.
Evaluation Highlights
Achieves an average gain of +5.0% across fourteen datasets on three agentic reasoning tasks compared to baselines.
Delivers 1.2x higher training efficiency by generating fewer on-policy tokens and optimizing gradient-bearing tokens more effectively.
Breakthrough Assessment
7/10
Novel integration of RAG directly into the RL exploration/rollout loop (step-level) rather than just context augmentation or static off-policy training. Addresses a core limitation of on-policy RL.
⚙️ Technical Details
Problem Definition
Setting: Agentic Reinforcement Learning for multi-step reasoning tasks
Inputs: Query q sampled from dataset Q
Outputs: Multi-step reasoning trajectory S = (s_0, s_1, ..., s_{T-1}) consisting of thoughts, actions, and observations
Pipeline Flow
Initialization: Generate first step on-policy
Hybrid Rollout Loop: At each step, decide to Retrieve or Generate
Retrieval (if triggered): Query Step-Trace Buffer with history -> Get off-policy trace -> Concat to history
Generation (if not triggered): Sample from current policy
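The pipeline steps above can be read as a simple loop. The following is a minimal sketch, not the paper's implementation: the function name `hybrid_rollout`, the fixed retrieval probability `retrieve_prob`, the step cap `max_steps`, and the three callables are all hypothetical stand-ins for the policy, the retriever, and the stopping rule, none of which are specified in the summary.

```python
import random
from typing import Callable, List, Optional, Tuple


def hybrid_rollout(
    query: str,
    generate_step: Callable[[List[str]], str],            # stand-in for sampling from the current policy
    retrieve_step: Callable[[List[str]], Optional[str]],  # stand-in for querying the Step-Trace Buffer
    is_terminal: Callable[[str], bool],                   # stand-in for the stopping rule
    retrieve_prob: float = 0.3,                           # assumed constant; the paper's trigger may differ
    max_steps: int = 8,                                   # assumed cap on trajectory length
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Return (steps, sources); sources[i] is 'policy' or 'retrieved'."""
    rng = rng or random.Random()
    history = [query]

    # Initialization: the first step is always generated on-policy.
    history.append(generate_step(history))
    sources = ["policy"]

    while len(sources) < max_steps and not is_terminal(history[-1]):
        retrieved = None
        if rng.random() < retrieve_prob:
            # Retrieval branch: look up an off-policy step using the history so far.
            retrieved = retrieve_step(history)
        if retrieved is not None:
            # Concatenate the external step; later steps are conditioned on it.
            history.append(retrieved)
            sources.append("retrieved")
        else:
            # Generation branch (also the fallback when retrieval finds nothing).
            history.append(generate_step(history))
            sources.append("policy")

    return history[1:], sources
```

Keeping a per-step `sources` list is one way to know afterwards which tokens came from the buffer, which is the information the retrieval-aware optimization needs.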
Step-Trace Buffer
Store high-quality step-level traces decomposed from off-policy trajectories
Model or implementation: Key-Value Store
Retrieval Mechanism
Dynamically retrieve relevant off-policy steps during rollout
Model or implementation: RAG-based retriever
Policy Agent
Generate thoughts and actions, reasoning conditioned on potentially retrieved external traces
Model or implementation: LLM (architecture not specified in snippet)
Novel Architectural Elements
Hybrid-policy Agentic Rollout: A mechanism to interleave retrieved off-policy steps into an on-policy rollout trajectory dynamically.
Step-Trace Buffer: Decomposing trajectories into step-level Key-Value pairs for fine-grained retrieval context.
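To make the step-level Key-Value idea concrete, here is a minimal sketch of such a buffer. It assumes the key is the context preceding a step and the value is the step itself, and it swaps in plain token-overlap (Jaccard) matching for the retriever, since the summary does not say which retriever is used. The class name `StepTraceBuffer` and its methods are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a learned embedding."""
    return set(text.lower().split())


class StepTraceBuffer:
    """Stores (context, next_step) pairs decomposed from off-policy trajectories."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def __len__(self) -> int:
        return len(self._entries)

    def add_trajectory(self, query: str, steps: List[str]) -> None:
        # One entry per step: the key is everything that preceded the step.
        context = query
        for step in steps:
            self._entries.append((context, step))
            context = context + " " + step

    def retrieve(self, history: List[str], min_score: float = 0.0) -> Optional[str]:
        # Return the stored step whose key best overlaps the current history,
        # or None when no key scores strictly above min_score.
        probe = _tokens(" ".join(history))
        best_step, best_score = None, min_score
        for key, value in self._entries:
            key_tokens = _tokens(key)
            union = probe | key_tokens
            score = len(probe & key_tokens) / len(union) if union else 0.0
            if score > best_score:
                best_step, best_score = value, score
        return best_step
```

For example, after `buf.add_trajectory("solve x plus 2 equals 5", ["subtract 2 from both sides", "x equals 3"])`, calling `buf.retrieve(["solve x plus 2 equals 5"])` returns the first stored step, because its key matches the history exactly.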
Modeling
Base Model: Large Language Model (Specific variant not mentioned in text snippet)
Training Method: Retrieval-Augmented Policy Optimization (RAPO)
Objective Functions:
Purpose: Quantify retrieval quality using entropy reduction.
Formally: Z_ret = Mean(g_{s^t} * H_{s^{t-1}}), where g is a scaled tanh of the entropy difference.
Purpose: Calibrate gradient estimation for hybrid trajectories.
Formally: Reshape importance sampling ratio r_{t,j} using retrieved-token proportion F_ret.
Purpose: Combined optimization objective.
Formally: Maximize the advantage A_combined = A_acc + a * A_ret, where a weights the retrieval advantage, using a GRPO-style clipped objective.
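The objectives above can be illustrated numerically. The sketch below rests on several assumptions that the summary does not confirm: the entropy gain is taken as tanh(scale * (H_before - H_after)), both advantages are computed by GRPO-style group normalization, and `scale`, `weight` (the coefficient written as a above), and `clip_eps` are illustrative values only. Importance shaping is left out because its exact form is not given.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def retrieval_reward(h_before: Sequence[float], h_after: Sequence[float],
                     scale: float = 1.0) -> float:
    """Mean of g * H_before over retrieved steps, with g = tanh(scale * (H_before - H_after)).

    Positive when retrieved steps lowered the policy's entropy (assumed sign convention).
    """
    if len(h_before) != len(h_after):
        raise ValueError("entropy sequences must have equal length")
    if not h_before:
        return 0.0  # trajectory contained no retrieved step
    terms = [math.tanh(scale * (hb - ha)) * hb for hb, ha in zip(h_before, h_after)]
    return sum(terms) / len(terms)


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def combined_advantages(acc_rewards: Sequence[float], ret_rewards: Sequence[float],
                        weight: float = 0.5) -> List[float]:
    """A_combined = A_acc + weight * A_ret, one value per trajectory in the group."""
    a_acc = group_advantages(acc_rewards)
    a_ret = group_advantages(ret_rewards)
    return [x + weight * y for x, y in zip(a_acc, a_ret)]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token clipped objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Under these assumptions, a retrieved step that leaves entropy unchanged contributes zero reward, and one that raises entropy is penalized, which matches the stated purpose of rewarding retrievals that reduce uncertainty.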
Key Hyperparameters:
clip_epsilon: Not reported in the paper snippet
learning_rate: Not reported in the paper snippet
Compute: 1.2x higher training efficiency reported (a relative claim; absolute compute is not given in the paper snippet)
Comparison to Prior Work
vs. GRPO: RAPO introduces off-policy retrieval during rollout and modifies the loss with retrieval rewards.
vs. Adaptive Branching/Tree-Search: These are purely on-policy exploration methods; RAPO explicitly injects external off-policy behaviors to expand the search space.
vs. Trajectory-level Off-policy methods (Yan et al., 2025): RAPO operates on step-level dynamics rather than using full trajectories for static estimation.
Limitations
Dependency on the quality of the off-policy buffer; poor off-policy traces could mislead the agent.
Requires an existing off-policy agent or data source to populate the Step-Trace Buffer.
Complexity of managing hybrid trajectories and ensuring stable gradients with retrieved tokens (addressed by Importance Shaping but still non-trivial).
Reproducibility
The paper presents a clear algorithmic framework. However, the snippet does not provide specific code URLs, base model names, or detailed hyperparameters (learning rates, batch sizes), making exact reproduction impossible without the full text or appendices.