Agentic Entropy-Balanced Policy Optimization

📝 Paper Summary

RL-based Web agents

AEPO stabilizes agentic reinforcement learning by dynamically allocating rollout budgets based on entropy and preserving high-entropy token gradients during updates to prevent collapse and sustain exploration.

Core Problem

Entropy-guided agentic RL suffers from 'High-Entropy Rollout Collapse' (over-branching on single trajectories) and 'High-Entropy Token Gradient Clipping' (prematurely suppressing exploration due to aggressive gradient clipping).

Why it matters:

Current methods rely on entropy for exploration but fail to manage it, causing models to deplete sampling budgets on narrow high-entropy paths rather than exploring diverse strategies.
Vanilla RL algorithms aggressively clip gradients for high-entropy tokens (which often signal valuable tool-use uncertainty), causing the model to stop exploring and collapse into fixed reasoning patterns early in training.

Concrete Example: In a web search task, an agent might encounter 6 consecutive high-entropy steps. Existing methods like ARPO would burn the entire branching budget on this single chain, starving other potential paths (93.4% of branches concentrated on 1-3 trajectories), while subsequent policy updates would clip these high-entropy gradients, effectively ignoring the exploration signal.

Key Novelty

Agentic Entropy-Balanced Policy Optimization (AEPO)

Dynamic Entropy-Balanced Rollout: Adaptively allocates global vs. branch sampling budgets by pre-monitoring entropy gaps, and penalizes consecutive high-entropy branches to force wider rather than deeper exploration.
Entropy-Balanced Policy Optimization: Modifies the PPO clipping mechanism with a 'stop-gradient' term that allows high-entropy tokens to retain their gradients (rescaled) rather than being zeroed out, ensuring the model learns from uncertainty.

Architecture

The complete AEPO framework illustrating the two main phases: Dynamic Entropy-Balanced Rollout and Entropy-Balanced Policy Optimization.

Evaluation Highlights

+3.4 points Pass@1 on the GAIA benchmark (47.6% vs. 44.2% for ARPO) using Qwen3-14B with only 1k training samples.
+4.0 points Pass@1 on WebWalkerQA (43.0% vs. 39.0% for ARPO), suggesting improved generalization in web navigation tasks.
Achieves 26.0% Pass@5 on Humanity's Last Exam, outperforming PPO (13.6%) and ARPO (22.2%).

Breakthrough Assessment

8/10

Strong methodological contribution addressing specific failure modes of entropy-based RL (collapse and clipping). Consistent improvements across 14 diverse benchmarks reinforce its efficacy for generalist web agents.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn Agentic Reinforcement Learning where a policy interacts with a tool environment (search, browser, code executor) to solve complex queries.

Inputs: Natural language query x from dataset D

Outputs: Tool-integrated trajectory y containing reasoning thoughts and tool-call results

Pipeline Flow

Entropy Pre-monitoring: Generate initial trajectory, compute entropy gap, allocate global vs. branch budget
Dynamic Rollout: Generate trajectories, branching at high-entropy steps with penalties for consecutive branching
Policy Update: Calculate entropy-aware advantages and update policy using clipping-balanced objective
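
Every stage of the pipeline above depends on a per-step entropy signal. A minimal sketch of how that signal could be computed from raw next-token logits, using only the standard library. The function names and the choice to average token entropies over a step are illustrative assumptions, not details confirmed by this summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def step_entropy(step_logits):
    """Mean token entropy over the tokens of one step (assumed aggregation)."""
    return sum(token_entropy(l) for l in step_logits) / len(step_logits)
```

A uniform distribution over four tokens gives the maximum value log(4), while a sharply peaked distribution gives a value close to zero, which is the contrast the rollout controller would act on.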

System Modules

Entropy Pre-monitor (Rollout Control)

Determine sampling budget allocation (m global samples vs k-m branch samples)

Model or implementation: Qwen3-14B-Instruct (Policy Model)
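
The split between m global samples and k-m branch samples can be illustrated with a small sketch. It assumes a logistic mapping from the entropy gap to the global share, and assumes that a larger question-side entropy shifts budget toward global sampling. Both the mapping and the slope parameter `beta` are hypothetical and are not taken from this summary.

```python
import math

def allocate_budget(h_question, h_tool, k=8, beta=1.0):
    """Split a rollout budget k into (m global, k - m branch) samples.

    Assumption: the entropy gap h_question - h_tool is squashed by a
    logistic function with hypothetical slope beta to get the global share.
    """
    gap = h_question - h_tool
    share = 1.0 / (1.0 + math.exp(-beta * gap))
    m = round(k * share)
    # Keep at least one global sample so there is a trajectory to branch from.
    m = max(1, min(k, m))
    return m, k - m
```

With equal entropies the budget splits evenly, for example `allocate_budget(1.0, 1.0)` yields 4 global and 4 branch samples for k = 8.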

Branching Mechanism (Rollout Control)

Execute tree-structured rollout with penalty for consecutive high-entropy steps

Model or implementation: Qwen3-14B-Instruct (Policy Model)
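
A rough sketch of how a penalty on consecutive high-entropy branches could work. The scoring form (`alpha + gamma * entropy_delta`), the linear penalty, and the `max_consecutive` cap are all illustrative assumptions. Only the idea of discouraging back-to-back branching and the threshold value 0.45 come from this summary.

```python
def branch_score(entropy_delta, consecutive_branches,
                 alpha=0.5, gamma=0.1, max_consecutive=3):
    """Score for branching at the current step, in [0, 1].

    Assumed form: a base score alpha + gamma * entropy_delta, shrunk
    linearly toward zero as the streak of consecutive branches grows.
    """
    base = min(1.0, max(0.0, alpha + gamma * entropy_delta))
    penalty = min(1.0, consecutive_branches / max_consecutive)
    return base * (1.0 - penalty)

def plan_branches(entropy_deltas, tau=0.45, **kwargs):
    """Return the step indices at which one trajectory would branch."""
    branched, streak = [], 0
    for step, delta in enumerate(entropy_deltas):
        if branch_score(delta, streak, **kwargs) > tau:
            branched.append(step)
            streak += 1
        else:
            streak = 0  # a non-branching step resets the streak
    return branched
```

For six consecutive high-entropy steps, `plan_branches([4.0] * 6)` branches at steps 0, 1, 3, and 4 rather than at all six, leaving budget for other trajectories.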

Policy Optimizer

Update model weights using entropy-balanced loss

Model or implementation: Qwen3-14B-Instruct (Policy Model)

Novel Architectural Elements

Dynamic allocation of global vs. branch sampling budget based on entropy gap (Information Gain perspective).
Consecutive branch penalty logic within the rollout tree construction.
Stop-gradient insertion into the PPO clipping term to decouple forward/backward passes for high-entropy tokens.
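
The effect of the stop-gradient insertion can be seen by writing out the forward value and the derivative by hand. The sketch below assumes δ denotes the importance ratio and that the clipped branch is rewritten as clip(δ) * δ / sg(δ). It applies the rescaling to any clipped high-entropy token, which may be broader than the rule used in the paper. Function names are hypothetical.

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def vanilla_clip_term(ratio, adv, eps=0.2):
    """PPO term min(r*A, clip(r)*A): returns (value, d value / d ratio)."""
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps)
    if ratio * adv <= clipped * adv:
        return ratio * adv, adv   # unclipped branch: gradient flows
    return clipped * adv, 0.0     # clipped branch is flat in ratio

def balanced_clip_term(ratio, adv, eps=0.2, high_entropy=True):
    """Same forward value as vanilla clipping, but for high-entropy tokens
    the clipped branch is clip(r) * r / sg(r), so its gradient becomes
    clip(r) / r * A instead of zero (sg(r) is constant in the backward pass).
    """
    value, grad = vanilla_clip_term(ratio, adv, eps)
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps)
    was_clipped = ratio * adv > clipped * adv
    if high_entropy and was_clipped:
        grad = (clipped / ratio) * adv
    return value, grad
```

For a ratio of 1.5 with a positive advantage of 1.0 and eps = 0.2, both variants return a forward value of 1.2, but the vanilla gradient is 0 while the rescaled gradient is 1.2 / 1.5 = 0.8.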

Modeling

Base Model: Qwen3-14B-Instruct

Training Method: Agentic Entropy-Balanced Policy Optimization (AEPO)

Objective Functions:

Purpose: Optimize policy while preserving high-entropy gradients.

Formally: Maximizes expected advantage with a modified clipping term containing stop-gradient sg(δ) that rescales gradients when outside clip range.

Purpose: Prioritize learning on high-uncertainty tokens.

Formally: Reshapes advantage A_total = A_acc + λ * A_ent, where A_ent is derived from token entropy.
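
The advantage reshaping A_total = A_acc + λ * A_ent can be sketched per token as follows. The summary does not say how A_ent is derived from token entropy, so standardizing entropies within a sequence is an assumption made here for illustration, as is the default value of `lam`.

```python
from statistics import mean, pstdev

def entropy_advantage(token_entropies):
    """Assumed form of A_ent: entropies standardized within one sequence."""
    mu, sigma = mean(token_entropies), pstdev(token_entropies)
    if sigma == 0.0:
        return [0.0] * len(token_entropies)
    return [(h - mu) / sigma for h in token_entropies]

def total_advantage(a_acc, token_entropies, lam=0.1):
    """A_total = A_acc + lam * A_ent, applied token by token."""
    return [a_acc + lam * a for a in entropy_advantage(token_entropies)]
```

Under this assumption, tokens with above-average entropy receive a slightly larger advantage than the trajectory-level value, and tokens with below-average entropy receive a slightly smaller one.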

Training Data:

Uses 1k training samples from a mixture of datasets (HotpotQA, 2WikiMultihopQA, GSM8K, MATH, etc.)
Generated rollouts using the dynamic entropy-balanced mechanism

Key Hyperparameters:

training_samples: 1000
rollout_budget_k: 8
branch_threshold_tau: 0.45
entropy_stabilization_gamma: 0.1
clipping_epsilon: 0.2

Compute: Not reported in the paper

Comparison to Prior Work

vs. ARPO: AEPO adds dynamic budget allocation and consecutive branch penalties; ARPO uses fixed thresholds and suffers from collapse.
vs. GRPO: AEPO uses tree-structured rollouts and modifies the clipping mechanism; GRPO uses standard i.i.d. sampling and vanilla clipping.
vs. DAPO: AEPO uses stop-gradient rescaling for high-entropy tokens; DAPO only raises the clip threshold.

Limitations

Computational cost of entropy calculation at every step during rollout.
Complexity of tuning additional hyperparameters (alpha, beta, gamma) for the rollout mechanism.
Reliance on 1k training samples; scaling behavior to massive datasets not fully explored.

Reproducibility

Code: https://github.com/dongguanting/ARPO

Hyperparameters for rollout and entropy calculations are specified.

📊 Experiments & Results

Evaluation Setup

Agentic tasks involving Web Search, Web Browser, and Code Executor.

Benchmarks:

GAIA (General AI Assistant, Levels 1-3)
Humanity's Last Exam (HLE) (Hard reasoning/knowledge)
WebWalkerQA (Web navigation and QA)
HotpotQA (Multi-hop QA)

Metrics:

Pass@1
Pass@5
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance comparison on General Web Agent Benchmarks (Pass@1) shows AEPO consistently outperforming baselines.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
WebWalkerQA	Pass@1	39.0	43.0	+4.0
Humanity's Last Exam	Pass@1	10.4	11.2	+0.8

Pass@5 results demonstrate AEPO's ability to generate diverse and correct solutions.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Humanity's Last Exam	Pass@5	22.2	26.0	+3.8

Experiment Figures

Radar chart comparing AEPO against baselines (PPO, GRPO, ARPO, etc.) across multiple benchmark categories.

Pilot experiment visualization of High-Entropy Rollout Collapse and Token Gradient Clipping.

Main Takeaways

AEPO consistently outperforms 7 mainstream RL algorithms across 14 datasets.
The method is particularly effective on complex, long-horizon tasks like GAIA and WebWalkerQA where diverse tool use is critical.
Analysis reveals AEPO maintains higher and more stable policy entropy throughout training compared to baselines, preventing collapse.
Ablation studies confirm both the dynamic rollout and the entropy-balanced policy update are necessary for optimal performance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO)
Language Model Entropy
Tree-structured search/rollout
Gradient Clipping

Key Terms

AEPO: Agentic Entropy-Balanced Policy Optimization—the proposed algorithm balancing rollout diversity and gradient updates.

Rollout: The process of generating trajectory samples from the policy during RL training.

Pass@k: A metric measuring the percentage of problems where at least one correct solution is found in k generated samples.

GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same input.

ARPO: Agentic Reinforced Policy Optimization, a prior entropy-guided baseline method.

Stop-gradient: An operation that prevents error gradients from flowing backward through a specific part of the computation graph.

Information Bottleneck: A theoretical framework used here to justify allocating sampling budget based on the information gain (entropy difference) between questions and tool outputs.
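
Two of the terms above, Pass@k and GRPO's group-normalized advantage, can be sketched in a few lines. This follows the definitions given in this glossary. The use of the population standard deviation and the small `eps` stabilizer are implementation assumptions.

```python
from statistics import mean, pstdev

def pass_at_k(samples_per_problem, k):
    """Percentage of problems with at least one correct answer
    among the first k samples (each sample is True if correct)."""
    solved = sum(1 for samples in samples_per_problem if any(samples[:k]))
    return 100.0 * solved / len(samples_per_problem)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize the rewards of all outputs
    sampled for the same input."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For two problems with samples `[[False, True], [False, False]]`, Pass@1 is 0.0 and Pass@2 is 50.0, since only the first problem is solved within two attempts.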