ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

📝 Paper Summary

RL-based Agent Benchmark

ARLArena identifies that stable agentic RL requires sequence-level clipping, fine-grained advantage estimation, and dynamic filtering, proposing SAMPO to unify these principles into a stable training algorithm.

Core Problem

Agentic Reinforcement Learning is highly unstable and prone to training collapse due to the multi-turn nature of interactions, invalid actions, sparse rewards, and non-stationary dynamics.

Why it matters:

Instability limits scalability to larger environments and longer interaction horizons essential for complex agent tasks
Current training outcomes are difficult to reproduce across runs, constraining systematic algorithmic research
Small deviations in early decisions cascade into degenerate rollouts, making credit assignment extremely noisy

Concrete Example: In ALFWorld, tolerant clipping methods like CISPO exhibit rapid early gains but suffer sudden collapse around step 130, where gradient norms explode and the valid-format ratio of actions drops sharply, ruining the policy.

Key Novelty

SAMPO (Stable Agentic Multi-turn Policy Optimization)

Decomposes policy gradient training into four dimensions: loss aggregation, importance sampling clipping, advantage design, and dynamic filtering to isolate stability factors
Identifies that 'tolerant' token-level clipping causes collapse while sequence-level clipping stabilizes training by constraining off-policy drift
Combines sequence-level clipping with fine-grained environmental advantages (unifying global and local signals) and dynamic trajectory filtering to prevent degenerate updates

Evaluation Highlights

SAMPO achieves 92.72% success rate on ALFWorld, outperforming the GRPO baseline (62.36%) by +30.36 percentage points
On Sokoban (planning task), SAMPO reaches 88.86% success, surpassing the strong GIGPO baseline (82.67%)
Outperforms proprietary models: Qwen3-4B trained with SAMPO (92.72%) beats GPT-5.2 (51.56%) and o3-based multi-agent systems (56.25%) on ALFWorld

Breakthrough Assessment

9/10

Provides a reproducible recipe for stabilizing agentic RL, which has long been treated as a 'black art'. The decomposition analysis is thorough, and the resulting method (SAMPO) shows gains of +10.25 to +30.36 points over the baselines in the main comparison.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn Agentic Reinforcement Learning where a policy interacts with an environment over K turns to generate a trajectory of (state, action) pairs

Inputs: Initial user prompt x(1) and environment state s(1)

Outputs: Multi-turn trajectory τ = (x(1), y(1), ..., x(K), y(K)) ending in task success or failure

Pipeline Flow

Environment Interaction (Rollout)
Trajectory Segmentation
Advantage Estimation
Policy Update (SAMPO)

System Modules

Agent Policy (Interaction)

Generates multi-turn actions/responses based on history

Model or implementation: Qwen3-4B (base or SFT variant)

Format Enforcer (Interaction)

Checks output structure (e.g., <think>, <action> tags) and applies dense penalties for violations

Model or implementation: Rule-based regex
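
A rule-based format check of this kind can be sketched in a few lines. This is a minimal illustration, not the ARLArena implementation: the exact tag layout (one <think> block followed by one <action> block), the function names, and the penalty value of -0.1 are all assumptions.

```python
import re

# Assumed layout: exactly one <think> block followed by one <action> block.
_FORMAT = re.compile(
    r"\s*<think>(.+?)</think>\s*<action>(.+?)</action>\s*", re.DOTALL
)

def format_penalty(response: str, penalty: float = -0.1) -> float:
    """Return 0.0 for a well-formed response, else a (hypothetical) penalty."""
    return 0.0 if _FORMAT.fullmatch(response) else penalty

def extract_action(response: str):
    """Return the action text, or None when the format is violated."""
    match = _FORMAT.fullmatch(response)
    return match.group(2).strip() if match else None

good = "<think>The key is in the drawer.</think><action>open drawer 1</action>"
bad = "open drawer 1"
print(format_penalty(good), format_penalty(bad), extract_action(good))
```

Applying the penalty at every turn, rather than once per episode, is what makes the signal dense.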

Dynamic Filter (Optimization)

Removes trajectories with degenerate advantages (all-success or all-fail groups) to stabilize gradients

Model or implementation: Algorithmic filter
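
The filtering rule can be illustrated as follows. This is a sketch under stated assumptions: rollouts are grouped by task, each rollout has a scalar reward, and a reward above a hypothetical threshold of 0.5 counts as success. The function name and data layout are illustrative.

```python
from typing import Dict, List

def filter_groups(
    group_rewards: Dict[str, List[float]], success_threshold: float = 0.5
) -> Dict[str, List[float]]:
    """Keep only groups that contain both successes and failures.

    All-success and all-fail groups have zero group-relative advantage,
    so they contribute no learning signal.
    """
    kept = {}
    for task_id, rewards in group_rewards.items():
        n_success = sum(r > success_threshold for r in rewards)
        if 0 < n_success < len(rewards):
            kept[task_id] = rewards
    return kept

groups = {
    "task_a": [1.0, 1.0, 1.0, 1.0],  # all success: dropped
    "task_b": [0.0, 0.0, 0.0, 0.0],  # all fail: dropped
    "task_c": [1.0, 0.0, 0.0, 1.0],  # mixed: kept
}
print(list(filter_groups(groups)))
```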

SAMPO Optimizer (Optimization)

Updates policy weights using sequence-level clipped objective and fine-grained advantages

Model or implementation: Gradient Descent Optimizer

Novel Architectural Elements

Unified SAMPO architecture integrating three distinct stability mechanisms: sequence-level IS clipping, hybrid global/local advantage estimation, and dynamic trajectory filtering

Modeling

Base Model: Qwen3-4B (and Qwen3-8B in appendix)

Training Method: SAMPO (Stable Agentic Multi-turn Policy Optimization)

Objective Functions:

Purpose: Maximize expected reward while keeping policy updates stable.

Formally: L(θ) = E[min(s_i(θ) A', clip(s_i(θ), 1-ε, 1+ε) A')]

Purpose: Define fine-grained advantage combining global and local signals.

Formally: A' = A_i + ω * A_step(y_step)

Purpose: Filter uninformative or destabilizing trajectories.

Formally: 0 < |{y | is_equivalent(a, y)}| < G
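
The first two formulas can be traced numerically for a single sequence. The sketch below assumes a GSPO-style, length-normalized sequence ratio (the exponential of the mean per-token log-probability difference); the summary does not spell out how s_i(θ) is computed, so that choice, the function names, and the example numbers are assumptions.

```python
import math
from typing import List

def sequence_ratio(new_logps: List[float], old_logps: List[float]) -> float:
    """Sequence-level importance ratio s_i (assumed length-normalized)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def combined_advantage(a_global: float, a_step: float, omega: float = 1.0) -> float:
    """A' = A_i + omega * A_step."""
    return a_global + omega * a_step

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(s * A', clip(s, 1 - eps, 1 + eps) * A') for one sequence."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

s = sequence_ratio([-1.0, -0.5], [-1.2, -0.9])  # exp(0.3), above 1 + eps
a = combined_advantage(0.5, 0.5)                # 1.0
print(s, clipped_surrogate(s, a))               # surrogate is capped at 1.2
```

Because one ratio is shared by every token in the sequence, the whole sequence is either clipped or not, which is the property the paper credits for limiting off-policy drift.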

Adaptation: Full parameter update

Trainable Parameters: All parameters of Qwen3-4B

Training Data:

Rollouts generated by the model itself in the environment (SFT data is created by keeping only high-scoring rollouts)

Key Hyperparameters:

kl_coefficient: 0.05 (for stability analysis)
clip_epsilon: Standard ranges (e.g., 0.1-0.2)
mini_batch_size: Varies (e.g., 1024 for off-policy tests)
rollout_batch_size: 128 (low off-policy), 1024 (high off-policy)
learning_rate: Not explicitly reported in the paper body

Compute: NVIDIA H200 or B200 GPUs used

Comparison to Prior Work

vs. GRPO: SAMPO adds sequence-level clipping, fine-grained advantage, and dynamic filtering; GRPO uses token-mean aggregation and simple group advantage
vs. SAPO/CISPO: SAMPO uses sequence-level clipping which prevents collapse; SAPO/CISPO use tolerant clipping which the paper shows leads to collapse in agentic settings
vs. GIGPO: SAMPO incorporates dynamic filtering and sequence-level clipping on top of fine-grained advantage; GIGPO focuses primarily on the advantage term

Limitations

Sequence-mean-token-mean loss aggregation can degrade performance on tasks with high length variance like math reasoning
Dynamic filtering benefits are inconsistent unless combined with diverse advantage signals (like GIGPO's)
Training is sensitive to off-policy staleness; high off-policy ratios degrade performance
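
The first limitation comes down to how per-token losses are averaged. The sketch below contrasts the two schemes, assuming the common definitions: token-mean pools every token in the batch, while sequence-mean-token-mean averages within each sequence first and then across sequences. The function names and toy values are illustrative.

```python
from typing import List

def token_mean(losses: List[List[float]]) -> float:
    """Average over all tokens in the batch; long sequences dominate."""
    total = sum(sum(seq) for seq in losses)
    count = sum(len(seq) for seq in losses)
    return total / count

def seq_mean_token_mean(losses: List[List[float]]) -> float:
    """Average within each sequence, then across sequences."""
    return sum(sum(seq) / len(seq) for seq in losses) / len(losses)

# One short sequence with high loss, one long sequence with zero loss.
batch = [[1.0, 1.0], [0.0] * 8]
print(token_mean(batch), seq_mean_token_mean(batch))
```

With high length variance, the two schemes weight the same tokens very differently, which is consistent with the sensitivity reported for math reasoning.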

Reproducibility

Code: https://github.com/WillDreamer/ARL-Arena

Code is publicly available. A clean testbed recipe is provided (BC + Format Penalty + KL). Baseline hyperparameters were tuned via grid search until the stability criteria were met.

📊 Experiments & Results

Evaluation Setup

Multi-turn agentic interaction across diverse environments

Benchmarks:

ALFWorld (Embodied text adventure / planning)
WebShop (Web navigation and shopping)
Sokoban (Puzzle solving / spatial reasoning)
TIR Math (Tool-integrated mathematical reasoning)

Metrics:

Success Rate
Task Score
Pass@4 (for Math)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main performance comparison showing SAMPO's dominance across different agentic tasks compared to various PO baselines.
Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	62.36	92.72	+30.36
WebShop	Success Rate	75.32	88.37	+13.05
Sokoban	Success Rate	57.71	77.73	+20.02
TIR Math (AIME)	Pass@4	49.96	60.21	+10.25

Ablation on Importance Sampling strategies reveals collapse in tolerant clipping methods versus stability in sequence-level clipping.
Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	62.36	25.16	-37.20
ALFWorld	Success Rate	62.36	78.61	+16.25

Stabilization strategies applied to collapsing methods (SAPO/CISPO).
Benchmark	Metric	Baseline	This Paper	Δ
ALFWorld	Success Rate	54.42	78.88	+24.46

Main Takeaways

Tolerant clipping (SAPO/CISPO) induces training collapse in agentic settings due to accumulation of negative-advantage sequences with low IS ratios; sequence-level clipping (GSPO/SAMPO) is required for stability.
Fine-grained advantage design (GIGPO/SAMPO) that incorporates environment state consistently improves performance over simple group relative advantage.
Dynamic filtering is beneficial primarily when combined with diverse advantage signals; with simple advantages (GRPO), it can accidentally filter out useful format-correction signals.
SAMPO's unified approach (sequence clipping + fine advantage + filtering) outperforms any single dimension optimization alone.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Gradients, PPO)
Language Model Post-training (SFT, RLHF)
Importance Sampling

Key Terms

ARL: Agentic Reinforcement Learning—training LLM agents to solve multi-step interactive tasks via reinforcement learning

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing multiple outputs for the same input, avoiding a separate value network
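
A minimal sketch of a group-relative advantage, assuming the common mean-and-standard-deviation normalization within a group of rollouts for the same prompt. The function name and the small eps stabilizer are illustrative choices.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all zeros: no signal
```

The second call shows why all-success or all-fail groups are uninformative: every advantage is zero.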

Importance Sampling (IS): A technique to estimate properties of a target distribution using samples from a different proposal distribution, weighting samples by the ratio of their probabilities
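
A toy illustration on a three-outcome distribution. The distributions, the function f, and the sample count are made-up values chosen only to show the reweighting; in policy optimization the target plays the role of the new policy and the proposal that of the old policy.

```python
import random

def importance_sampling_estimate(f, p, q, n_samples=50_000, seed=0):
    """Estimate E_p[f] from samples drawn under q, weighting each by p(x)/q(x)."""
    rng = random.Random(seed)
    outcomes = list(range(len(q)))
    total = 0.0
    for x in rng.choices(outcomes, weights=q, k=n_samples):
        total += f[x] * p[x] / q[x]
    return total / n_samples

p = [0.1, 0.2, 0.7]          # target distribution
q = [1 / 3, 1 / 3, 1 / 3]    # proposal distribution actually sampled from
f = [1.0, 2.0, 3.0]
exact = sum(fi * pi for fi, pi in zip(f, p))  # 2.6
estimate = importance_sampling_estimate(f, p, q)
print(exact, estimate)
```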

Clipping: Constraining the policy update ratio (new policy / old policy) to a small range (e.g., 0.9 to 1.1) to prevent destructively large updates

PPO: Proximal Policy Optimization—a standard RL algorithm that uses clipping to ensure stable policy updates

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
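
For discrete distributions the definition is a single sum. The sketch below assumes both inputs are probability lists over the same outcomes with q nonzero wherever p is nonzero; the function name is illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x)), skipping terms with p(x) = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # positive when they differ
```

In a KL-regularized objective, this quantity (scaled by a coefficient such as the kl_coefficient of 0.05 listed under Key Hyperparameters) penalizes drift from the reference policy.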

Behavior Cloning (BC): Supervised learning where the agent learns to mimic expert demonstrations or high-quality self-generated trajectories

SFT: Supervised Fine-Tuning—training the model on labeled data before RL

Off-policy staleness: The discrepancy that arises when the policy being updated has drifted significantly from the policy that generated the training data

Pass@k: A metric measuring the probability that at least one correct solution is found in k generated samples
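
One widely used way to compute this is the unbiased combinatorial estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them are correct. Whether the paper computes Pass@4 this way or by sampling exactly four solutions is not stated here, so treat the sketch as an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: every draw has a hit
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=8, c=2, k=4))
print(pass_at_k(n=4, c=0, k=4))
```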