ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang
University of California, Los Angeles; University of Wisconsin–Madison
arXiv (2026)
Agent RL Reasoning Benchmark

📝 Paper Summary

RL-based Agent Benchmark
ARLArena finds that stable agentic RL requires sequence-level clipping, fine-grained advantage estimation, and dynamic filtering, and proposes SAMPO, which unifies these three principles into a single stable training algorithm.
Core Problem
Agentic Reinforcement Learning is highly unstable and prone to training collapse due to the multi-turn nature of interactions, invalid actions, sparse rewards, and non-stationary dynamics.
Why it matters:
  • Instability limits scalability to larger environments and longer interaction horizons essential for complex agent tasks
  • Current training outcomes are difficult to reproduce across runs, constraining systematic algorithmic research
  • Small deviations in early decisions cascade into degenerate rollouts, making credit assignment extremely noisy
Concrete Example: In ALFWorld, tolerant clipping methods such as CISPO show rapid early gains but collapse suddenly around step 130: gradient norms explode, the valid-format ratio of actions drops sharply, and the policy does not recover.
Key Novelty
SAMPO (Stable Agentic Multi-turn Policy Optimization)
  • Decomposes policy gradient training into four dimensions: loss aggregation, importance sampling clipping, advantage design, and dynamic filtering to isolate stability factors
  • Identifies that 'tolerant' token-level clipping causes collapse while sequence-level clipping stabilizes training by constraining off-policy drift
  • Combines sequence-level clipping with fine-grained environmental advantages (unifying global and local signals) and dynamic trajectory filtering to prevent degenerate updates
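The difference between token-level and sequence-level clipping, and the idea of filtering degenerate rollout groups, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SAMPO's actual implementation: the summary does not give the paper's formulas, so the sketch assumes a length-normalized sequence ratio (the geometric mean of token ratios), a PPO-style clipped surrogate, and a filter that drops rollout groups whose rewards are all identical. All function names and the `eps` and `tol` defaults are hypothetical.

```python
import math
from statistics import mean, pstdev


def token_ratios(new_logps, old_logps):
    """Per-token importance ratios pi_new / pi_old (token-level view)."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]


def sequence_ratio(new_logps, old_logps):
    """One ratio per sequence: geometric mean of the token ratios.

    Assumption: length normalization, so long responses do not
    produce extreme ratios simply because they contain more tokens.
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate for a single ratio and advantage."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards, tol=1e-8):
    """Assumed filter: keep a rollout group only if its rewards differ.

    A group with identical rewards yields zero group-relative
    advantages and therefore no learning signal.
    """
    return pstdev(rewards) > tol


def group_advantages(rewards, tol=1e-8):
    """Group-normalized advantages (reward minus mean, over std)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + tol) for r in rewards]


# Hypothetical log-probabilities for a three-token action.
old_logps = [-1.0, -2.0, -0.5]
new_logps = [-0.9, -0.5, -0.6]

print(token_ratios(new_logps, old_logps))    # one outlier token dominates
print(sequence_ratio(new_logps, old_logps))  # a single, smoother ratio

for rewards in ([1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]):
    if keep_group(rewards):
        print(rewards, "->", group_advantages(rewards))
    else:
        print(rewards, "-> filtered out")
```

In this toy case a single token with a large log-probability shift produces a token ratio far outside the clip range, while the sequence-level ratio averages that shift over the whole action, so the entire sequence is either clipped or not as one unit.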
Evaluation Highlights
  • SAMPO achieves a 92.72% success rate on ALFWorld, outperforming the GRPO baseline (62.36%) by 30.36 percentage points
  • On Sokoban (a planning task), SAMPO reaches an 88.86% success rate, surpassing the strong GIGPO baseline (82.67%) by 6.19 percentage points
  • Outperforms proprietary models: Qwen3-4B trained with SAMPO (92.72%) beats GPT-5.2 (51.56%) and o3-based multi-agent systems (56.25%) on ALFWorld
Breakthrough Assessment
9/10
Provides a reproducible recipe for stabilizing agentic RL, which has notoriously been a 'black art'. The decomposition analysis is thorough, and the resulting method (SAMPO) shows large gains over baselines, including a 30.36-percentage-point improvement over GRPO on ALFWorld.