
Part II: ROLL Flash -- Accelerating RLVR and Agentic Training with Asynchrony

Han Lu, Zichen Liu, Shaopan Xiong, Yancheng He, Wei Gao, Yanan Wu, Weixun Wang, Jiashun Liu, Yang Li, Haizhou Zhao, Ju Huang, Siran Yang, Xiaoyang Li, Yijia Luo, Zihe Liu, Ling Pan, Junchi Yan, Wei Wang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
Alibaba Group, Shanghai Jiao Tong University, Hong Kong University of Science and Technology
arXiv (2025)
RL · Agent · Reasoning

📝 Paper Summary

RL Post-Training · Efficiency · Large Language Model Systems
ROLL Flash accelerates reinforcement learning post-training by decoupling rollout and training stages into asynchronous parallel processes, using off-policy algorithms to maintain stability while eliminating idle time caused by long-tail responses.
Core Problem
Synchronous RL training suffers from severe GPU underutilization because the training step must wait for the rollout stage to finish, which is bottlenecked by the longest (long-tail) responses in the batch.
Why it matters:
  • Rollout accounts for over 70% of total training time in RL post-training
  • Long-tail responses can be 20x longer than the median, causing massive idle bubbles where GPUs sit waiting
  • Scaling GPU count in synchronous settings yields diminishing returns because decoding is memory-bandwidth bound and stragglers stall the entire cluster
Concrete Example: If one mathematical reasoning prompt in a batch requires generating 20k tokens while the others finish at around 1k, a synchronous setup leaves all training GPUs idle until the 20k-token generation completes, wasting most of the cluster's compute capacity.
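The idle bubble in the example above can be quantified with a back-of-envelope model. The sketch below uses hypothetical numbers (8 rollout workers, one 20k-token straggler, the rest at 1k tokens) and assumes decode time is proportional to response length:

```python
# Illustrative model of the synchronous rollout bubble (numbers are
# hypothetical, chosen to match the 20k-vs-1k example above).
response_lengths = [1000] * 7 + [20000]  # tokens generated per worker

# In a synchronous step, every worker waits for the slowest one.
step_time = max(response_lengths)        # wall-clock ∝ longest response
useful_time = sum(response_lengths)      # total productive decode work
utilization = useful_time / (step_time * len(response_lengths))

print(f"step bounded by {step_time} tokens, utilization = {utilization:.1%}")
# With these numbers the cluster does useful work for only ~17% of the step.
```

The straggler alone dictates the step time, so adding more synchronous workers barely helps: the denominator grows while the max stays fixed.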
Key Novelty
Asynchronous Producer-Consumer Architecture for RLVR
  • Decouples the 'rollout' (production of data) and 'training' (consumption of data) into separate, continuously running worker pools connected by a sample buffer
  • Introduces 'Async Ratio' to bound how far the rollout policy is allowed to lag behind the training policy, balancing throughput with data freshness
  • Utilizes fine-grained parallelism (prompt replication, queue scheduling) to overlap generation, environment interaction, and reward computation
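The producer-consumer loop with a bounded async ratio can be sketched in a few lines. This single-threaded simulation is an illustrative assumption, not ROLL Flash's actual API: rollout workers keep generating with slightly stale weights and re-sync only when they fall more than `async_ratio` policy versions behind the trainer.

```python
from collections import deque

class AsyncPipeline:
    """Single-threaded sketch of an async producer-consumer RL loop with a
    bounded async ratio. Names and structure are illustrative, not the
    ROLL Flash API."""

    def __init__(self, async_ratio=2, batch_size=4):
        self.async_ratio = async_ratio
        self.batch_size = batch_size
        self.buffer = deque()       # sample buffer between the two stages
        self.rollout_version = 0    # policy weights held by rollout workers
        self.train_version = 0      # trainer's current policy version
        self.max_staleness = 0      # worst staleness seen at consumption

    def rollout_step(self):
        # Pull fresh weights only when the lag exceeds the async ratio;
        # otherwise keep generating with the (slightly stale) policy.
        if self.train_version - self.rollout_version > self.async_ratio:
            self.rollout_version = self.train_version
        self.buffer.append({"policy_version": self.rollout_version})

    def train_step(self):
        batch = [self.buffer.popleft() for _ in range(self.batch_size)]
        for sample in batch:
            staleness = self.train_version - sample["policy_version"]
            self.max_staleness = max(self.max_staleness, staleness)
            # An off-policy correction (e.g. importance weighting) would
            # account for this staleness in the loss.
        self.train_version += 1

pipe = AsyncPipeline()
for _ in range(20):
    for _ in range(pipe.batch_size):
        pipe.rollout_step()   # rollout runs ahead, filling the buffer
    pipe.train_step()         # trainer consumes a batch, bumps the version
print("max staleness observed:", pipe.max_staleness)
```

The invariant this enforces is the trade-off the paper describes: a larger `async_ratio` lets rollout run further ahead (higher throughput, fewer weight syncs) at the cost of staler, more off-policy data.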
Evaluation Highlights
  • Achieves a 7.6x speedup over the synchronous baseline with 8 GPUs on the Qwen3-8B-Think model
  • Attains 2.24x higher throughput than the synchronous baseline at 128-GPU scale for Qwen3-8B-Base
  • Delivers 2.72x speedup on ALFWorld and 1.81x on SWE-bench agentic tasks compared to synchronous execution
Breakthrough Assessment
9/10
Addresses the primary bottleneck (rollout latency) in scaling RLHF/RLVR. The shift to asynchronous training with stability guarantees is a critical system-level optimization for efficient large-scale post-training.