
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor Killian, Aviral Kumar
Carnegie Mellon University
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Scaling Laws for Reinforcement Learning (RL) Post-Training of LLMs
This paper establishes scaling laws for RL post-training of LLMs, prescribing how to optimally distribute a fixed sampling compute budget among parallel rollouts per prompt, problems per batch, and sequential update steps.
Core Problem
While pre-training has established scaling laws, practitioners lack a concrete workflow for allocating sampling compute during LLM RL post-training (e.g., choosing between more rollouts per prompt vs. more prompts per batch).
Why it matters:
  • RL scaling behavior is poorly understood due to the tight coupling between data collection (exploration) and optimization
  • Practitioners waste resources guessing hyperparameters like rollout counts (n) and batch sizes without knowing which trade-offs maximize performance under a fixed budget
Concrete Example: With a budget of 1,000 rollouts, a practitioner might naively set n=4 rollouts per prompt and train for many steps, but the paper shows that increasing n to 64 (and training for fewer steps) significantly improves coverage on hard problems.
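To see why larger n helps coverage, here is a minimal sketch (not from the paper) that models each rollout on a problem as an independent Bernoulli trial with success probability p; coverage (pass@n) is then 1 - (1 - p)^n. The probability value is hypothetical, chosen to illustrate a hard problem.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent rollouts solves the problem."""
    return 1.0 - (1.0 - p) ** n

# Hypothetical hard problem: a single rollout rarely succeeds (p = 0.02).
p_hard = 0.02
for n in (4, 64):
    print(f"n={n:3d}  coverage={coverage(p_hard, n):.3f}")
```

Under this toy model, n=4 finds a solution less than 8% of the time, while n=64 does so roughly 73% of the time, which is the coverage effect the example describes.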
Key Novelty
IsoCompute Scaling Laws for RL
  • Frames RL scaling as a constrained optimization problem over three resources: parallel rollouts per problem (n), problems per batch (Bp), and sequential update steps (M)
  • Identifies that the optimal number of rollouts (n) grows sigmoidally with the total compute budget, eventually saturating based on problem difficulty
  • Discovers distinct scaling mechanisms: easier problems benefit from larger n for solution sharpening (robustness), while harder problems require larger n for coverage (finding rare solutions)
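The constrained-optimization framing above can be sketched in a few lines. This is a hedged illustration, not the paper's actual procedure: it treats total sampling compute as roughly C = n × Bp × M (in rollouts), enumerates feasible allocations, and models the reported sigmoidal growth of the optimal n with a toy logistic curve whose parameters (n_max, slope, midpoint) are hypothetical.

```python
import math
from itertools import product

def feasible_allocations(budget: int, choices=(1, 2, 4, 8, 16, 32, 64)):
    """Enumerate (n, Bp, M) triples whose total rollout count exactly meets the budget."""
    return [(n, bp, m) for n, bp, m in product(choices, repeat=3)
            if n * bp * m == budget]

def sigmoid_n_opt(budget: float, n_max=512.0, k=1.0, mid=math.log(1e5)):
    """Toy sigmoidal trend: optimal n grows with log-budget and saturates at n_max."""
    return n_max / (1.0 + math.exp(-k * (math.log(budget) - mid)))

print(feasible_allocations(4096)[:5])   # a few (n, Bp, M) triples with n*Bp*M = 4096
print(round(sigmoid_n_opt(1e7)))        # toy optimal n at a large budget
```

The saturation ceiling n_max here echoes the difficulty-dependent plateau the paper identifies; in practice the curve would be fit to measured performance rather than assumed.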
Evaluation Highlights
  • Allocating more parallel rollouts (n) generally outperforms training longer (M) as compute increases, up to a saturation point (e.g., n=512 for Easy tasks)
  • The optimal rollout count (n) is smaller for harder problems than for easy ones: rather than wasting compute on prompts that remain unsolved, the budget is better spent on more sequential updates (M)
  • Square-root learning rate scaling (η ∝ √B) enables faster convergence and better stability than linear or constant scaling across batch sizes
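The square-root rule in the last bullet can be written as a one-line helper. This is a minimal sketch of the η ∝ √B relationship; the reference learning rate and batch size below are hypothetical placeholders, not values from the paper.

```python
import math

def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Square-root learning-rate scaling relative to a reference (base_lr, base_batch)."""
    return base_lr * math.sqrt(batch / base_batch)

# A 4x larger batch yields a 2x learning rate (linear scaling would give 4x).
print(scaled_lr(1e-6, 64, 256))
```

Compared with linear scaling (η ∝ B), the square-root rule raises the learning rate more conservatively as the batch grows, which is consistent with the stability advantage the summary reports.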
Breakthrough Assessment
9/10
Provides the first comprehensive, empirically validated 'playbook' for RL compute allocation, offering predictive scaling laws analogous to Chinchilla for pre-training. The 120,000 GPU-hour scale is significant.