ScaleRL: The authors' proposed RL training recipe, combining PipelineRL, the CISPO loss, forced length interruptions (truncating overlong generations rather than penalizing them), and specific normalization techniques
CISPO: Clipped Importance Sampling Policy Optimization—a loss function combining truncated importance sampling with vanilla policy gradient
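This combination can be sketched in a few lines of plain Python. The function name, batch layout, and `eps_max` default below are illustrative, not from the source; in a real autograd framework the importance weight `rho` would be detached (stop-gradient) so that gradients flow only through the new log-probabilities.

```python
import math

def cispo_loss(logps_new, logps_old, advantages, eps_max=2.0):
    """Sketch of a CISPO-style loss: truncated IS weight times a
    vanilla policy-gradient (REINFORCE) term, averaged over tokens."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        rho = min(math.exp(lp_new - lp_old), eps_max)  # truncated IS ratio
        total += -rho * adv * lp_new                   # PG term, scaled by rho
    return total / len(logps_new)
```

Unlike PPO-style clipping, the clipped weight still multiplies the full gradient term, so tokens outside the trust region contribute a bounded (rather than zero) update.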
DAPO: An RL algorithm using asymmetric clipping to manage updates, often used to prevent entropy collapse
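The asymmetric ("clip-higher") surrogate can be sketched as follows; the epsilon values are the commonly cited DAPO defaults and should be treated as illustrative rather than authoritative here.

```python
def dapo_clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    # PPO-style clipped surrogate, but with a looser upper bound than lower
    # bound, so low-probability tokens can still gain probability mass --
    # the mechanism credited with mitigating entropy collapse.
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped * advantage)
```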
PipelineRL: An asynchronous RL setup where generators stream data continuously while trainers update weights, reducing GPU idle time
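The producer/consumer structure can be sketched with a queue and a shared weight version; everything below (names, queue size, batch count) is a toy illustration of the asynchrony, not the paper's implementation.

```python
import queue
import threading

def pipeline_rl_sketch(num_batches=5):
    # Generators stream rollouts into a bounded queue while the trainer
    # consumes them and bumps a shared weight version in flight; generation
    # never stalls waiting for a full synchronous batch (less GPU idle time).
    rollouts = queue.Queue(maxsize=4)
    weights = {"version": 0}

    def generator():
        for i in range(num_batches):
            # Each rollout records the weight version it was sampled under
            rollouts.put(("rollout", i, weights["version"]))

    t = threading.Thread(target=generator)
    t.start()
    consumed = []
    for _ in range(num_batches):
        consumed.append(rollouts.get())  # trainer consumes as data arrives
        weights["version"] += 1          # in-flight weight update
    t.join()
    return weights["version"], consumed
```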
GRPO: Group Relative Policy Optimization—a baseline RL method that normalizes rewards within a group of generations for the same prompt
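The group normalization can be sketched as below; whether the baseline divides by the population or sample standard deviation, and the epsilon guard, are assumptions of this sketch.

```python
import statistics

def grpo_advantages(rewards):
    # Advantage for each of the G responses to one prompt:
    # (reward - group mean) / group std, so the group is its own baseline.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; eps avoids /0
    eps = 1e-8
    return [(r - mu) / (sigma + eps) for r in rewards]
```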
Sigmoidal Scaling: Modeling performance as R(C) = A / (1 + (C_mid / C)^B), where A is the asymptotic performance, B is the scaling exponent, and C_mid is the compute at which R reaches half the asymptote
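A quick numeric check of the curve (parameter values below are illustrative, not fitted values from the paper):

```python
def sigmoidal_fit(C, A, B, C_mid):
    # R(C) = A / (1 + (C_mid / C)**B): a saturating compute-performance
    # curve that approaches the asymptote A as compute C grows.
    return A / (1.0 + (C_mid / C) ** B)
```

At C = C_mid the ratio term equals 1, so R is exactly A/2; for C much larger than C_mid, R approaches A.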
FP32: 32-bit floating point precision, found to be critical for logit computation to prevent numerical instability
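One way to see why reduced precision at the logits is risky: bfloat16 keeps only 8 mantissa bits, so nearby logits collapse to the same value. The truncation below only approximates bf16 (real hardware rounds to nearest rather than truncating), but it illustrates the loss of resolution.

```python
import struct

def truncate_to_bf16(x):
    # Approximate bfloat16 by zeroing the low 16 bits of a float32:
    # same exponent range, only 8 mantissa bits of resolution remain.
    bits = struct.unpack("<I", struct.pack("<f", float(x)))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
```

Two logits that differ at the third decimal place, e.g. 10.0 and 10.001, become indistinguishable after truncation, which distorts the softmax and the importance ratios computed from it.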
Zero-variance filtering: Removing prompts from the loss calculation where all generated responses yield the same reward (zero advantage)
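A sketch of the filter, assuming rewards are grouped per prompt (the data layout here is illustrative):

```python
def filter_zero_variance(prompt_groups):
    # Keep only prompts whose sampled responses received differing rewards;
    # if all rewards are identical, every advantage is zero and the prompt
    # contributes no learning signal, only wasted compute.
    return {prompt: rewards for prompt, rewards in prompt_groups.items()
            if len(set(rewards)) > 1}
```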
No-Positive-Resampling: A curriculum strategy where prompts are permanently removed from training once the model achieves a high pass rate (>= 0.9)
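The pruning step can be sketched as follows (function and variable names are illustrative; the 0.9 threshold is from the source):

```python
def prune_solved_prompts(pool, pass_rates, threshold=0.9):
    # Permanently drop prompts the model already passes >= 90% of the time:
    # near-solved prompts yield mostly zero-advantage groups, so resampling
    # them spends generation compute for little gradient signal.
    return [p for p in pool if pass_rates.get(p, 0.0) < threshold]
```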