GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by normalizing rewards across a group of outputs generated from the same prompt, avoiding the need for a separate value network
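The group-relative advantage estimate described above can be sketched in a few lines; the function name and the binary-reward example are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward is normalized by the
    mean and standard deviation of its group (all outputs sampled
    from the same prompt), so no value network is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 4 sampled outputs for one prompt, binary correctness rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct outputs receive positive advantages and incorrect ones negative, with the group mean serving as the baseline.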
Pass@1: A metric measuring the percentage of problems where the model's first generated answer is correct
Test Loss: Defined in this paper as 1 - (Correct Answers / Total Answers), serving as a proxy for RL reward minimization
FLOPs: Floating Point Operations—a count of the total floating-point operations performed, used as a measure of computational cost (not to be confused with FLOP/s, operations per second, a measure of hardware throughput)
CoT: Chain-of-Thought—a prompting strategy that encourages models to generate intermediate reasoning steps
Data Reuse: The strategy of training on the same data samples multiple times (epochs) rather than using new unique samples
Saturation: The phenomenon where increasing model size yields diminishing improvements in learning efficiency
Learning Efficiency k(N): A term in the paper's power-law equation representing how effectively a model of size N converts compute/data into loss reduction
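To make the roles of saturation and k(N) concrete, here is a toy sketch of a saturating loss curve in which k acts as a rate constant; the exponential functional form and parameter values are illustrative assumptions, not the paper's fitted equation:

```python
import math

def loss_curve(D, k, L0=1.0, Lmin=0.1):
    """Hypothetical saturating loss curve: starting loss L0 decays
    toward an irreducible floor Lmin as data/compute D grows, at a
    rate set by the learning efficiency k. Illustrative form only."""
    return Lmin + (L0 - Lmin) * math.exp(-k * D)

# A model with higher learning efficiency reaches low loss with less data.
fast = loss_curve(10.0, k=0.5)
slow = loss_curve(10.0, k=0.1)
```

Saturation then corresponds to k(N) growing sublinearly (or plateauing) in model size N, so doubling N yields less than a doubling of effective learning speed.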
VeRL: A large-scale reinforcement learning training framework for LLMs, used to run the paper's experiments