Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning

Zelin Tan, Hejia Geng, Mulei Zhang, Xiaohan Yu, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, Zhong-Zhi Li, Zaibin Zhang, Guibin Zhang, Chen Zhang, Zhenfei Yin, Lei Bai
University of Science and Technology of China, Shanghai AI Laboratory, University of Oxford
arXiv.org (2025)
RL Reasoning Benchmark

📝 Paper Summary

Scaling Laws · Reinforcement Learning (RL) Post-Training · Mathematical Reasoning
This study establishes empirical scaling laws for RL post-training in reasoning, revealing that while larger models learn more efficiently per unit of compute, this efficiency saturates, and that reusing data is effective when unique samples are scarce.
Core Problem
While pre-training scaling laws are well-understood, it remains unclear how to optimally scale reinforcement learning post-training for reasoning tasks with respect to model size, data volume, and compute budget.
Why it matters:
  • Allocating scarce computational resources for post-training requires precise guidelines on whether to scale model size, data, or training steps
  • Understanding saturation points prevents wasting compute on larger models that yield diminishing returns in efficiency
  • High-quality reasoning data is often limited, making data reuse strategies critical for practical applications
Concrete Example: Under a fixed compute budget, simply using the largest model (72B) is not always optimal. A mid-sized model (32B) trained for more steps can initially outperform the 72B model, producing a 'crossover' effect: the larger model's efficiency gains saturate and no longer justify its extra per-step compute cost.
Key Novelty
Predictive Scaling Laws with Efficiency Saturation
  • Formulates a log-linear power law linking test loss to compute and data, introducing a model-size-dependent saturation term k(N) that captures how efficiency gains diminish as model size increases
  • Empirically validates that repeating small datasets (data reuse) is nearly as effective as using unique samples up to a threshold, challenging the assumption that unique data is strictly necessary for RL fine-tuning
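The saturation idea above can be illustrated with a toy sketch. The paper's exact functional form is not reproduced here; the exponential-saturation shape of k(N), and the constants k_inf, n0, and l0, are all illustrative assumptions chosen only to show the qualitative behavior: the learning-efficiency exponent k(N) grows with model size N but flattens, so loss falls faster with compute for larger models up to a point of diminishing returns.

```python
import math

def efficiency_exponent(n_params, k_inf=0.5, n0=10e9):
    # Hypothetical saturating form: k(N) rises with model size N
    # but flattens toward k_inf, so scaling N beyond ~n0 buys
    # little additional learning efficiency.
    return k_inf * (1 - math.exp(-n_params / n0))

def predicted_test_loss(compute, n_params, l0=2.0):
    # Log-linear power law in compute: loss decays as C^{-k(N)},
    # i.e. log-loss is linear in log-compute with slope -k(N).
    return l0 * compute ** (-efficiency_exponent(n_params))

# Efficiency saturates: the 32B and 72B exponents are nearly
# equal, while 0.5B and 32B differ substantially.
for n in (0.5e9, 32e9, 72e9):
    print(f"N={n / 1e9:>5.1f}B  k(N)={efficiency_exponent(n):.3f}")
```

With these toy constants, the gap between k(32B) and k(72B) is an order of magnitude smaller than the gap between k(0.5B) and k(32B), mirroring the reported crossover: a 32B model trained longer can beat a 72B model at equal compute.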
Evaluation Highlights
  • Data reuse factor τ ≤ 25 maintains test loss performance comparable to using unique data, whereas τ = 100 leads to overfitting.
  • RL fine-tuned Qwen2.5-32B and 72B models match or surpass the dense Qwen3 counterparts of similar size on held-out math problems.
  • Predictive scaling laws fitted on 0.5B–32B models accurately forecast the learning efficiency of the 72B model.
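The data-reuse finding can be turned into a small planning check. The thresholds (τ ≤ 25 is safe, τ = 100 overfits) come from the summary above; the helper names and the intermediate "caution" band between them are assumptions for illustration.

```python
def reuse_factor(total_samples, unique_samples):
    # tau = how many times each unique sample is seen on average
    # over the whole RL run.
    return total_samples / unique_samples

def reuse_regime(tau, safe_limit=25, overfit_limit=100):
    # Thresholds follow the reported results: tau <= 25 matched
    # unique-data performance, tau = 100 overfit. The band in
    # between is labeled "caution" as an assumption.
    if tau <= safe_limit:
        return "safe"
    if tau < overfit_limit:
        return "caution"
    return "overfit"

# E.g. 2M RL samples drawn from 100k unique prompts -> tau = 20.
tau = reuse_factor(2_000_000, 100_000)
print(tau, reuse_regime(tau))
```

A check like this makes the practical takeaway concrete: with a scarce high-quality math set, budget total RL samples so that τ stays at or below roughly 25.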
Breakthrough Assessment
8/10
Provides the first comprehensive scaling-law formulation for RL post-training in reasoning, identifying critical saturation effects and validating data reuse. The finding that larger models' efficiency gains saturate is a significant departure from the simple 'bigger is better' pattern of pre-training laws.