
P² Law: Scaling Law for Post-Training After Model Pruning

Xiaodong Chen, Yuxuan Hu, Xiaokang Zhang, Yanling Wang, Cuiping Li, Hong Chen, Jing Zhang
Renmin University of China, Zhipu AI
arXiv (2024)

📝 Paper Summary

Tags: Model Pruning · Scaling Laws · Post-training Efficiency
P2Law is a scaling law that predicts the post-training loss of pruned LLMs based on model size, dataset size, and pruning rate, enabling cost-effective performance recovery.
Core Problem
Post-training is essential to recover performance in pruned LLMs, but there is no method to determine the optimal amount of data required, leading to either resource waste or insufficient recovery.
Why it matters:
  • Continual pre-training on large datasets is resource-intensive; knowing the saturation point saves significant compute
  • Existing scaling laws (like Chinchilla) do not account for pruning rates or the starting state of a pruned model
  • Developers need to balance the trade-off between post-training cost and the final performance of compressed models
Concrete Example: For a Llama-3.2-1B model, blindly using hundreds of billions of tokens for post-training might yield negligible gains after a certain point. P2Law predicts this saturation point (e.g., around 10^4 compute units) to stop training early without performance loss.
Key Novelty
P2Law (Post-Training Pruning Law)
  • Extends Chinchilla scaling laws by incorporating 'pruning rate' and 'pre-pruning loss' as fundamental variables
  • Identifies that, for a given post-training token budget, smaller pruned models recover performance faster than larger ones (Trend 1)
  • Proposes Average Slope Difference (ASD) as a metric for evaluating scaling-law fits, prioritizing the accuracy of the loss curve's slope (convergence rate) over absolute loss values
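The two ideas above can be sketched in a few lines. Note the caveats: the additive Chinchilla-like parameterization of `p2law_loss` below (and every coefficient in it) is an illustrative assumption, not the paper's exact equation; the ASD implementation assumes "slope" means the finite-difference slope of the loss curve over token count, as the summary's description suggests.

```python
import numpy as np

def p2law_loss(D, N, rho, L0, A=400.0, alpha=0.34, B=410.0, beta=0.28, gamma=2.0):
    """Hypothetical P2Law-style predictor: post-training loss from
    post-training token count D, pruned model size N, pruning rate rho,
    and pre-pruning loss L0. The additive form and coefficients are
    illustrative assumptions, not the paper's fitted equation."""
    return L0 + A / N**alpha + B / D**beta + gamma * rho

def average_slope_difference(predicted, observed, tokens):
    """ASD sketch: mean absolute difference between the finite-difference
    slopes of the predicted and observed loss curves, so a constant
    vertical offset between the curves contributes no error."""
    slopes_pred = np.diff(predicted) / np.diff(tokens)
    slopes_obs = np.diff(observed) / np.diff(tokens)
    return float(np.mean(np.abs(slopes_pred - slopes_obs)))
```

Under this reading, ASD is zero for two curves that differ only by a constant offset, which is exactly why it rewards matching the convergence rate rather than absolute loss values.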
Evaluation Highlights
  • P2Law accurately fits post-training loss curves for Llama-3 and Qwen-2.5 models across depth, width, and 2:4 semi-structured pruning methods
  • Effectively generalizes to predict the loss of a 3B parameter model using only data derived from 0.5B and 1.5B models
  • Accurately predicts loss curves for higher pruning rates (e.g., 0.35) based on fits from lower rates (0.15, 0.25)
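The fit-then-extrapolate workflow behind these highlights can be sketched as follows, here for the pruning-rate direction: fit a law's coefficients on loss curves measured at low pruning rates, then predict the full curve at a higher rate. The functional form and the synthetic "measured" curves are stand-in assumptions; real inputs would be loss curves from actual post-training runs.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(X, B, beta, gamma, E):
    """Hypothetical P2Law-style form over token count D and pruning
    rate rho (assumed, not the paper's exact equation)."""
    D, rho = X
    return E + B / D**beta + gamma * rho

def synthetic_loss(D, rho):
    """Synthetic stand-in for measured post-training loss curves."""
    return loss_law((D, rho), 410.0, 0.28, 2.0, 1.7)

D = np.geomspace(1e8, 1e10, 20)          # post-training token counts
fit_rates = [0.15, 0.25]                  # low pruning rates used for fitting
X = np.concatenate(
    [np.stack([D, np.full_like(D, r)]) for r in fit_rates], axis=1
)
y = synthetic_loss(X[0], X[1])

# Fit the law's coefficients on the low-pruning-rate curves ...
popt, _ = curve_fit(loss_law, X, y, p0=[100.0, 0.3, 1.0, 1.0], maxfev=20000)

# ... then extrapolate the whole loss curve at a higher rate (0.35),
# mirroring the generalization experiment described above.
pred_035 = loss_law((D, np.full_like(D, 0.35)), *popt)
```

The same recipe applies to the size direction (fit on 0.5B and 1.5B curves, predict 3B) by making model size, rather than pruning rate, the extrapolated variable.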
Breakthrough Assessment
8/10
Establishes the first formal scaling law for the post-training of pruned models. While niche compared to general pre-training laws, it provides a critical tool for efficient model compression pipelines.