P$^2$ Law: Scaling Law for Post-Training After Model Pruning

📝 Paper Summary

Model Pruning, Scaling Laws, Post-training efficiency

P2Law is a scaling law that predicts the post-training loss of pruned LLMs based on model size, dataset size, and pruning rate, enabling cost-effective performance recovery.

Core Problem

Post-training is essential to recover performance in pruned LLMs, but there is no method to determine the optimal amount of data required, leading to either resource waste or insufficient recovery.

Why it matters:

Continual pre-training on large datasets is resource-intensive; knowing the saturation point saves significant compute
Existing scaling laws (like Chinchilla) do not account for pruning rates or the starting state of a pruned model
Developers need to balance the trade-off between post-training cost and the final performance of compressed models

Concrete Example: For a Llama-3.2-1B model, blindly using hundreds of billions of tokens for post-training might yield negligible gains after a certain point. P2Law predicts this saturation point (e.g., around 10^4 compute units) to stop training early without performance loss.
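
To illustrate how a predicted loss curve could be turned into a stopping decision, the minimal sketch below returns the first checkpoint whose following step improves the loss by less than a tolerance. The function name `saturation_point`, the tolerance, and the sample numbers are hypothetical and are not taken from the paper.

```python
def saturation_point(compute, loss, tol=0.01):
    """Return the compute value of the first checkpoint whose following
    step lowers the loss by less than `tol`; fall back to the last point."""
    for i in range(1, len(loss)):
        if loss[i - 1] - loss[i] < tol:
            return compute[i - 1]
    return compute[-1]


# Made-up predicted curve: large gains early, almost none after 1e4.
compute = [1e2, 1e3, 1e4, 1e5]
loss = [3.0, 2.6, 2.45, 2.445]
stop_at = saturation_point(compute, loss)
```

In practice the tolerance would be chosen from the cost of extra tokens versus the value of a small loss reduction, which this sketch does not model.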

Key Novelty

P2Law (Post-Training Pruning Law)

Extends Chinchilla scaling laws by incorporating 'pruning rate' and 'pre-pruning loss' as fundamental variables
Identifies that smaller pruned models recover performance faster than larger ones relative to token count (Trend 1)
Proposes Average Slope Difference (ASD) as a metric for scaling laws, prioritizing the accuracy of the loss curve's slope (convergence rate) over absolute loss values

Evaluation Highlights

P2Law accurately fits post-training loss curves for Llama-3 and Qwen-2.5 models across depth, width, and 2:4 semi-structured pruning methods
Effectively generalizes to predict the loss of a 3B parameter model using only data derived from 0.5B and 1.5B models
Accurately predicts loss curves for higher pruning rates (e.g., 0.35) based on fits from lower rates (0.15, 0.25)

Breakthrough Assessment

8/10

Establishes the first formal scaling law for the post-training of pruned models. While niche compared to general pre-training laws, it provides a critical tool for efficient model compression pipelines.

⚙️ Technical Details

Problem Definition

Setting: Predicting the scalar loss value L of a pruned LLM during post-training

Inputs: Model size N, Number of post-training tokens D, Pruning rate rho, Pre-pruning loss L_0

Outputs: Predicted post-training loss L
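
The mapping from these four inputs to a predicted loss can be pictured with a toy function. This summary does not reproduce the fitted functional form of P2Law, so the expression below is only an illustrative Chinchilla-style stand-in: the name `toy_post_training_loss`, the shape of the formula, and all constants are assumptions chosen so that the loss starts above L_0, grows with the pruning rate, and shrinks with more post-training tokens.

```python
def toy_post_training_loss(n_params, n_tokens, rho, l0,
                           a=50.0, alpha=0.3, b=50.0, beta=0.3, gamma=1.0):
    """Toy stand-in, NOT the paper's fitted law: pre-pruning loss plus a
    gap that grows with the pruning rate and shrinks with more tokens."""
    gap = (rho ** gamma) * (a / n_params ** alpha + b / n_tokens ** beta)
    return l0 + gap


l0 = 2.0  # hypothetical pre-pruning loss
few = toy_post_training_loss(1e9, 1e8, rho=0.25, l0=l0)
many = toy_post_training_loss(1e9, 1e9, rho=0.25, l0=l0)
heavier = toy_post_training_loss(1e9, 1e9, rho=0.35, l0=l0)
```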

Pipeline Flow

Pruning Phase: Apply Depth/Width/Semi-structured pruning to base LLM
Data Selection: Sample subset of pre-training data (SlimPajama)
Post-Training Phase: Train pruned model to recover performance
Curve Fitting: Fit P2Law parameters to observed loss curves

System Modules

Pruner

Reduce model size via removal of layers (depth), channels (width), or weights (2:4)

Model or implementation: Llama-3 or Qwen-2.5 variants

Post-Trainer

Fine-tune the pruned model to mitigate performance degradation

Model or implementation: Pruned Model

Modeling

Base Model: Llama-3 (1B, 3B, 8B) and Qwen-2.5 (0.5B, 1.5B, 3B)

Training Method: Continual Pre-training (Post-training)

Objective Functions:

Purpose: Standard language modeling loss.

Formally: Next-token prediction cross-entropy loss.
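
As a concrete reading of this objective, the sketch below computes the mean next-token cross-entropy from raw logits in plain Python. The function name and the list-of-lists input layout are illustrative assumptions; a real training run would use a framework implementation on tensors.

```python
import math


def next_token_cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target token at each position.
    logits: one list of vocabulary scores per position; targets: token ids."""
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)


# With uniform scores over 4 tokens the loss equals log(4).
uniform = next_token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```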

Trainable Parameters: All parameters (for structured pruning) or Masked updates (for 2:4 pruning)
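
One common way to realize masked updates is to re-apply the sparsity mask after every optimizer step so pruned positions stay at zero. The sketch below assumes plain SGD for brevity (the optimizer actually used is not stated in this summary), and `masked_sgd_step` is a hypothetical helper name.

```python
def masked_sgd_step(weights, grads, mask, lr=2e-5):
    """Plain SGD update followed by re-applying the sparsity mask, so
    pruned positions (mask == 0) stay exactly zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]


weights = [0.5, 0.0, -0.3, 0.0]   # already 2:4 sparse
grads = [1.0, 2.0, -1.0, 4.0]     # dense gradient from backprop
mask = [1, 0, 1, 0]               # 1 = kept weight, 0 = pruned weight
updated = masked_sgd_step(weights, grads, mask)
```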

Training Data:

Random selection from SlimPajama dataset
0.5B tokens for smaller models (1B, 0.5B, 1.5B)
1B tokens for larger models (3B, 8B)

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 262k tokens
gpu_configuration: 4 Nvidia A800-80G and 4 Nvidia A6000-48G
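
For a sense of scale, the token budgets and batch size above translate into optimizer steps as follows. The sketch assumes that "262k" means 2**18 = 262,144 tokens per batch, which is a guess rather than a figure confirmed by the summary.

```python
import math

BATCH_TOKENS = 262_144  # assumption: "262k" read as 2**18 tokens


def optimizer_steps(total_tokens, batch_tokens=BATCH_TOKENS):
    """Number of batches needed to consume `total_tokens` (the last may be partial)."""
    return math.ceil(total_tokens / batch_tokens)


steps_small = optimizer_steps(500_000_000)    # 0.5B-token budget -> 1908
steps_large = optimizer_steps(1_000_000_000)  # 1B-token budget -> 3815
```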

Compute: Total training process took 500 hours across all experiments

Comparison to Prior Work

vs. Chinchilla Scaling Law: P2Law adds pruning rate (rho) and pre-pruning loss (L_0) as variables
vs. LLM-Streamline: P2Law provides a mathematical formula to predict exact loss values rather than just empirical observations

Limitations

Width pruning on Llama-3.1-8B behaved anomalously: it performed worse than depth pruning, unlike in the other models, and the law fits this specific case poorly
Experiments limited to Llama-3 and Qwen-2.5 families; generalization to other architectures not tested
Maximum post-training data size tested was 1B tokens; behavior at trillion-token scale is extrapolated

Reproducibility

Pruning and post-training used standard datasets (SlimPajama) and open models (Llama-3, Qwen-2.5). Code is not provided, but hyperparameters are detailed. Pruning methods are standard (SparseGPT, etc.).

📊 Experiments & Results

Evaluation Setup

Curve fitting error evaluation on post-training loss trajectories

Benchmarks:

SlimPajama (Loss) (Language Modeling)

Metrics:

R^2 (Coefficient of determination)
Huber Loss
ASD (Average Slope Difference)

Statistical methodology: Levenberg-Marquardt algorithm for non-linear least squares fitting
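
The three metrics can be sketched in a few lines of plain Python. R^2 and Huber loss follow their standard definitions; the ASD function is one plausible reading of "average slope difference" (mean absolute gap between finite-difference slopes of the two curves) and may differ in detail from the paper's definition. All names and sample values are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y_true)


def average_slope_difference(x, y_true, y_pred):
    """Mean absolute difference between segment slopes of the two curves."""
    diffs = []
    for i in range(len(x) - 1):
        dx = x[i + 1] - x[i]
        slope_true = (y_true[i + 1] - y_true[i]) / dx
        slope_pred = (y_pred[i + 1] - y_pred[i]) / dx
        diffs.append(abs(slope_true - slope_pred))
    return sum(diffs) / len(diffs)


# A prediction that is offset by a constant has the right slope everywhere.
x = [0, 1, 2, 3]
actual = [3.0, 2.5, 2.2, 2.0]
shifted = [3.1, 2.6, 2.3, 2.1]
asd = average_slope_difference(x, actual, shifted)
r2 = r_squared(actual, shifted)
hub = huber_loss(actual, shifted)
```

The example shows why a slope-based metric is complementary to the other two: the shifted prediction is penalized by R^2 and Huber loss, while its ASD is essentially zero because the convergence rate is matched.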

Key Results

Evaluation of different parameterization candidates for P2Law shows that L1 (the proposed form) provides the best fit across metrics.

Benchmark	Metric	Baseline	This Paper	Δ
Llama-3 Depth Pruning	ASD	4.67e-6	1.89e-6	-2.78e-6
Llama-3 Depth Pruning	R^2	0.8548	0.9922	+0.1374

Experiment Figures

The fitted P2Law curves vs actual loss for Llama-3 models under depth pruning

Normalized relative post-training loss curves for 2:4 semi-structured pruning

Main Takeaways

Smaller LLMs exhibit faster convergence in post-training loss compared to larger LLMs given the same pruning rate
Relative post-training loss follows a power-law relationship with the pruning rate
The proposed P2Law (L1 parameterization) generalizes well to unseen model sizes (predicting 3B from <2B models) and larger datasets
Width pruning on Llama-3.1-8B is an outlier where depth pruning is superior, contrary to trends in other models
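
A power-law dependence on the pruning rate can be checked with a straight-line fit in log-log space. The sketch below fits on two lower rates and extrapolates to a higher one, mirroring the generalization experiment in spirit only: the data are synthetic (generated from 2.0 * rho**1.5), and `fit_power_law` is a hypothetical helper, not the paper's fitting procedure.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**k via a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    k = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - k * mx)
    return c, k


# Synthetic relative losses generated from 2.0 * rho**1.5 (not paper data).
rates = [0.15, 0.25]
rel_loss = [2.0 * r ** 1.5 for r in rates]
c, k = fit_power_law(rates, rel_loss)
pred_035 = c * 0.35 ** k  # extrapolate to an unseen, higher pruning rate
```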

📚 Prerequisite Knowledge

Prerequisites

Understanding of Neural Scaling Laws (Kaplan/Chinchilla)
Model Pruning techniques (Structured vs. Unstructured)
Post-training / Continual Pre-training

Key Terms

ASD: Average Slope Difference, a metric measuring the discrepancy between the slope of the predicted loss curve and the actual loss curve, used to ensure the scaling law captures convergence trends correctly

Depth Pruning: A structured pruning method that removes entire Transformer layers based on importance estimation
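
A minimal sketch of the layer-selection step, assuming the pruning rate is the fraction of layers removed and that a per-layer importance score is already available. Both are assumptions: the summary does not say how importance is estimated or how the rate is counted, and the scores below are made up.

```python
def layers_to_keep(importance, pruning_rate):
    """Drop the round(pruning_rate * n) least important layers and return
    the indices of the surviving layers in their original order."""
    n_drop = round(pruning_rate * len(importance))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i])
    dropped = set(ranked[:n_drop])
    return [i for i in range(len(importance)) if i not in dropped]


importance = [0.9, 0.2, 0.8, 0.1, 0.7, 0.6, 0.3, 0.95]  # hypothetical scores
kept = layers_to_keep(importance, pruning_rate=0.25)     # drops layers 3 and 1
```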

Width Pruning: A structured pruning method that reduces the number of embedding channels or attention heads

2:4 Semi-Structured Pruning: A pruning pattern where 2 out of every 4 weights in a group are zeroed, allowing for hardware acceleration
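
The 2:4 pattern can be sketched as a mask over consecutive groups of four weights. The example keeps the two largest-magnitude weights per group, which is a simple stand-in criterion chosen for illustration; methods such as SparseGPT select and adjust weights differently.

```python
def two_four_mask(weights):
    """For each consecutive group of 4 weights, keep the 2 with the largest
    magnitude (mask 1) and zero the other 2 (mask 0)."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    mask = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.0]
mask = two_four_mask(weights)                      # [1, 0, 0, 1, 0, 1, 1, 0]
pruned = [w * m for w, m in zip(weights, mask)]    # exactly 2 nonzeros per group
```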

Relative post-training loss: The difference between the pruned model's current loss and the original model's loss before pruning

Chinchilla scaling law: A widely used empirical law stating that model performance scales as a power law with model size and training data size
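
For reference, the commonly cited parametric form of the Chinchilla law writes the loss as an irreducible term plus one power-law term each for model size N and training tokens D, where E, A, B, alpha, and beta are fitted constants:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```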