Fast and Controllable Post-training Sparsity: Learning Optimal Sparsity Allocation with Global Constraint in Minutes

📝 Paper Summary

Model Compression, Post-training Sparsity

FCPTS enables fast, accurate post-training sparsity by learning optimal per-layer pruning thresholds via a differentiable bridge function based on kernel density estimation.

Core Problem

Existing post-training sparsity methods struggle with high sparsity rates because naive allocation ignores layer sensitivity, while retraining-based allocation is too slow and data-hungry for the post-training setting.

Why it matters:

Deploying deep networks on edge devices requires reducing memory and energy use without extensive retraining costs
Current post-training methods (such as POT) lose accuracy sharply once sparsity exceeds 50%, limiting their practical utility
Manual hyperparameter tuning for sparsity allocation is inefficient and cannot guarantee that a global sparsity constraint is met

Concrete Example: A naive post-training method might prune 80% of weights uniformly across all layers. If the first layer is highly sensitive, accuracy crashes. FCPTS automatically learns to prune the sensitive first layer less (e.g., 40%) and robust later layers more (e.g., 90%) to meet the global 80% target.
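The arithmetic behind this example can be checked with a short sketch. The layer sizes and the middle-layer rate below are hypothetical values chosen for illustration, not figures from the paper; the point is only that a uniform and a non-uniform allocation can reach the same global rate.

```python
# Hypothetical parameter counts for three layers (illustrative, not from the paper).
layer_sizes = [100_000, 500_000, 400_000]

def global_sparsity(rates, sizes):
    """Parameter-weighted average of the per-layer sparsity rates."""
    pruned = sum(r * n for r, n in zip(rates, sizes))
    return pruned / sum(sizes)

# Uniform allocation: every layer pruned at 80%.
uniform = global_sparsity([0.80, 0.80, 0.80], layer_sizes)

# Non-uniform allocation: sensitive first layer at 40%, robust last layer at 90%.
learned_like = global_sparsity([0.40, 0.80, 0.90], layer_sizes)

# Both allocations come out at approximately 0.80 globally.
```

Because the global rate is weighted by parameter count, sparing a small sensitive layer costs little and can be offset by pruning a large robust layer slightly harder.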

Key Novelty

Differentiable Bridge for Sparsity Allocation (FCPTS)

Establishes a mathematical bridge between the non-differentiable sparsity rate and the pruning threshold using Kernel Density Estimation (KDE)
Allows gradients to flow from the sparsity objective back to the pruning thresholds, enabling direct optimization of layer-wise sparsity rates via standard backpropagation
Optimizes sparsity allocation and weight reconstruction jointly in a net-wise manner rather than layer-by-layer, satisfying global sparsity constraints exactly
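One way such a bridge could be written is sketched below. It assumes a Gaussian KDE over the layer's weights, so the smoothed sparsity rate is the KDE probability mass in [-t, t] and its derivative with respect to t is the KDE density at t plus the density at -t. The function names and the toy unit-scale weights are assumptions for illustration; the paper's exact formulation may differ.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def soft_sparsity(t, samples, h):
    """Smoothed fraction of weights with |w| < t under a Gaussian KDE."""
    mass = sum(normal_cdf((t - s) / h) - normal_cdf((-t - s) / h) for s in samples)
    return mass / len(samples)

def soft_sparsity_grad(t, samples, h):
    """Analytic d r / d t: KDE density at t plus KDE density at -t."""
    dens = sum(normal_pdf((t - s) / h) + normal_pdf((-t - s) / h) for s in samples)
    return dens / (len(samples) * h)

random.seed(0)
# Toy "layer": 100 kernel centres drawn from a unit Gaussian (hypothetical data).
samples = [random.gauss(0.0, 1.0) for _ in range(100)]
h = 0.5  # bandwidth value quoted in this summary

# Compare the analytic gradient with a central finite difference at t = 0.7.
t, eps = 0.7, 1e-5
finite_diff = (soft_sparsity(t + eps, samples, h) - soft_sparsity(t - eps, samples, h)) / (2 * eps)
analytic = soft_sparsity_grad(t, samples, h)
```

The hard count of weights below a threshold has zero gradient almost everywhere; replacing it with the KDE mass gives a smooth, monotone function of t that backpropagation can use.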

Architecture

The FCPTS framework pipeline illustrating the flow from dense weights to sparse weights via the differentiable bridge.

Evaluation Highlights

Over 30 percentage points of Top-1 accuracy improvement for ResNet-50 on ImageNet compared to state-of-the-art methods at 80% sparsity
Achieves 70% global sparsity on ResNet-18 with accuracy on par with the dense counterpart (Top-1 Acc 69.76% vs 69.76%)
Reduces processing time to minutes (e.g., ~30 mins for ResNet-18) compared to hours or days for retraining-based methods

Breakthrough Assessment

8/10

Significantly advances post-training sparsity by making the allocation process differentiable and fast. The large accuracy gains at high sparsity levels (80%) address a major bottleneck in the field.

⚙️ Technical Details

Problem Definition

Setting: Post-training unstructured sparsity, i.e., pruning a pre-trained model to a target global sparsity rate using a small calibration dataset without full retraining.

Inputs: Pre-trained dense neural network W, calibration dataset, target global sparsity rate r_0

Outputs: Sparse neural network W_sparse satisfying the sparsity constraint

Pipeline Flow

Weight Distribution Estimation (KDE) -> Differentiable Threshold Calculation -> Sparsity Allocation Learning -> Weight Reconstruction

System Modules

KDE PDF Estimator

Estimates the probability density function of weights in a layer to enable differentiation

Model or implementation: Kernel Density Estimation with Gaussian kernel
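A minimal sketch of a Gaussian KDE density estimate follows. It assumes the n = 100 kernel centres are drawn uniformly at random from the layer's weights, which is an illustrative choice rather than a detail confirmed by this summary; the toy weights are likewise hypothetical.

```python
import math
import random

def gaussian_kde_pdf(x, samples, h):
    """Estimate the density at x using Gaussian kernels of bandwidth h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)

random.seed(0)
# Toy "layer" of 10,000 weights with unit scale (hypothetical data).
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Assumption: subsample n = 100 weights as kernel centres.
kernel_points = random.sample(weights, 100)
density_at_zero = gaussian_kde_pdf(0.0, kernel_points, h=0.5)

# Sanity check: the estimated density should integrate to about 1.
step = 0.05
grid = [-8.0 + i * step for i in range(321)]
total_mass = sum(gaussian_kde_pdf(x, kernel_points, h=0.5) for x in grid) * step
```

A larger bandwidth h gives a smoother density with more bias, while a smaller h follows the samples more closely with more variance.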

Threshold Optimizer

Learns the optimal pruning threshold t_l for each layer by minimizing control and reconstruction losses

Model or implementation: Gradient-based optimization variable
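To illustrate what treating the threshold as an optimization variable means, the toy sketch below fits a single layer's threshold by gradient descent on the control loss alone. It omits the reconstruction loss and the weight updates, uses a Gaussian-KDE smoothed sparsity rate, and all names, the learning rate, the step count, and the toy weights are assumptions for illustration rather than the paper's settings.

```python
import math
import random

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def soft_sparsity(t, samples, h):
    """Smoothed fraction of weights with |w| < t under a Gaussian KDE."""
    return sum(normal_cdf((t - s) / h) - normal_cdf((-t - s) / h) for s in samples) / len(samples)

def soft_sparsity_grad(t, samples, h):
    """Derivative of soft_sparsity with respect to the threshold t."""
    return sum(normal_pdf((t - s) / h) + normal_pdf((-t - s) / h) for s in samples) / (len(samples) * h)

def fit_threshold(samples, h, target, lr=0.5, steps=500, t0=0.1):
    """Gradient descent on L_c = (1 - r(t) / target)^2 for one threshold."""
    t = t0
    for _ in range(steps):
        r = soft_sparsity(t, samples, h)
        # Chain rule: dL_c/dt = -2 * (1 - r/target) / target * dr/dt
        grad = -2.0 * (1.0 - r / target) / target * soft_sparsity_grad(t, samples, h)
        t -= lr * grad
    return t

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100)]  # hypothetical weights
t_star = fit_threshold(samples, h=0.5, target=0.8)
reached = soft_sparsity(t_star, samples, h=0.5)  # close to the 0.8 target
```

In the full method each layer would have its own threshold and the reconstruction loss would decide how the global budget is split between layers; this sketch only shows that the threshold can be moved by gradients.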

Mask Generator

Generates binary masks based on learned thresholds

Model or implementation: Magnitude-based pruning function (Step function)
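A magnitude-based mask can be sketched in a few lines. Whether the comparison is strict (> rather than >=) is an assumption here, and the example weights are hypothetical.

```python
def magnitude_mask(weights, threshold):
    """Step function: 1 keeps a weight, 0 prunes it."""
    return [1 if abs(w) > threshold else 0 for w in weights]

def apply_mask(weights, mask):
    """Zero out the pruned weights."""
    return [w * m for w, m in zip(weights, mask)]

def mask_sparsity(mask):
    """Fraction of weights that were pruned."""
    return 1.0 - sum(mask) / len(mask)

weights = [0.8, -0.05, 0.3, -0.6, 0.01, 0.2]  # hypothetical layer weights
mask = magnitude_mask(weights, threshold=0.25)  # [1, 0, 1, 1, 0, 0]
sparse_weights = apply_mask(weights, mask)
rate = mask_sparsity(mask)                      # 0.5
```

The step function itself has no useful gradient with respect to the threshold, which is why a separate differentiable mapping from threshold to sparsity rate is needed during optimization.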

Novel Architectural Elements

Differentiable bridge connecting discrete sparsity rates to continuous pruning thresholds via KDE

Modeling

Base Model: Various standard architectures (ResNet-18/50, MobileNetV2, RegNet, ViT-Base)

Training Method: Gradient-based optimization of thresholds and weights on calibration data

Objective Functions:

Purpose: Enforce global sparsity constraint.

Formally: L_c = (1 - (sum_i(r_i * N_i) / sum_i(N_i)) / r_0)^2, where r_i and N_i are the sparsity rate and parameter count of layer i and r_0 is the target global sparsity rate

Purpose: Minimize output difference between sparse and dense models.

Formally: L_rec = D_KL(f_dense(x) || f_sparse(x))

Purpose: Combined objective.

Formally: L = L_rec + lambda * L_c
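The three objectives can be evaluated on toy numbers as in the sketch below. It assumes the model outputs are turned into probabilities with a softmax before the KL divergence is taken; the function names, logits, and layer sizes are hypothetical.

```python
import math

def control_loss(rates, sizes, target):
    """L_c = (1 - achieved_global_rate / target)^2."""
    achieved = sum(r * n for r, n in zip(rates, sizes)) / sum(sizes)
    return (1.0 - achieved / target) ** 2

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def total_loss(dense_logits, sparse_logits, rates, sizes, target, lam):
    """L = L_rec + lambda * L_c."""
    l_rec = kl_divergence(softmax(dense_logits), softmax(sparse_logits))
    l_c = control_loss(rates, sizes, target)
    return l_rec + lam * l_c

# Hypothetical example: the allocation undershoots the 0.8 target.
sizes = [100_000, 500_000, 400_000]
rates = [0.40, 0.70, 0.80]
loss = total_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9], rates, sizes, target=0.8, lam=1.0)
```

The control loss is zero only when the parameter-weighted sparsity equals r_0, so lambda trades off how tightly the target is enforced against how closely the sparse model matches the dense outputs.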

Adaptation: Post-training optimization (reconstruction)

Trainable Parameters: Pruning thresholds t_l and weight values W (fine-tuned)

Training Data:

Calibration set (limited data, e.g., small subset of ImageNet)

Key Hyperparameters:

kernel_samples_n: 100
bandwidth_h: 0.5
optimization_time: ~30 minutes (ResNet-18)

Compute: Single NVIDIA RTX 3090 GPU, ~30 minutes for ResNet-18

Comparison to Prior Work

vs. POT: FCPTS uses net-wise optimization and learns the allocation; POT is layer-wise and relies on manual or heuristic allocation
vs. STR/ProbMask: FCPTS is designed for post-training (limited data/time) and better preserves the original weight distribution; STR alters the surviving weights by subtracting the threshold from their magnitudes
vs. Uniform: FCPTS learns non-uniform allocation automatically

Limitations

Relies on the assumption that weight magnitude correlates with importance
KDE estimation introduces a trade-off between bias and variance (controlled by bandwidth h)
Requires a calibration dataset (though small), unlike data-free methods

Reproducibility

Code: https://github.com/ModelTC/FCPTS

The code is publicly available at the repository above, and the paper clearly specifies the KDE hyperparameters (n=100, h=0.5) and the optimization objective.

📊 Experiments & Results

Evaluation Setup

Image classification on standard benchmarks

Benchmarks:

ImageNet (Image Classification)
CIFAR-10/100 (Image Classification)
PASCAL VOC (Object Detection/Segmentation; mentioned as a generalization task)

Metrics:

Top-1 Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

ResNet-50 on ImageNet results showing FCPTS dominance at high sparsity levels compared to POT and retraining methods adapted for PTS.
Benchmark	Metric	Baseline	This Paper	Δ
ImageNet	Top-1 Accuracy (%)	39.42	71.06	+31.64
ImageNet	Top-1 Accuracy (%)	0.45	71.06	+70.61
ImageNet	Top-1 Accuracy (%)	76.15	71.06	-5.09

Main Takeaways

FCPTS enables high sparsity rates (e.g., 80%) where previous post-training methods (POT) fail catastrophically.
Retraining-based methods (STR, ProbMask) do not transfer well to the post-training setting due to reliance on extensive training.
The method is extremely fast, taking only minutes to optimize, making it practical for rapid deployment.
Learned sparsity allocation is non-uniform, automatically preserving sensitive layers.

📚 Prerequisite Knowledge

Prerequisites

Neural network pruning (magnitude-based)
Post-training quantization/compression concepts
Kernel Density Estimation (KDE)
Gradient descent and backpropagation

Key Terms

PTS: Post-Training Sparsity—sparsifying a model using limited data and time without full retraining

KDE: Kernel Density Estimation—a method to estimate the probability density function of a random variable, used here to make sparsity differentiable

Sparsity Allocation: Deciding what percentage of weights to prune in each specific layer of the network

Bridge Function: The mapping designed in this paper that connects the pruning threshold t to the sparsity rate r via the weight distribution's PDF

Net-wise optimization: Optimizing parameters across the entire network simultaneously, as opposed to layer-by-layer sequential optimization

Control Loss: A loss term specifically designed to force the learned global sparsity rate to match the target constraint