Uncovering a Winning Lottery Ticket with Continuously Relaxed Bernoulli Gates

Itamar Tsayag, Ofir Lindenbaum
arXiv (2026)
📝 Paper Summary

Tags: Neural Network Pruning · Sparse Neural Networks
This paper proposes a method to discover strong lottery tickets—sparse subnetworks that perform well without weight training—by optimizing continuously relaxed Bernoulli gates while keeping the original network weights frozen.
Core Problem
Finding 'strong lottery tickets' (sparse subnetworks that work well without training) currently relies on the Edge-Popup algorithm, which uses non-differentiable score-based selection, leading to inefficient optimization and poor scalability.
Why it matters:
  • Over-parameterized models incur prohibitive memory and computational costs, limiting deployment on resource-constrained devices
  • Current methods like Edge-Popup struggle to scale to larger architectures due to reliance on non-differentiable gradient estimators
  • Efficiently finding strong lottery tickets could allow high-performance inference using only a fraction of a model's parameters without ever training the weights
Concrete Example: When using Edge-Popup to find a strong lottery ticket in a ResNet50, the algorithm must use a non-differentiable estimator to select edges based on scores, resulting in only ~50% sparsity for good accuracy. The proposed method uses differentiable gates to achieve >90% sparsity at comparable accuracy.
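To make the contrast concrete, here is a minimal sketch of the score-based selection at the heart of Edge-Popup: a binary mask that keeps only the top-scoring fraction of weights. The function name `edge_popup_mask` is hypothetical; the real algorithm pairs this non-differentiable top-k step with a straight-through estimator so the scores can still receive gradients.

```python
def edge_popup_mask(scores, sparsity):
    """Edge-Popup-style mask selection (hypothetical sketch):
    keep the top-(1 - sparsity) fraction of weights by score,
    prune the rest. Note this hard top-k step has no useful
    gradient, which is why Edge-Popup needs a straight-through
    estimator to update the scores."""
    k = round(len(scores) * (1.0 - sparsity))
    if k <= 0:
        return [0 for _ in scores]
    threshold = sorted(scores, reverse=True)[k - 1]
    return [1 if s >= threshold else 0 for s in scores]
```

For example, at 50% sparsity over scores `[0.1, 0.9, 0.5, 0.2]`, only the two highest-scoring weights survive: the mask is `[0, 1, 1, 0]`.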
Key Novelty
Continuously Relaxed Bernoulli Gates for Strong Lottery Tickets
  • Applies a learnable mask (gate) to every weight in a randomly initialized network, where the mask values are drawn from a continuous relaxation of the Bernoulli distribution
  • Allows standard gradient descent to optimize the probability of each weight being active, even though the weights themselves are never updated
  • Enables end-to-end differentiable optimization of the network structure (sparsity) alongside an L0 regularization term, avoiding the need for straight-through estimators
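A common way to realize such continuously relaxed Bernoulli gates is the "hard concrete" distribution of Louizos et al. (L0 regularization): a reparameterized sigmoid sample stretched and clipped into [0, 1], with a closed-form probability of being nonzero that serves as the differentiable L0 penalty. The sketch below uses those standard parameters (`beta`, `gamma`, `zeta`); the paper's exact parameterization may differ.

```python
import math
import random

def hard_concrete_gate(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1, u=None):
    """Sample a relaxed Bernoulli ("hard concrete") gate in [0, 1].
    log_alpha is the learnable per-weight parameter; u is the uniform
    noise of the reparameterization trick, so the sample stays
    differentiable w.r.t. log_alpha."""
    if u is None:
        u = random.random()
    s = 1.0 / (1.0 + math.exp(-((math.log(u) - math.log(1 - u)) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma      # stretch to (gamma, zeta)
    return min(1.0, max(0.0, s_bar))        # clip into [0, 1]

def expected_l0(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Probability the gate is nonzero: the differentiable L0
    penalty summed over all gates during training."""
    return 1.0 / (1.0 + math.exp(-(log_alpha - beta * math.log(-gamma / zeta))))
```

During training, each frozen weight is multiplied by its sampled gate, and the loss adds `expected_l0` over all gates; pushing `log_alpha` very negative drives a gate (and its weight) to exactly zero, which is how the sparse subnetwork emerges without any weight updates.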
Evaluation Highlights
  • Achieves 91.5% sparsity on ResNet50 (CIFAR-10) with 83.1% accuracy, nearly double the sparsity of Edge-Popup at comparable performance
  • Discovers the first known Strong Lottery Tickets for Vision Transformers (ViT-base), retaining 90% sparsity with 76% accuracy without weight training
  • Outperforms prior strong lottery ticket methods on LeNet-300-100 by 11 percentage points in accuracy (96% vs 85%)
Breakthrough Assessment
8/10
Significantly improves upon the standard Edge-Popup algorithm by making the process differentiable, yielding much higher sparsity. Successfully extends strong lottery tickets to Transformers for the first time.