In value-based deep reinforcement learning, a pruned network is a good network

📝 Paper Summary

Deep Reinforcement Learning (RL), Neural Network Pruning, Sparse Neural Networks

Gradual magnitude pruning enables value-based RL agents to scale effectively with network size, achieving significant performance gains with only a small fraction of the original parameters.

Core Problem

Deep RL agents struggle to utilize parameters effectively; simply scaling up network size often leads to performance degradation or instability, unlike in supervised learning.

Why it matters:

Current deep RL methods under-utilize network parameters (dormant neurons), wasting computational resources.
Scaling laws (getting better performance just by making networks bigger) have been difficult to replicate in RL due to instability and overfitting.
Standard dense architectures often hit a performance ceiling or degrade when width is increased.

Concrete Example: When a standard Rainbow agent's network width is increased by a factor of 4, its performance on Atari games drops significantly compared to the base width. In contrast, the pruned version's performance keeps climbing.

Key Novelty

Gradual Magnitude Pruning as a Scaling Enabler for RL

Apply gradual magnitude pruning to large, dense networks during training, removing weights with low magnitude until a high sparsity target (e.g., 95%) is reached.
Unlike dense networks, which degrade when scaled up, pruned networks in RL improve monotonically as the base network size increases.
This technique works as a general 'drop-in' improvement for value-based methods (DQN, Rainbow, IQN) and offline RL (CQL), effectively acting like a dynamic architecture search.

Architecture

The polynomial schedule for sparsity over training steps.

Evaluation Highlights

+60% (DQN) and +50% (Rainbow) improvement in Human Normalized Score over standard dense baselines on Atari when using pruned, scaled networks.
Pruning maintains performance with only 1% of parameters (99% sparsity) and achieves significant gains at 95% sparsity compared to dense baselines.
In offline RL (CQL), wider pruned networks achieve +173% improvement over standard dense baselines on Atari datasets.

Breakthrough Assessment

8/10

Provides a compelling solution to the long-standing problem of scaling deep RL networks. The finding that pruning turns scaling from harmful to helpful is a significant reversal of conventional wisdom in RL.

⚙️ Technical Details

Problem Definition

Setting: Value-based Reinforcement Learning (Online and Offline) on the Arcade Learning Environment (ALE)

Inputs: High-dimensional state observations (pixels from Atari games)

Outputs: Action values Q(s, a) for discrete action spaces

Pipeline Flow

Environment Interaction (Data Collection)
Replay Buffer Storage
Network Update (Dense Gradient Descent)
Mask Update (Pruning low-magnitude weights)

System Modules

Base Agent

Estimates value functions (Q-values) to select actions

Model or implementation: Impala ResNet (15-layer) with varying width scales (1x to 6x)
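
To make the Base Agent's role concrete, here is a minimal sketch of greedy action selection from a vector of Q-values, together with the one-step temporal difference error that the DQN-style objective under Modeling squares. The function names, the discount value, the terminal-state handling, and the toy numbers are illustrative assumptions, not taken from the paper or from Dopamine.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_error(reward, gamma, next_q_target, q_sa, terminal=False):
    """One-step TD error: r + gamma * max_a' Q_target(s', a') - Q(s, a)."""
    # Assumption: no bootstrapping from terminal states, as is standard practice.
    bootstrap = 0.0 if terminal else gamma * max(next_q_target)
    return reward + bootstrap - q_sa


q_online = [0.2, 1.5, -0.3]       # Q(s, .; theta), made-up values
q_target_next = [0.0, 2.0, 1.0]   # Q(s', .; theta'), made-up values

action = greedy_action(q_online)
delta = td_error(reward=1.0, gamma=0.99,
                 next_q_target=q_target_next, q_sa=q_online[action])
loss = delta ** 2                 # squared TD error for this single transition
```

In a real agent the Q-values come from the network and the loss is averaged over a replay batch. The sketch only shows how the target and the prediction combine.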

Pruner

Periodically masks out weights with the lowest magnitudes during training

Model or implementation: Polynomial decay schedule (Zhu & Gupta, 2017)
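
The masking step of the Pruner can be sketched as follows. This is a hedged illustration using NumPy on a single weight array, not the JaxPruner implementation. The function name `magnitude_mask` and the choice to prune per array rather than globally across layers are assumptions.

```python
import numpy as np


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the `sparsity` fraction of smallest-|w| entries."""
    flat = np.abs(weights).ravel()
    num_pruned = int(round(sparsity * flat.size))
    mask = np.ones(flat.size, dtype=weights.dtype)
    if num_pruned > 0:
        # argpartition places the num_pruned smallest magnitudes first (unordered).
        prune_idx = np.argpartition(flat, num_pruned - 1)[:num_pruned]
        mask[prune_idx] = 0
    return mask.reshape(weights.shape)


weights = np.array([[0.5, -0.05, 2.0],
                    [0.01, -1.2, 0.3]])
mask = magnitude_mask(weights, sparsity=0.5)
pruned = weights * mask   # the three smallest-magnitude weights become zero
```

Applying the mask after each gradient update keeps pruned weights at zero, while the sparsity argument is raised over time according to the schedule.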

Novel Architectural Elements

Integration of Gradual Magnitude Pruning into RL training loops as a mechanism to enable width scaling without overfitting
Using wide ResNet Impala backbones (up to 6x width) specifically combined with high sparsity (95%)
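
A quick back-of-the-envelope calculation shows why wide networks and high sparsity combine well. The sketch below assumes an interior 3x3 convolution whose input and output channel counts both scale with the width multiplier, so its weight count grows quadratically with width. The 32-channel base layer is a hypothetical example, not a figure from the paper.

```python
def conv_weight_count(c_in, c_out, kernel=3):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return c_in * c_out * kernel * kernel


def remaining_weights(c_in, c_out, width_scale, sparsity, kernel=3):
    """Dense and post-pruning weight counts after scaling both channel dims."""
    dense = conv_weight_count(c_in * width_scale, c_out * width_scale, kernel)
    return dense, round(dense * (1.0 - sparsity))


base_dense = conv_weight_count(32, 32)            # hypothetical 1x layer
for scale in (1, 3, 6):
    dense, kept = remaining_weights(32, 32, scale, sparsity=0.95)
    print(f"width {scale}x: dense={dense}, kept at 95% sparsity={kept}")
```

Under these assumptions, a 3x-wide layer at 95% sparsity keeps fewer non-zero weights than the dense 1x layer, even though the pruning process gets to choose them from a pool nine times larger.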

Modeling

Base Model: Impala ResNet (15 layers)

Training Method: Value-based RL (DQN/Rainbow/CQL) with Gradual Magnitude Pruning

Objective Functions:

Purpose: Minimize temporal difference error between predicted and target Q-values.

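Formally: L(θ) = E[(r + γ max_a' Q(s', a'; θ') - Q(s, a; θ))^2], where θ are the online network parameters and θ' the target network parameters.

Purpose: Enforce sparsity constraint.

Formally: At each mask update, weights w with |w| below a threshold are masked so that the network reaches sparsity s_t, following the polynomial schedule s_t = s_F + (s_0 - s_F)(1 - (t - t_start)/(t_end - t_start))^3 for t_start ≤ t ≤ t_end, where s_0 is the initial sparsity and s_F the final target sparsity.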

Adaptation: Masks are updated iteratively during training steps t_start to t_end
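
The polynomial sparsity schedule can be written as a small function. This is a minimal sketch: the function name, the clamping outside the pruning window, and the toy step counts are assumptions made for illustration.

```python
def polynomial_sparsity(step, start_step, end_step, final_sparsity,
                        initial_sparsity=0.0, power=3):
    """Target sparsity at `step` under a polynomial (cubic by default) schedule."""
    if step <= start_step:
        return initial_sparsity      # assumed: no pruning before the window opens
    if step >= end_step:
        return final_sparsity        # assumed: sparsity is held fixed afterwards
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** power


# Toy run of 100 steps with pruning between steps 20 and 80 (illustrative numbers).
schedule = [polynomial_sparsity(t, 20, 80, 0.95) for t in range(0, 101, 10)]
```

Because of the cubic exponent, sparsity rises quickly at the start of the window and flattens near the end: halfway through the window the schedule has already reached 87.5% of the final target.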

Trainable Parameters: Sparse subset of the full dense parameters (e.g., 5% remain once a 95% sparsity target is reached)

Key Hyperparameters:

target_sparsity: 0.95 (typically)
pruning_start_step: 20% of total training steps
pruning_end_step: 80% of total training steps
pruning_frequency: Every 1000 steps (implied by standard pruning setups, not explicitly listed in text body)
network_width_scale: Values tested: 1, 2, 3, 4, 5, 6
optimizer: Adam
learning_rate: 0.0000625 (DQN default)
batch_size: 32
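
As a worked example, the listed hyperparameters can be turned into concrete step counts. The sketch assumes a run of roughly 10M agent steps (40M frames at 4 frames per agent step) and uses the 1000-step pruning frequency that this summary marks as implied rather than stated. The class name `PruningConfig` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PruningConfig:
    total_steps: int = 10_000_000    # assumed: ~40M frames / 4 frames per step
    start_fraction: float = 0.2      # pruning starts 20% into training
    end_fraction: float = 0.8        # pruning ends 80% into training
    target_sparsity: float = 0.95
    update_every: int = 1_000        # implied value, not confirmed by the paper text

    @property
    def start_step(self):
        return round(self.start_fraction * self.total_steps)

    @property
    def end_step(self):
        return round(self.end_fraction * self.total_steps)

    @property
    def num_mask_updates(self):
        return (self.end_step - self.start_step) // self.update_every


cfg = PruningConfig()
print(cfg.start_step, cfg.end_step, cfg.num_mask_updates)
```

With these assumed values, pruning runs from step 2M to step 8M and the mask is recomputed 6000 times.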

Compute: NVIDIA Tesla P100 GPUs. Experiments took approx 2 days (40M frames).

Comparison to Prior Work

vs. Dense Scaling: Pruning prevents performance degradation when widening networks, enabling positive scaling laws.
vs. Graesser et al.: This paper focuses on scaling up the base network to achieve better-than-baseline performance, rather than just matching baseline with fewer parameters.

Limitations

Pruning does not improve performance on the original simple CNN architecture (Mnih et al., 2015), only on ResNets.
Pruning showed no gains in the extreme low-data regime (100k steps) unless training was extended.
Performance gains were less consistent in continuous control tasks (MuJoCo/SAC), helping in only 2 of 5 environments.

Reproducibility

Code: https://github.com/google/dopamine

The code is publicly available in the Dopamine repository linked above and relies on the JaxPruner library. Hyperparameters are detailed in Appendix F. Experiments typically run for 40M frames (approx. 10M agent steps).

📊 Experiments & Results

Evaluation Setup

Atari 2600 games via Arcade Learning Environment (ALE)

Benchmarks:

Atari 15 Games Subset (Discrete Control / Visual RL)
Atari 60 Games (Full Suite) (Discrete Control / Visual RL)
Offline RL (17 Atari games) (Offline RL)

Metrics:

Human Normalized Score
Interquartile Mean (IQM)
Statistical methodology: 95% stratified bootstrap confidence intervals reported.
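
The two metrics can be computed as in the sketch below. Human Normalized Score rescales a raw game score so that 0 corresponds to a random policy and 1 to the human reference. The IQM here drops the lowest and highest quarter of the sorted scores, using the same truncation rule as `scipy.stats.trim_mean` with `proportiontocut=0.25`. All numbers are made up for illustration, and the stratified bootstrap used for the confidence intervals is not shown.

```python
def human_normalized_score(agent, random, human):
    """Rescale a raw score: 0.0 = random policy, 1.0 = human reference."""
    return (agent - random) / (human - random)


def interquartile_mean(scores):
    """Mean of the scores left after dropping the bottom and top 25%."""
    ordered = sorted(scores)
    cut = len(ordered) // 4          # entries trimmed from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


hns = human_normalized_score(agent=500.0, random=100.0, human=900.0)

# Made-up normalized scores with one large outlier.
scores = [0.1, 0.4, 0.5, 0.6, 0.7, 0.9, 1.2, 5.0]
iqm = interquartile_mean(scores)
mean = sum(scores) / len(scores)
```

In this toy example the outlier of 5.0 pulls the plain mean well above the IQM, which is why IQM is preferred for aggregating across games and runs.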

Key Results

Experiment	Benchmark	Metric	Baseline	This Paper	Δ
Width scaling (dense networks degrade with width, pruned networks improve)	Atari 15 Games	IQM Human Normalized Score	0.55	0.90	+0.35
Width scaling (dense networks degrade with width, pruned networks improve)	Atari 15 Games	IQM Human Normalized Score	1.1	1.7	+0.6
Sparsity level analysis (optimal performance at 95% sparsity)	Atari 15 Games	IQM Human Normalized Score	0.6	1.0	+0.4
Offline RL (gains in fixed-dataset settings)	17 Atari Games (Offline)	IQM Human Normalized Score	0.4	1.1	+0.7

Experiment Figures

Performance (IQM) of DQN and Rainbow agents as a function of network width multiplier, comparing Dense vs. Pruned.

Effect of different sparsity levels (50% to 99%) on DQN performance.

Main Takeaways

Pruning enables monotonic performance improvements with network width, solving the scaling issue prevalent in dense deep RL.
A sparsity level of 95% consistently yielded the best performance across experiments.
The benefits of pruning generalize to various value-based algorithms (DQN, Rainbow, IQN, Munchausen) and offline RL (CQL).
Pruning acts as a regularizer, preventing the performance degradation typically seen when increasing the Replay Ratio (updates per step).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning fundamentals (MDPs, Q-learning, Bellman equation)
Deep Learning basics (CNNs, ResNets)
Network Pruning concepts (sparsity, magnitude pruning)

Key Terms

Gradual Magnitude Pruning: A technique in which the network weights with the smallest absolute values are progressively set to zero during training, following a sparsity schedule (e.g., polynomial decay).

Impala Architecture: A specific ResNet-based deep neural network architecture commonly used in RL, consisting of residual blocks.

Sparsity: The percentage of parameters in a neural network that are set to zero (inactive).

IQM: Interquartile Mean, a robust aggregate metric that averages the middle 50% of scores, reducing the impact of outliers.

Replay Ratio: The number of gradient updates performed per environment step collected.

Offline RL: Training RL agents using a fixed dataset of previously collected interactions without further environment interaction.

Dormant Neurons: Neurons in a neural network that become inactive (zero output) during training and stop contributing to the network's function.
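
One way to measure dormancy is to look at each neuron's mean absolute activation over a batch, normalized by the layer average, and count neurons whose score falls at or below a threshold. The sketch below follows that idea. The function name, the normalization, and the threshold parameter `tau` are assumptions for illustration. With `tau=0.0` it reduces to the zero-output definition given above.

```python
import numpy as np


def dormant_fraction(activations, tau=0.0):
    """Fraction of dormant neurons; activations has shape (batch, neurons)."""
    mean_abs = np.abs(activations).mean(axis=0)   # per-neuron activity
    layer_mean = mean_abs.mean()
    if layer_mean == 0.0:
        return 1.0                                # every neuron is silent on this batch
    scores = mean_abs / layer_mean                # normalize by the layer average
    return float((scores <= tau).mean())


# Made-up pre-activations for a batch of 2 inputs and a layer of 4 neurons.
pre_activations = np.array([[1.0, -2.0, 0.5, -0.1],
                            [0.3, -1.0, -0.2, -3.0]])
activations = np.maximum(pre_activations, 0.0)    # ReLU
fraction = dormant_fraction(activations)
```

Here the second and fourth neurons never fire on the batch, so half of the layer counts as dormant.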