Evaluation Setup
Online Reinforcement Learning on standard benchmarks
Benchmarks:
- Arcade Learning Environment (Atari 2600): discrete control from pixels
- Atari 100K: sample-efficient discrete control
- MuJoCo: continuous control
Metrics:
- Interquartile Mean (IQM) of scores
- Percentage of dormant neurons
- Overlap coefficient of dormant neuron sets (both dormancy metrics are sketched in code after this list)
- Statistical methodology: 95% stratified bootstrap confidence intervals; IQM aggregation
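The two dormancy metrics can be computed directly from a batch of post-activation outputs. Below is a minimal NumPy sketch, assuming the paper's definition of τ-dormancy (a neuron is dormant when its mean absolute activation, normalized by the layer-wide average, is at or below a threshold τ); the function names and the small stabilizing constant are our own illustration:

```python
import numpy as np

def dormant_mask(activations, tau=0.0):
    """Flag tau-dormant neurons in one layer.

    activations: array of shape (batch, num_neurons) holding the layer's
    post-activation outputs on a batch of inputs. tau=0.0 flags only
    exactly-dead units; a larger tau also flags near-dormant ones.
    """
    score = np.abs(activations).mean(axis=0)   # per-neuron mean |h_i(x)|
    score = score / (score.mean() + 1e-9)      # normalize by the layer average
    return score <= tau

def dormant_fraction(activations, tau=0.0):
    """Percentage-of-dormant-neurons metric for one layer."""
    return dormant_mask(activations, tau).mean()

def overlap_coefficient(mask_a, mask_b):
    """Overlap of two dormant sets A, B: |A ∩ B| / min(|A|, |B|)."""
    size_a, size_b = mask_a.sum(), mask_b.sum()
    if min(size_a, size_b) == 0:
        return 0.0
    return np.logical_and(mask_a, mask_b).sum() / min(size_a, size_b)
```

A high overlap coefficient between dormant sets measured at different training steps is what supports the "once inactive, they tend to stay inactive" takeaway below.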
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Atari (17 games) | IQM Score | 0.1 | 1.0 | +0.9 |
| Atari 100K | IQM Score | 0.8 | 1.1 | +0.3 |
| DemonAttack (DQN) | Dormant Neuron Fraction | 0.35 | 0.05 | -0.30 |

Note: ReDo enables effective training at higher replay ratios where standard DQN collapses.
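The replay-ratio results above (and the takeaways below) hinge on what "replay ratio" means operationally: the number of gradient updates performed per environment step. A hypothetical off-policy loop skeleton, with `agent`, `env`, and `buffer` as stand-in interfaces (Gymnasium-style step signature assumed), not an API from the paper:

```python
def train(agent, env, buffer, total_env_steps, replay_ratio=1):
    """Generic off-policy loop; replay_ratio = updates per environment step."""
    obs, _ = env.reset()
    for _ in range(total_env_steps):
        action = agent.act(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        buffer.add(obs, action, reward, next_obs, terminated)
        obs = env.reset()[0] if (terminated or truncated) else next_obs

        # A higher replay ratio reuses each transition for more updates;
        # this is the regime where dormant neurons accumulate fastest.
        for _ in range(replay_ratio):
            agent.update(buffer.sample())
```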
Main Takeaways
- The dormant neuron phenomenon is driven by target non-stationarity, not input non-stationarity (confirmed by fixed-target experiments).
- Dormant neurons do not recover on their own; once inactive, they tend to stay inactive (high overlap coefficient).
- Pruning dormant neurons outright does not hurt performance, indicating they contribute little to the learned function; recycling them instead improves performance, indicating they represent reclaimable capacity.
- Higher replay ratios accelerate the creation of dormant neurons, explaining the instability of high-RR training.
- ReDo allows for more aggressive updates (higher replay ratio) without the typical performance penalty.
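The recycling step behind these results is simple to state: periodically identify dormant neurons, re-initialize their incoming weights, and zero their outgoing weights so the reset leaves the network's current outputs unchanged. A minimal PyTorch sketch for one pair of adjacent fully connected layers, assuming the same dormancy score as above; the threshold value, the function name, and the schedule for when to call it are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def redo_recycle(layer_in: nn.Linear, layer_out: nn.Linear,
                 activations: torch.Tensor, tau: float = 0.025) -> int:
    """One ReDo-style recycling step for two adjacent linear layers.

    activations: (batch, layer_in.out_features) post-activation outputs
    of layer_in. Returns the number of neurons recycled.
    """
    score = activations.abs().mean(dim=0)
    score = score / (score.mean() + 1e-9)
    dormant = score <= tau
    if not dormant.any():
        return 0

    # Re-initialize incoming weights of dormant units with a fresh draw
    # from the default nn.Linear initialization (biases reset to zero
    # here for simplicity).
    fresh = torch.empty_like(layer_in.weight)
    nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
    layer_in.weight[dormant] = fresh[dormant]
    if layer_in.bias is not None:
        layer_in.bias[dormant] = 0.0

    # Zero outgoing weights so recycled units start as a no-op and the
    # network's outputs are unchanged at the moment of the reset.
    layer_out.weight[:, dormant] = 0.0
    return int(dormant.sum())
```

In a training loop this would be invoked every fixed number of gradient steps on a batch drawn from the replay buffer; the output-preserving reset is what lets recycling be applied during training without disrupting the current policy.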