Normalization and effective learning rates in reinforcement learning

📝 Paper Summary

Reinforcement Learning, Continual Learning, Optimization Dynamics

The paper proposes Normalize-and-Project (NaP), a method that decouples effective learning rates from parameter growth to prevent plasticity loss in non-stationary reinforcement learning.

Core Problem

In deep networks with normalization, parameter growth causes the effective learning rate to decay implicitly; in non-stationary settings like continual RL, this decay happens too quickly, preventing the agent from learning new tasks (loss of plasticity).

Why it matters:

Deep RL agents trained for long periods often lose the ability to learn from new data, a phenomenon known as loss of plasticity.
Existing solutions like resetting units or regularization often fail to address the root cause: the coupling between weight magnitude and optimization step size in scale-invariant networks.
Implicit learning rate schedules explain why weight decay can be harmful in value-based RL: it interferes with the necessary annealing of the effective learning rate.

Concrete Example: A Convolutional Neural Network trained on CIFAR-10 with cyclically re-randomized labels eventually stops learning new labels because its parameter norm grows so large that the effective learning rate drops to near zero, freezing the weights.
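
The freezing mechanism can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's CIFAR-10 setup: the cosine loss, the per-step re-randomized target, and all names (`scale_invariant_loss_grad`, `run_sgd`) are hypothetical stand-ins chosen because the loss depends only on the direction of the parameters. Under plain SGD on such a loss the gradient is orthogonal to the parameters, so the norm can only grow and the quantity lr / ||theta||^2 can only shrink.

```python
import numpy as np

def scale_invariant_loss_grad(theta, target):
    """Loss 1 - cos(theta, target); it depends only on the direction of theta."""
    norm = np.linalg.norm(theta)
    u = theta / norm
    loss = 1.0 - u @ target
    # Gradient w.r.t. theta: the component of -target orthogonal to u, divided by ||theta||.
    grad = -(target - (u @ target) * u) / norm
    return loss, grad

def run_sgd(steps=200, lr=0.5, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=dim)
    norms, elrs = [], []
    for _ in range(steps):
        # Assumption: a fresh random unit target each step mimics a non-stationary objective.
        target = rng.normal(size=dim)
        target /= np.linalg.norm(target)
        _, grad = scale_invariant_loss_grad(theta, target)
        theta = theta - lr * grad
        norms.append(np.linalg.norm(theta))
        elrs.append(lr / norms[-1] ** 2)  # effective learning rate under plain SGD
    return np.array(norms), np.array(elrs)

norms, elrs = run_sgd()
print("parameter norm: first", norms[0], "last", norms[-1])
print("effective lr:   first", elrs[0], "last", elrs[-1])
```

The nominal learning rate never changes in this loop; only the parameter norm does, which is what makes the resulting schedule implicit.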

Key Novelty

Normalize-and-Project (NaP)

Combines layer normalization (to control feature statistics) with periodic weight projection (to control gradient scaling).
By forcing weights to stay on a fixed-norm sphere, NaP prevents the effective learning rate from decaying uncontrollably due to parameter growth, making the optimization schedule explicit.
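
The two ingredients can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the choice of ReLU, the omission of biases and learnable normalization parameters, and the use of the initialization norm as the radius are all assumptions made here.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Normalize pre-activations to zero mean and unit variance per example."""
    return (z - z.mean(axis=-1, keepdims=True)) / np.sqrt(z.var(axis=-1, keepdims=True) + eps)

def project(w, radius):
    """Rescale a weight matrix so that its Frobenius norm equals `radius`."""
    return w * (radius / np.linalg.norm(w))

def nap_block(x, w):
    """Linear -> LayerNorm -> ReLU (biases omitted for simplicity)."""
    return np.maximum(layer_norm(x @ w), 0.0)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
radius = np.linalg.norm(w)   # assumption: keep the norm the weights had at initialization
x = rng.normal(size=(4, 8))

out_before = nap_block(x, w)
w_grown = 3.0 * w            # stand-in for norm growth accumulated during training
w_projected = project(w_grown, radius)
out_after = nap_block(x, w_projected)

print(np.linalg.norm(w_grown), np.linalg.norm(w_projected))
print(np.allclose(out_before, out_after))
```

Because the normalization removes the scale of the pre-activations, projecting the weights leaves the block's output essentially unchanged; what changes is the size of subsequent gradient steps relative to the weights.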

Architecture

Conceptual workflow of the Normalize-and-Project (NaP) method.

Evaluation Highlights

Maintains trainability over 500 consecutive label re-randomization tasks on CIFAR-10, whereas standard networks suffer exploding Jacobian norms and performance collapse.
Outperforms a Rainbow agent baseline with freshly initialized parameters after 400M training frames (100M optimizer steps) on the Sequential Arcade Learning Environment.
Successfully trains 400M parameter transformer models on C4 and vision models on ImageNet, matching or slightly improving upon base model performance in stationary settings.

Breakthrough Assessment

8/10

Provides a fundamental theoretical explanation for plasticity loss in normalized networks and offers a simple architectural solution that works across diverse domains (RL, Vision, Language).

⚙️ Technical Details

Problem Definition

Setting: Non-stationary Reinforcement Learning and Continual Learning

Inputs: Sequence of observations from potentially changing environments (e.g., sequential Atari games)

Outputs: Action values (Q-values) or Policy distributions

Pipeline Flow

Input Processing
Normalized Layer Block (Linear -> Norm -> Nonlinearity, with the Linear weights periodically projected to a fixed norm)
Output Generation

System Modules

Normalization Layer (Normalized Layer Block)

Standardizes pre-activations to mean-zero, unit-variance to maintain feature statistics.

Model or implementation: Layer Normalization (LayerNorm)

Weight Projection (Normalized Layer Block)

Rescales layer weights to a fixed norm radius to prevent implicit learning rate decay.

Model or implementation: Projection Operation

Novel Architectural Elements

Normalize-and-Project (NaP) protocol: Explicit coupling of pre-activation normalization with weight projection to a fixed radius (e.g., unit sphere) to decouple parameter norm from effective learning rate.

Modeling

Base Model: Evaluated on various architectures: ResNets (Vision), Transformers (Language), CNNs (RL/Rainbow)

Training Method: Normalize-and-Project (NaP) applied during standard training (RL or Supervised)

Objective Functions:

Purpose: Keep the parameter norm constant so that the effective learning rate stays fixed.

Formally: Periodically project weights w so that ||w|| = R (fixed radius).
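
One way to write the projection step explicitly is given below. The operator name \Pi_R and the interval k are notation introduced here for illustration, not taken from the paper.

```latex
\[
  \Pi_R(w) \;=\; R \,\frac{w}{\lVert w \rVert},
  \qquad
  w \;\leftarrow\; \Pi_R(w) \quad \text{every } k \text{ optimizer steps}.
\]
```

Ignoring biases and the small constant inside the normalization, a linear layer followed by normalization produces the same output for w and \Pi_R(w), so the projection resets the norm without changing the function the layer computes.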

Adaptation: Applies to standard backpropagation training

Training Data:

Sequential Arcade Learning Environment (Atari)
CIFAR-10 (Cyclic label noise)
ImageNet
C4 dataset

Key Hyperparameters:

projection_interval: Evaluated at 1 step and 1000 steps (results nearly identical)
transformer_size: 400M parameters
training_frames_rl: 400M frames
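
The projection_interval hyperparameter could be wired into a training loop roughly as follows. This is a hedged sketch on a toy direction-only loss: `train_with_projection`, `toy_grad`, `TARGET`, the learning rate, and the step count are hypothetical and do not reproduce the paper's experiments.

```python
import numpy as np

TARGET = np.array([0.0, 1.0, 0.0])  # hypothetical unit-norm target direction

def toy_grad(w):
    """Gradient of the direction-only loss 1 - cos(w, TARGET)."""
    norm = np.linalg.norm(w)
    u = w / norm
    return -(TARGET - (u @ TARGET) * u) / norm

def train_with_projection(w, grad_fn, lr, radius, projection_interval, steps):
    """Plain SGD, projecting back to norm `radius` every `projection_interval` steps."""
    max_norm = np.linalg.norm(w)
    for step in range(1, steps + 1):
        w = w - lr * grad_fn(w)
        max_norm = max(max_norm, np.linalg.norm(w))  # track drift between projections
        if step % projection_interval == 0:
            w = w * (radius / np.linalg.norm(w))
    return w, max_norm

w0 = np.array([1.0, 0.0, 0.0])
for interval in (1, 1000):
    w, max_norm = train_with_projection(
        w0, toy_grad, lr=0.01, radius=1.0, projection_interval=interval, steps=2000
    )
    print(interval, np.linalg.norm(w), max_norm, w @ TARGET)
```

A longer interval lets the norm drift between projections, so the effective learning rate is only approximately constant within each window; how much this matters depends on how fast the norm grows relative to the interval.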

Compute: Not reported in the paper

Reproducibility

No replication artifacts mentioned in the paper. Code URL is not provided in the text. Evaluation uses standard datasets (CIFAR, ALE, ImageNet, C4).

📊 Experiments & Results

Evaluation Setup

Evaluation on both stationary (standard supervised) and non-stationary (continual/sequential) learning tasks.

Benchmarks:

Sequential Arcade Learning Environment (ALE) (Continual Reinforcement Learning)
Cyclic CIFAR-10 (Synthetic Non-stationary Supervised Learning, Label Noise) [New]
ImageNet (Image Classification)
C4 (Language Modeling)

Metrics:

Effective Learning Rate (ELR)
Parameter Norm
Jacobian Norm
Plasticity (ability to learn new tasks)
Test/Train Accuracy/Reward
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Comparison of Jacobian Norm and Parameter Norm evolution during cyclic CIFAR-10 training for Standard, Normalized, and NaP networks.

Main Takeaways

Implicit learning rate decay caused by parameter growth in normalized networks is a double-edged sword: it aids convergence in stationary tasks (acting as a schedule) but harms plasticity in non-stationary tasks by annealing the learning rate too quickly.
Weight decay can detrimentally affect value-based RL because it prevents the parameter norm from growing, thereby preventing the implicit learning rate decay required for stable value function learning.
NaP (Normalize-and-Project) robustly solves the plasticity loss problem in cyclic CIFAR-10 experiments, maintaining a stable Jacobian norm where standard networks diverge or collapse.
In the Sequential Arcade Learning Environment, a NaP agent outperforms a 'Fresh Baseline' (a new network initialized for each task) after 400M frames, indicating that it retains at least as much ability to learn as a freshly initialized network.
NaP can be applied to large-scale stationary tasks (400M transformers on C4, ResNets on ImageNet) without harming performance, and in some cases slightly improving it.
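
The weight-decay takeaway above can be made concrete with a standard calculation for plain SGD with coupled L2 weight decay. It is offered as a sketch of the mechanism, not as the paper's derivation, and adaptive optimizers such as Adam differ in detail. The symbols \eta (learning rate), \lambda (weight-decay coefficient), and g_t (loss gradient) are notation introduced here. For a scale-invariant loss the gradient is orthogonal to the parameters, \langle \theta_t, g_t \rangle = 0, so:

```latex
\[
  \theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta\, g_t
  \quad\Longrightarrow\quad
  \lVert \theta_{t+1} \rVert^2
  = (1 - \eta\lambda)^2 \lVert \theta_t \rVert^2 + \eta^2 \lVert g_t \rVert^2 .
\]
```

With \lambda = 0 the norm can only grow, so the effective learning rate \eta / \lVert \theta_t \rVert^2 can only shrink. With \lambda > 0 the shrinkage factor counteracts that growth, the norm can settle at a plateau, and the implicit annealing stops.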

📚 Prerequisite Knowledge

Prerequisites

Basics of Deep Reinforcement Learning (DQN, Rainbow)
Layer Normalization and its effect on gradients
Optimization dynamics (SGD, Adam)
Concept of Scale Invariance in neural networks

Key Terms

NaP: Normalize-and-Project—the proposed method of combining layer normalization with periodic projection of weights to a fixed radius to maintain a constant effective learning rate.

Effective Learning Rate (ELR): The actual step size in function space for a scale-invariant network; for normalized layers trained with plain SGD, the ELR scales inversely with the squared parameter norm.
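
The inverse-square dependence follows from a short, standard argument for plain SGD, sketched here with notation introduced for illustration (\rho for the parameter norm, u for the unit direction, \eta for the nominal learning rate). If f(c\theta) = f(\theta) for all c > 0, differentiating both sides with respect to \theta gives the gradient scaling rule, and one SGD step on \theta = \rho u becomes:

```latex
\[
  \nabla f(c\,\theta) = \frac{1}{c}\,\nabla f(\theta),
  \qquad
  \theta' = \rho\,u - \eta\,\nabla f(\rho\,u)
          = \rho \left( u - \frac{\eta}{\rho^{2}}\,\nabla f(u) \right).
\]
```

Up to the overall scale \rho, which a scale-invariant function ignores, the step is the one SGD would take from u with learning rate \eta / \rho^2.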

Plasticity: The ability of a neural network to adapt to new data or tasks after being trained on previous data; loss of plasticity refers to the inability to learn new information.

Scale-invariance: A property of a function where scaling the parameters by a positive constant does not change the output (e.g., f(cθ, x) = f(θ, x) for any c > 0), commonly induced by normalization layers.
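
The property, and its consequence for gradients, can be checked numerically. This is a minimal sketch under assumptions made here: a single bias-free linear layer, normalization without an epsilon term, a tanh nonlinearity chosen so that finite differences are well behaved, and hypothetical names throughout.

```python
import numpy as np

def normalized_layer(w, x):
    """Linear map, per-example normalization, then a smooth nonlinearity."""
    z = x @ w
    z = (z - z.mean(axis=-1, keepdims=True)) / z.std(axis=-1, keepdims=True)
    return np.tanh(z)

def loss(w, x, y):
    return float(np.mean((normalized_layer(w, x) - y) ** 2))

def numerical_grad(fn, w, h=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = h
        grad[idx] = (fn(w + step) - fn(w - step)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))
y = rng.normal(size=(4, 6))
w = rng.normal(size=(5, 6))
c = 4.0

# Scaling the weights leaves the output unchanged...
same_output = np.allclose(normalized_layer(c * w, x), normalized_layer(w, x))
# ...while the gradient shrinks by the same factor.
grad_w = numerical_grad(lambda v: loss(v, x, y), w)
grad_cw = numerical_grad(lambda v: loss(v, x, y), c * w)
grad_scales_inversely = np.allclose(grad_cw, grad_w / c, rtol=1e-4, atol=1e-7)

print(same_output, grad_scales_inversely)
```

The second check is what ties scale-invariance to the effective learning rate: larger weights receive proportionally smaller gradients, so a fixed nominal step moves them proportionally less.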

Rainbow: A value-based reinforcement learning agent that combines several improvements to DQN (Deep Q-Network), such as distributional RL and multi-step learning, and is widely used as a strong Atari baseline.

ALE: Arcade Learning Environment—a benchmark suite of Atari 2600 games used to evaluate reinforcement learning agents.

Neural Tangent Kernel: A kernel function that describes the evolution of a neural network during training in the infinite-width limit, often used to analyze trainability.

C4: Colossal Clean Crawled Corpus—a massive dataset of web text used for training large language models.

Saturated units: Neurons (like ReLUs) that are stuck outputting zero or a constant value, preventing gradient flow.
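
A simple diagnostic for this condition is the fraction of ReLU units that output zero for every input in a batch. The helper below is a hypothetical illustration (the name `dead_unit_fraction` and the synthetic data are assumptions), not a metric defined in the paper.

```python
import numpy as np

def dead_unit_fraction(pre_activations):
    """Fraction of ReLU units whose pre-activation is <= 0 for every input in the batch."""
    dead = np.all(pre_activations <= 0.0, axis=0)
    return float(dead.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))   # batch of 256 inputs
w = rng.normal(size=(32, 64))    # layer with 64 units
bias = np.zeros(64)
bias[:16] = -1e3                 # artificially push the first 16 units far below zero

fraction = dead_unit_fraction(x @ w + bias)
print(fraction)
```

Units flagged this way receive zero gradient through the ReLU on that batch, which is why a large dead fraction is commonly read as a sign of reduced trainability.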