Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods

📝 Paper Summary

Policy Optimization, Variance Reduction, Proximal Policy Optimization (PPO)

D-PPO reduces the excessive variance of the surrogate objective in policy optimization by selectively dropping training samples that contribute to high variance based on a derived upper bound.

Core Problem

Policy optimization algorithms like PPO and TRPO use importance sampling to reuse data, but this causes the variance of the surrogate objective to grow quadratically with the objective value, leading to training instability.

Why it matters:

High variance in the surrogate objective destabilizes the policy update process, potentially leading to performance collapse or slow convergence.
Existing variance reduction techniques such as baselines (Actor-Critic) and GAE target the variance of the gradient estimate, but they do not systematically analyze the variance of the surrogate objective itself.
Stable training is critical for Deep Reinforcement Learning in complex environments like Atari games where sample efficiency and robust convergence are required.

Concrete Example: In the Breakout environment, standard PPO may experience spikes in the surrogate objective variance during training (e.g., around 7 million steps), which correlates with unstable returns or slower learning compared to a method that actively limits this variance.

Key Novelty

Dropout-PPO (D-PPO)

Derives a theoretical upper bound for the variance of the importance-sampled surrogate objective, showing it grows quadratically with the objective's magnitude.
Identifies that variance can be reduced by maximizing a specific term (correlation between samples), leading to a strategy of dropping samples with small values for this term.
Implements a 'dropout' mechanism that discards a fixed ratio of training data (e.g., 20%) that contributes least to reducing the variance bound.

Architecture

The neural network structure and the data flow incorporating the dropout strategy.

Evaluation Highlights

+101.1% improvement in average return over PPO in the Enduro environment (391.2 for D-PPO vs. 194.5 for PPO).
+70.1% improvement in average return over PPO in the Breakout environment (135.5 for D-PPO vs. 79.6 for PPO).
Significantly reduced surrogate objective variance compared to PPO across multiple Atari environments (e.g., DemonAttack, Gravitar) throughout training.

Breakthrough Assessment

4/10

The paper provides a solid theoretical analysis of objective variance and a simple, effective fix (D-PPO). While the performance gains on Atari are strong, the method is an incremental modification to PPO rather than a paradigm shift.

⚙️ Technical Details

Problem Definition

Setting: Policy-based Reinforcement Learning using Importance Sampling

Inputs: State s, Action a, Reward r

Outputs: Policy π(a|s) and Value V(s)

Pipeline Flow

Data Collection (Interaction with Environment)
Advantage Estimation (GAE)
Dropout Selection (Filter samples)
Policy Update (PPO Optimization)
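
The advantage-estimation step in the flow above can be sketched as follows. This is a minimal, generic GAE implementation in plain Python, not the paper's code; the function name `gae_advantages` and its argument layout are assumptions made for illustration, with the defaults taken from the hyperparameters listed in this summary (γ = 0.99, λ = 0.95).

```python
def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    rewards[t], values[t], dones[t] describe step t; last_value is the
    critic's estimate for the state that follows the final step.
    (Hypothetical helper: names and layout are illustrative assumptions.)
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk the rollout backwards so each step can reuse the next step's result.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    adv = gae_advantages([1.0, 1.0], [0.0, 0.0], 0.0, [False, True])
    print(adv)
```

In D-PPO these advantages would feed the dropout selection step before the policy update.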

System Modules

Actor (Policy Network)

Outputs action probabilities given a state

Model or implementation: CNN (3 layers) + Fully Connected

Critic (Value Network)

Estimates state value V(s)

Model or implementation: CNN (3 layers) + Fully Connected

Dropout Mechanism

Calculates the variance-reduction term for each sample and discards the bottom r% of samples

Model or implementation: Matrix-based parallel computation

Novel Architectural Elements

Insertion of a Dropout Strategy module that filters the training batch based on a calculated variance proxy term (Δ_i) derived from the surrogate objective upper bound.
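
A minimal sketch of the filtering step, under stated assumptions: the formula for Δ_i is not reproduced in this summary, so the `scores` argument below is a hypothetical stand-in for whatever per-sample term the paper computes. The helper name `dropout_keep_indices` and the rounding of the drop count are also assumptions. Only the selection logic (discard the lowest-scoring fraction r, keep the rest) follows the description above.

```python
def dropout_keep_indices(scores, dropout_ratio=0.2):
    """Return indices of the samples kept after discarding the lowest scores.

    scores: one value per sample, standing in for the paper's per-sample
            term (its exact definition is NOT given here; this is a placeholder).
    dropout_ratio: fraction of the batch to discard (r = 0.2 in the summary).
    """
    if not 0.0 <= dropout_ratio < 1.0:
        raise ValueError("dropout_ratio must be in [0, 1)")
    # Assumption: the number of dropped samples is rounded down.
    n_drop = int(len(scores) * dropout_ratio)
    # Sample indices sorted from smallest to largest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    # Drop the n_drop smallest; return the survivors in their original order.
    return sorted(order[n_drop:])


if __name__ == "__main__":
    demo_scores = [0.5, -1.0, 2.0, 0.1, 3.0]
    print(dropout_keep_indices(demo_scores, 0.2))
```

The kept indices can then be used to slice states, actions, advantages, and old log-probabilities before the PPO update. A vectorized implementation would replace the Python sort with a batched top-k or argsort.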

Modeling

Base Model: Custom CNN architecture (Nature DQN style)

Training Method: Proximal Policy Optimization (PPO) with Dropout

Objective Functions:

Purpose: Maximize expected return while keeping the policy close to the old one.

Formally: L^CLIP(θ) = E[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)], where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio (distinct from the reward r and the dropout ratio r) and A_t is the advantage estimate.

Purpose: Minimize the difference between predicted and actual returns.

Formally: L^VF = (V_θ(s_t) - V_target)^2

Purpose: Encourage exploration.

Formally: S[π_θ](s_t) (Entropy bonus)
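
To make the clipped objective concrete, here is a small plain-Python sketch of L^CLIP evaluated on a batch. It is the standard PPO formula rather than code from the paper; the function name `clipped_surrogate` and the use of log-probabilities as inputs are assumptions for illustration, and ε defaults to the 0.1 listed in the hyperparameters.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantages, eps=0.1):
    """Batch average of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    new_logp / old_logp: log pi(a_t | s_t) under the new and old policies.
    (Hypothetical helper; inputs as log-probabilities is an assumption.)
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)            # importance ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Taking the min makes the objective a pessimistic bound.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    new_lp = [math.log(1.5), math.log(0.5)]
    old_lp = [0.0, 0.0]
    print(clipped_surrogate(new_lp, old_lp, [1.0, -1.0]))
```

In the demo, the first sample's ratio of 1.5 is clipped to 1.1 because its advantage is positive, and the second sample's ratio of 0.5 is clipped to 0.9 because its advantage is negative, so neither sample rewards moving the policy further than ε from the old one.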

Training Data:

Atari 2600 environments (Gym)
8 parallel actors collecting data

Key Hyperparameters:

learning_rate: 2.5e-4
batch_size: 2048
mini_batch_size: 512
update_epochs: 4
gae_lambda: 0.95
discount_factor_gamma: 0.99
clipping_parameter_epsilon: 0.1
entropy_coefficient: 0.01
dropout_ratio_r: 0.2
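
For reference, the listed values can be collected into a single configuration object. This is only an illustrative sketch: the class name `DPPOConfig`, the fields `num_actors` and `total_steps` (taken from the training-data and compute notes in this summary), and the derived properties are assumptions. In particular, it assumes `batch_size` is the total number of transitions collected across all actors per update, which the summary does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPPOConfig:
    """Hyperparameters as listed in the summary (hypothetical container)."""
    learning_rate: float = 2.5e-4
    batch_size: int = 2048
    mini_batch_size: int = 512
    update_epochs: int = 4
    gae_lambda: float = 0.95
    discount_factor_gamma: float = 0.99
    clipping_parameter_epsilon: float = 0.1
    entropy_coefficient: float = 0.01
    dropout_ratio_r: float = 0.2
    num_actors: int = 8              # "8 parallel actors"
    total_steps: int = 10_000_000    # "10 million steps"

    @property
    def steps_per_actor(self) -> int:
        # Assumption: batch_size counts transitions across all actors.
        return self.batch_size // self.num_actors

    @property
    def minibatches_per_epoch(self) -> int:
        return self.batch_size // self.mini_batch_size

    @property
    def num_updates(self) -> int:
        return self.total_steps // self.batch_size


if __name__ == "__main__":
    cfg = DPPOConfig()
    print(cfg.steps_per_actor, cfg.minibatches_per_epoch, cfg.num_updates)
```

Under that assumption, each actor contributes 256 steps per update and each epoch processes 4 mini-batches. Whether samples are dropped before or after the split into mini-batches is not specified in this summary.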

Compute: 8 parallel actors; Training for 10 million steps

Comparison to Prior Work

vs. PPO: D-PPO adds a data filtering step (dropout) based on a theoretical variance bound, whereas PPO uses all collected data.
vs. TRPO: D-PPO uses the computationally cheaper PPO clipping objective rather than the hard KL constraint of TRPO.

Limitations

The dropout strategy introduces a new hyperparameter (dropout ratio r) that needs tuning.
Performance in the 'Boxing' environment was worse than baseline PPO (83.2 vs. 90.9 average return), suggesting the method may not be universally beneficial, for example under dense reward structures.
The theoretical derivation relies on approximations (ignoring P_theta_old(s)) which might affect the tightness of the bound.

Reproducibility

No code URL provided. Hyperparameters are detailed in Table I. Network architecture described in text and Figure 1. Derivations for variance bounds provided in Lemma/Theorem sections.

📊 Experiments & Results

Evaluation Setup

Atari 2600 games via OpenAI Gym

Benchmarks:

Atari 2600 arcade games (discrete action space)

Metrics:

Average Return (Reward)
Surrogate Objective Variance
Statistical methodology: Experiments repeated with 5 different random seeds

Key Results

D-PPO generally outperforms PPO across most tested Atari environments, with significant gains in Breakout and Enduro.

Benchmark	Metric	Baseline (PPO)	This Paper (D-PPO)	Δ
Enduro	Average Return	194.552	391.222	+196.670
Breakout	Average Return	79.615	135.457	+55.842
DemonAttack	Average Return	4426.820	6184.817	+1757.997
Kangaroo	Average Return	1824.800	2792.167	+967.367
Boxing	Average Return	90.923	83.171	-7.752
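
The Δ column and the relative improvements quoted in the Evaluation Highlights can be re-derived from the table values. The short script below is only an arithmetic check; the dictionary layout and the function name `summarize` are illustrative choices, and the numbers are copied from the table above.

```python
# (PPO average return, D-PPO average return), copied from the results table.
RESULTS = {
    "Enduro": (194.552, 391.222),
    "Breakout": (79.615, 135.457),
    "DemonAttack": (4426.82, 6184.817),
    "Kangaroo": (1824.8, 2792.167),
    "Boxing": (90.923, 83.171),
}


def summarize(results):
    """Map each environment to (absolute delta, relative change in percent)."""
    summary = {}
    for env, (ppo, dppo) in results.items():
        delta = dppo - ppo
        relative = 100.0 * delta / ppo
        summary[env] = (round(delta, 3), round(relative, 1))
    return summary


if __name__ == "__main__":
    for env, (delta, rel) in summarize(RESULTS).items():
        print(f"{env:12s} delta={delta:+10.3f}  relative={rel:+6.1f}%")
```

Running it reproduces the +101.1% (Enduro) and +70.1% (Breakout) figures and shows that the Boxing regression is roughly 8.5% relative to PPO.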

Experiment Figures

Training curves (Average Return vs Steps) and Variance curves (Surrogate Objective Variance vs Steps) for PPO vs D-PPO.

Hyperparameter sensitivity analysis for the dropout ratio 'r'.

Main Takeaways

D-PPO achieves higher average returns in 7 out of 8 tested Atari environments compared to vanilla PPO.
The method successfully limits the variance of the surrogate objective, as evidenced by lower variance curves in the later stages of training for environments like Breakout and Enduro.
A dropout ratio of r=0.2 was empirically found to be effective, though the method introduces sensitivity to this hyperparameter.
The approach validates the theoretical finding that maximizing the correlation term between samples reduces overall objective variance.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning basics (MDPs, Policy Gradient)
Importance Sampling
Proximal Policy Optimization (PPO)
Variance and Expectation properties

Key Terms

PPO: Proximal Policy Optimization, a policy gradient method that limits how far each update moves the policy away from the old one by optimizing a clipped surrogate objective.

Surrogate Objective: An objective function used in PPO/TRPO involving the ratio of new to old policy probabilities, allowing data reuse via importance sampling.

Importance Sampling: A technique to estimate properties of a target distribution using samples from a different proposal distribution (e.g., using old policy data to update a new policy).

GAE: Generalized Advantage Estimation, a method to estimate the advantage function (how good an action is compared to average) that balances bias and variance.

Dropout Strategy: In this context, a mechanism to selectively discard specific training samples (transitions) within a batch to lower the variance of the estimator.

TRPO: Trust Region Policy Optimization, an algorithm that aims for monotonic policy improvement by maximizing the surrogate objective subject to a constraint on the KL divergence between the old and new policies.