PPO: Proximal Policy Optimization—a popular policy gradient method that constrains each policy update with a clipped surrogate objective, approximating a trust region
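The clipped surrogate can be illustrated for a single sample. A minimal numpy sketch, assuming the standard clip coefficient of 0.2; the function name is ours, not from the source:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one (state, action) sample.

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage.
    Taking the min of the clipped and unclipped terms removes the
    incentive to push the ratio outside [1 - eps, 1 + eps].
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# With a positive advantage, large ratios are capped at (1 + eps) * A:
print(clipped_surrogate(1.5, 2.0))  # 2.4, not 3.0
```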
feature rank collapse: A phenomenon where the neural network's internal representations lose dimensionality, becoming less expressive and unable to distinguish states effectively
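One common way to quantify this loss of dimensionality is a thresholded effective rank of a batch of features; the source may use a different estimator, so treat this as a sketch of the idea:

```python
import numpy as np

def effective_rank(features, tol=0.01):
    """Effective rank of a feature matrix (samples x dims): the number
    of singular values above tol * (largest singular value). A drop in
    this count over training signals feature rank collapse."""
    s = np.linalg.svd(features, compute_uv=False)  # descending order
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(0)
healthy = rng.normal(size=(256, 32))                    # full-rank features
collapsed = healthy[:, :2] @ rng.normal(size=(2, 32))   # rank-2 features
print(effective_rank(healthy), effective_rank(collapsed))  # 32 2
```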
plasticity: The ability of a neural network to continue learning and adapting to new data distributions over time
capacity loss: A metric measuring the decrease in a network's ability to fit random target labels, indicating a loss of learning capability
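A simplified probe of this idea fits random targets on top of frozen features and reports the residual error. This is a sketch of the concept only: the actual metric trains the network itself, and the function name and setup here are assumptions:

```python
import numpy as np

def capacity_probe(features, n_targets=8, seed=0):
    """Capacity probe (simplified): how well can a linear readout on
    frozen features fit random targets? Higher residual error over
    training indicates capacity loss. The full metric would instead
    train the network's own parameters on the random targets."""
    rng = np.random.default_rng(seed)
    targets = rng.normal(size=(features.shape[0], n_targets))
    # Least-squares fit of a linear readout to the random targets.
    w, *_ = np.linalg.lstsq(features, targets, rcond=None)
    preds = features @ w
    return float(np.mean((preds - targets) ** 2))
```

Low-rank (collapsed) features span a smaller subspace, so they fit random targets worse than full-rank features of the same width.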
trust region: A constraint in optimization (such as PPO's clipping) that prevents the updated policy from moving too far from the old one, keeping each update stable
non-stationarity: The condition where the data distribution (states and rewards) changes over time, which is inherent in RL as the agent's policy changes
pre-activations: The values in a neural network layer before the non-linear activation function (e.g., ReLU) is applied
GAE: Generalized Advantage Estimator—a method for estimating the advantage function (how much better an action is than the policy's average behavior in a state) while trading off bias against variance
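The bias/variance trade-off is controlled by the lambda parameter, which exponentially weights multi-step TD errors. A minimal sketch over a single trajectory, with standard default hyperparameters assumed:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}, computed backwards.
    lam=0 gives the one-step TD advantage (low variance, high bias);
    lam=1 gives the Monte Carlo advantage (high variance, low bias).
    """
    values = np.append(values, last_value)  # bootstrap value for the final state
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```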
PFO: Proximal Feature Optimization—the authors' proposed auxiliary loss to regularize changes in feature representations
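One plausible form of such an auxiliary loss penalizes the distance between current and pre-update feature representations, mirroring how PPO's clipping keeps the policy proximal. This is a hedged sketch of the idea only; the exact loss, distance, and coefficient in the paper may differ:

```python
import numpy as np

def pfo_penalty(features_new, features_old, coef=1.0):
    """Hypothetical auxiliary loss (sketch, not the paper's exact form):
    mean squared change in feature representations between the
    pre-update network and the current one. Added to the PPO objective
    so that features, like the policy, stay in a proximal region."""
    return coef * float(np.mean((features_new - features_old) ** 2))
```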