
Overestimation, Overfitting, and Plasticity in Actor-Critic: the Bitter Lesson of Reinforcement Learning

Michał Nauman, Michał Bortkiewicz, M. Ostaszewski, Piotr Miłoś, Tomasz Trzciński, Marek Cygan
International Conference on Machine Learning (2024)
RL Benchmark

📝 Paper Summary

Off-policy Reinforcement Learning · Deep Learning · Regularization
General neural network regularization techniques like Layer Normalization outperform complex, RL-specific algorithmic interventions in stabilizing off-policy agents and preventing plasticity loss.
Core Problem
Off-policy RL agents with high replay ratios suffer from instability issues like value overestimation, overfitting, and plasticity loss, which are typically treated with narrow, domain-specific algorithmic fixes.
Why it matters:
  • Current solutions are often tested in isolation on limited benchmarks, masking whether improvements come from specific RL mechanics or general stability
  • Standard model-free agents fail completely on complex tasks like the Dog domain due to plasticity loss, previously necessitating complex model-based approaches
  • The 'Bitter Lesson' suggests generic computation/regularization scales better than hand-crafted algorithmic heuristics, but this hasn't been fully verified for RL stability
Concrete Example: In the 'dog-run' task, a standard Soft Actor-Critic agent fails to learn because its neural networks lose 'plasticity' (the ability to adapt) due to frequent updates. RL-specific fixes like 'Generalized Pessimism Learning' fail to solve this, whereas simply adding Layer Normalization allows the agent to learn a successful running policy.
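The fix in the example above is just the standard Layer Normalization step applied inside the critic's hidden layers. A minimal sketch of that normalization (learnable gain and bias omitted for brevity):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector of activations to zero mean and unit variance.

    Sketch of the Layer Normalization step the paper inserts into the
    critic network; the learnable affine parameters are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hidden activations that have drifted to large magnitudes -- the kind of
# drift associated with plasticity loss under high replay ratios.
activations = [120.0, -45.0, 300.0, 7.5]
normed = layer_norm(activations)
```

Because the output statistics are fixed regardless of how far the raw activations drift, each layer keeps receiving well-scaled inputs even under frequent updates.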
Key Novelty
The 'Bitter Lesson' for RL Regularization
  • Systematically decouples regularization methods into three groups: Critic Regularization (algorithmic fixes), Network Regularization (architectural fixes), and Plasticity Regularization (learning dynamics fixes)
  • Demonstrates that generic deep learning regularizers (specifically Layer Norm and Spectral Norm) are more effective at reducing Q-value overestimation than methods explicitly designed for that purpose (like Clipped Double Q-learning)
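For reference, the RL-specific baseline named above, Clipped Double Q-learning (as used in TD3 and SAC), fights overestimation by bootstrapping from the minimum of two critics. A minimal sketch of that TD-target computation:

```python
def clipped_double_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Pessimistic TD target: bootstrap from the smaller of two critics.

    Sketch of Clipped Double Q-learning, the RL-specific overestimation
    fix the paper compares against generic network regularizers.
    """
    bootstrap = 0.0 if done else gamma * min(q1_next, q2_next)
    return reward + bootstrap

# min(10.0, 8.0) = 8.0 -> target = 1.0 + 0.99 * 8.0 = 8.92
target = clipped_double_q_target(reward=1.0, q1_next=10.0, q2_next=8.0)
```

The paper's finding is that Layer Norm reduces the critic's overestimation bias more than this explicit pessimism mechanism does.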
Evaluation Highlights
  • Network regularization (Layer Norm) enables model-free SAC agents to solve Dog domain tasks (e.g., dog-trot, dog-run), which were previously considered impossible for model-free approaches
  • Layer Normalization is found to be more effective at reducing critic overestimation than Clipped Double Q-learning, a standard RL-specific technique
  • Combining Network Regularization with Plasticity Regularization (Resets) yields state-of-the-art robustness across 14 diverse tasks in DeepMind Control and MetaWorld
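The "Resets" intervention combined with network regularization above is simply a periodic re-initialization of the agent's network weights while the replay buffer is kept. A minimal sketch, with the schedule and the small-uniform re-initialization range as hypothetical choices:

```python
import random

def maybe_reset(params, step, reset_every=200_000, init_scale=0.01, rng=random):
    """Re-draw network parameters on a fixed schedule.

    Sketch of the 'Resets' plasticity intervention: at every
    `reset_every` steps the weights are re-initialized (here from a
    small uniform range, a hypothetical choice) while the replay
    buffer is retained, restoring the network's ability to fit new
    targets.
    """
    if step > 0 and step % reset_every == 0:
        return [rng.uniform(-init_scale, init_scale) for _ in params]
    return params

params = [5.0, -3.0, 12.0]               # weights that have drifted
params = maybe_reset(params, step=200_000)  # reset step: weights re-drawn
```

After a reset the agent briefly re-learns from the retained buffer, which is what makes the combination with network regularization robust across tasks.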
Breakthrough Assessment
8/10
Provides a strong empirical rebuttal to the trend of complex RL-specific algorithmic fixes, showing that standard DL regularization is often superior. The result on Dog tasks for model-free agents is a significant milestone.