
Entropy-Preserving Reinforcement Learning

Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, Philipp Krähenbühl
arXiv (2026)
RL Reasoning Agent

📝 Paper Summary

Reinforcement Learning for Language Models · Policy Gradient Optimization
This paper identifies standard policy gradient objectives and numerical-precision issues as causes of premature entropy collapse in language models, and proposes entropy-preserving objectives (REPO, ADAPO) that maintain output diversity and improve performance.
Core Problem
Online policy gradient algorithms (like GRPO and PPO) often suffer from 'entropy collapse,' where the policy distribution narrows too quickly around a local optimum.
Why it matters:
  • Collapse degrades the diversity of generated solutions (lowering the pass@k metric), leaving the model brittle and unable to explore alternative correct paths
  • Premature convergence reduces the model's trainability for sequential learning in new environments
  • Implementation factors like BF16 precision and framework behaviors (FSDP2 casting) silently accelerate this collapse, causing training instability
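The precision effect in the last bullet can be illustrated without any deep-learning framework. The sketch below (an illustration, not the paper's implementation) emulates bfloat16 by truncating a float32 to its top 16 bits, then compares the entropy of a peaked next-token distribution computed in full precision versus bf16: the rounded probabilities yield a slightly different entropy estimate, and such small systematic errors can accumulate across millions of gradient steps.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its high 16 bits
    (real hardware rounds to nearest; truncation suffices to illustrate
    the ~8-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def entropy(ps):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# A peaked next-token distribution, typical late in RL training.
probs = [0.99, 0.005, 0.003, 0.002]

h_fp32 = entropy(probs)
h_bf16 = entropy([to_bf16(p) for p in probs])
print(h_fp32, h_bf16)  # the bf16 estimate differs from the fp32 one
```

The numbers themselves are tiny, which is the point: the bias is invisible per step but, per the paper's claim, compounds into an accelerated entropy collapse over training.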
Concrete Example: A base model might output five different valid Python scripts to solve a problem (high diversity/entropy). After standard GRPO training, it collapses to generating the exact same script five times. If that specific script fails an edge case, the model has no alternative solution (entropy collapse), whereas an entropy-preserving policy would retain the diverse options.
Key Novelty
Entropy-Preserving Policy Optimization & Numerical Stabilization
  • Argues that the 'trajectory' of entropy during training is more critical than the final value; preserving diversity throughout the 'journey' prevents local optima
  • Identifies that low-precision arithmetic (BF16) and framework casting (FSDP2) artificially accelerate entropy collapse, and proposes numerical fixes
  • Introduces REPO (modifies advantage function) and ADAPO (adaptive asymmetric clipping) to explicitly regulate entropy dynamics
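The summary does not give ADAPO's exact formulation, but the general idea of asymmetric clipping can be sketched as a PPO-style surrogate with different lower and upper clip bounds (all names and the default epsilons here are illustrative assumptions, not the paper's):

```python
def asym_clip_objective(ratio: float, advantage: float,
                        eps_low: float = 0.2,
                        eps_high: float = 0.3) -> float:
    """Illustrative asymmetric clipped surrogate (per token).

    ratio:     pi_new(a|s) / pi_old(a|s), the importance ratio
    advantage: estimated advantage of the sampled token

    A larger eps_high leaves more headroom to up-weight currently
    low-probability tokens, which counteracts the entropy collapse
    that a tight symmetric clip tends to encourage.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the up-weighting is capped at 1 + eps_high.
print(asym_clip_objective(1.5, 1.0))   # 1.3
# Negative advantage: the down-weighting is floored at 1 - eps_low.
print(asym_clip_objective(0.5, -1.0))  # -0.8
```

An adaptive variant, as the name ADAPO suggests, would adjust these bounds during training based on the observed entropy trajectory rather than fixing them in advance.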
Evaluation Highlights
  • Achieves 79% Test Normal accuracy on AppWorld benchmark using numerical fixes alone (claimed State-of-the-Art)
  • Achieves 71% Test Challenge accuracy on AppWorld
  • Entropy-preserving methods (REPO, ADAPO) close the performance gap to on-policy training while retaining trainability
Breakthrough Assessment
9/10
Identifies a critical, overlooked cause of RL failure (numerical precision in entropy dynamics) and provides both theoretical analysis of entropy in PPO/GRPO and practical SOTA results.