
Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models

Jaesung R. Park, Junsu Kim, Gyeongman Kim, Jinyoung Jo, Sean Choi, Jaewoong Cho, Ernest K. Ryu
arXiv (2025)
RL Reasoning

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) · Optimization Stability
The clipping mechanism in PPO and GRPO structurally biases policy entropy—specifically, the lower clip on negative advantages increases entropy while the upper clip on positive advantages decreases it—driving entropy collapse.
Core Problem
In RLVR, Large Language Models (LLMs) quickly converge to a near-deterministic state ('entropy collapse') regardless of the reward signal, which hinders exploration and long-term learning progress.
Why it matters:
  • Entropy collapse prevents the model from exploring new reasoning paths, limiting the effectiveness of prolonged reinforcement learning training
  • Current mitigation strategies like KL-divergence penalties are heuristic interventions that do not address the underlying mechanistic cause of the collapse
  • Understanding this dynamic is crucial for mathematical reasoning tasks where sustained exploration is necessary to find correct solution paths
Concrete Example: When training an LLM with PPO using standard symmetric clipping (ε = 0.2) on mathematical problems, the model's response diversity vanishes (entropy drops) even when the rewards are purely random noise, demonstrating that the clipping mechanism itself, not the reward signal, pushes the policy toward determinism.
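The asymmetry behind this example can be seen directly in the clipped surrogate. A minimal sketch (function names and toy numbers are illustrative, not from the paper): for a positive-advantage sample the gradient vanishes once the probability ratio exceeds 1 + ε_high (clip-high), and for a negative-advantage sample once it falls below 1 − ε_low (clip-low).

```python
# Hedged sketch of PPO's clipped surrogate with separate clip bounds.
# `eps_low`/`eps_high` and the sample values are illustrative.

def ppo_clipped_term(ratio: float, advantage: float,
                     eps_low: float = 0.2, eps_high: float = 0.2) -> float:
    """One sample's surrogate: min(r*A, clip(r, 1-eps_low, 1+eps_high)*A)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

def gradient_active(ratio: float, advantage: float,
                    eps_low: float = 0.2, eps_high: float = 0.2) -> bool:
    """True if this sample still contributes a nonzero gradient."""
    if advantage > 0:   # clip-high silences further probability increases
        return ratio < 1.0 + eps_high
    else:               # clip-low silences further probability decreases
        return ratio > 1.0 - eps_low

print(gradient_active(1.3, +1.0))  # False: positive-advantage update stopped
print(gradient_active(0.7, -1.0))  # False: negative-advantage update stopped
```

With symmetric ε the two bounds silence different updates: clip-high stops reinforcing tokens whose probability has already risen, clip-low stops suppressing tokens whose probability has already fallen; the paper's claim is that the net effect of these two one-sided brakes is entropy-decreasing.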
Key Novelty
Mechanistic Entropy Bias of Clipping
  • The paper theoretically proves that the 'clip-low' mechanism (limiting the update ratio on negative-advantage tokens) acts as an entropy increaser, while 'clip-high' (limiting it on positive-advantage tokens) acts as an entropy decreaser
  • Demonstrates that under standard symmetric clipping settings, the 'clip-high' effect dominates, causing a net reduction in entropy irrespective of the reward signal
  • Proposes controlling entropy dynamics by deliberately tuning these clipping bounds asymmetrically (e.g., tightening clip-low) rather than relying solely on external entropy regularization
Evaluation Highlights
  • Theoretical proofs confirm clip-low increases entropy and clip-high decreases entropy in both Policy Gradient and Natural Policy Gradient settings
  • Empirical experiments on GSM8K with purely random rewards show consistent entropy reduction across Qwen, Llama, and Olmo families, refuting model-specific explanations
  • Adjusting clipping parameters (e.g., decreasing epsilon-low) successfully reverses entropy collapse in controlled experiments
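The last point can be illustrated with a toy count of which samples survive clipping (all sample values and names below are hypothetical): decreasing ε_low silences probability-decreasing updates sooner, which per the paper's analysis strengthens the entropy-increasing side of the clipping bias.

```python
# Hypothetical sketch: counting which (ratio, advantage) samples still carry
# gradient under asymmetric clip bounds. All values are toy examples.
samples = [(0.85, -1.0), (0.95, -1.0), (1.05, +1.0), (1.25, +1.0)]

def n_active(samples, eps_low, eps_high):
    """Number of samples whose clipped-surrogate gradient is nonzero."""
    count = 0
    for r, a in samples:
        # clip-high silences positive-advantage updates past 1 + eps_high;
        # clip-low silences negative-advantage updates below 1 - eps_low.
        count += (r < 1.0 + eps_high) if a > 0 else (r > 1.0 - eps_low)
    return count

print(n_active(samples, eps_low=0.2, eps_high=0.2))  # 3: only r=1.25 clipped
print(n_active(samples, eps_low=0.1, eps_high=0.2))  # 2: r=0.85 now clipped too
```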
Breakthrough Assessment
8/10
Provides a fundamental mechanistic explanation for a widespread problem (entropy collapse) that was previously treated with heuristics. The use of random rewards to isolate optimizer bias is a clever and convincing analytical tool.