Clip-Low Increases Entropy and Clip-High Decreases Entropy in Reinforcement Learning of Large Language Models

📝 Paper Summary

Reinforcement Learning with Verifiable Rewards (RLVR) Optimization Stability

The clipping mechanism in PPO and GRPO structurally biases policy entropy: the lower clip on negative advantages increases entropy, while the upper clip on positive advantages decreases it. Under standard symmetric clipping the upper clip dominates, which drives entropy collapse.

Core Problem

In RLVR, Large Language Models (LLMs) quickly converge to a near-deterministic state ('entropy collapse') regardless of the reward signal, which hinders exploration and long-term learning progress.

Why it matters:

Entropy collapse prevents the model from exploring new reasoning paths, limiting the effectiveness of prolonged reinforcement learning training
Current mitigation strategies like KL-divergence penalties are heuristic interventions that do not address the underlying mechanistic cause of the collapse
Understanding this dynamic is crucial for mathematical reasoning tasks where sustained exploration is necessary to find correct solution paths

Concrete Example: When an LLM is trained with PPO using standard symmetric clipping (epsilon_low = epsilon_high = 0.2) on mathematical problems, the model's response diversity shrinks (entropy drops) even if the rewards are purely random noise, indicating that the algorithm itself, not the reward signal, pushes the policy toward determinism.

Key Novelty

Mechanistic Entropy Bias of Clipping

The paper theoretically proves that the 'clip-low' mechanism (limiting updates for negative advantages) acts as an entropy increaser, while 'clip-high' (limiting positive advantages) acts as an entropy decreaser
Demonstrates that under standard symmetric clipping settings, the 'clip-high' effect dominates, causing a net reduction in entropy irrespective of the reward signal
Proposes controlling entropy dynamics by deliberately tuning these clipping bounds asymmetrically (e.g., tightening clip-low) rather than relying solely on external entropy regularization

Evaluation Highlights

Theoretical proofs confirm clip-low increases entropy and clip-high decreases entropy in both Policy Gradient and Natural Policy Gradient settings
Empirical experiments on GSM8K with purely random rewards show consistent entropy reduction across Qwen, Llama, and Olmo families, refuting model-specific explanations
Adjusting clipping parameters (e.g., decreasing epsilon-low) successfully reverses entropy collapse in controlled experiments

Breakthrough Assessment

8/10

Provides a fundamental mechanistic explanation for a widespread problem (entropy collapse) that was previously treated with heuristics. The use of random rewards to isolate optimizer bias is a clever and convincing analytical tool.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement Learning for LLMs (RL-LLM) where the policy is optimized to maximize expected reward on reasoning tasks

Inputs: Prompt x sampled from training distribution D

Outputs: Response y (sequence of tokens)

Pipeline Flow

Policy Sampling (Generate K responses)
Reward Calculation (Oracle or Random)
Advantage Estimation (Group Relative)
Clipped Gradient Update
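
The reward and advantage steps of this pipeline can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names are hypothetical, the standard-deviation normalization follows the usual GRPO convention and is assumed here, and the small `eps` guard against a zero-variance group is also an assumption.

```python
import random
import statistics

def random_rewards(k, p=0.5, rng=None):
    """Draw k Bernoulli(p) rewards that ignore the response content."""
    rng = rng or random.Random()
    return [1.0 if rng.random() < p else 0.0 for _ in range(k)]

def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one prompt's group of K responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, K = 8 sampled responses, purely random rewards.
rng = random.Random(0)
rewards = random_rewards(k=8, rng=rng)
advantages = group_relative_advantages(rewards)
```

Because the advantages are centered within the group, they sum to roughly zero, and a group in which every response receives the same reward contributes no learning signal at all.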

System Modules

Policy Model

Generate responses to prompts

Model or implementation: Qwen2.5-3B-Instruct or Llama3-8B-Instruct

Clipping Mechanism

Constrain the policy update ratio to [1-eps_low, 1+eps_high]

Model or implementation: Mathematical Operation

Modeling

Base Model: Qwen2.5-3B-Instruct and Llama3-8B-Instruct

Training Method: GRPO (Group Relative Policy Optimization) / DAPO variant

Objective Functions:

Purpose: Maximize expected reward while limiting policy deviation.

Formally: J = E[ min( r_t * A_t, clip(r_t, 1-eps_low, 1+eps_high) * A_t ) ]
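
A per-token version of this objective can be written directly from the formula. The sketch below is illustrative only; the function names are hypothetical, and the expectation over tokens and samples is left to the caller.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token term min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    clipped_ratio = clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the upper bound caps the term once the ratio exceeds 1 + eps_high.
high_side = clipped_surrogate(ratio=1.5, advantage=1.0)
# Negative advantage: the lower bound caps the term once the ratio drops below 1 - eps_low.
low_side = clipped_surrogate(ratio=0.5, advantage=-1.0)
# The other two combinations pass through unclipped because of the outer min.
free_pos = clipped_surrogate(ratio=0.5, advantage=1.0)
free_neg = clipped_surrogate(ratio=1.5, advantage=-1.0)
```

With separate `eps_low` and `eps_high` arguments, the two bounds can be tuned independently, which is the asymmetric setting the summary refers to.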

Training Data:

GSM8K dataset used for prompts
Rewards are random (Bernoulli with p=0.5) for the primary analysis experiments

Key Hyperparameters:

learning_rate: 5e-7
batch_size_grpo: 512
batch_size_optimizer: 256
clip_epsilon_low: Varied (0.2 in standard)
clip_epsilon_high: Varied (0.2 in standard)
temperature_rollout: 1.0
temperature_validation: 0.6
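
For experimentation, the listed values can be collected in a small configuration object. This is a hypothetical container, not the verl configuration schema: the class name `RunConfig` is invented here, the field names simply mirror the list above, and the tightened clip-low value is illustrative rather than a setting reported in the summary.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    """Hyperparameters as listed above; clipping bounds default to the standard 0.2."""
    learning_rate: float = 5e-7
    batch_size_grpo: int = 512
    batch_size_optimizer: int = 256
    clip_epsilon_low: float = 0.2
    clip_epsilon_high: float = 0.2
    temperature_rollout: float = 1.0
    temperature_validation: float = 0.6

standard = RunConfig()
# Illustrative asymmetric variant: only the lower clipping bound changes.
tighter_clip_low = replace(standard, clip_epsilon_low=0.1)
```

Keeping the two clipping bounds as separate fields makes it straightforward to vary one while holding the other fixed.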

Compute: Not reported in the paper

Comparison to Prior Work

vs. DAPO: This paper provides the theoretical proof *why* asymmetric clipping works (by separating entropy contributions), whereas DAPO was more heuristic
vs. Shao et al. (2025): This paper shows random rewards cause entropy minimization (and potential collapse) across *all* model families (Llama, Olmo, Qwen), refuting the idea that it is Qwen-specific behavior
vs. ProRL: Suggests that tuning clipping parameters is a more fundamental control for entropy than adding auxiliary loss terms like KL divergence

Limitations

The theoretical analysis relies on first-order approximations of the Policy Gradient and Natural Policy Gradient updates
The primary empirical validation covered in this summary uses random rewards to isolate the clipping bias, rather than aiming to maximize benchmark performance
The paper does not report statistical significance tests for the entropy trends (though trends appear consistent)
Long-term impact on reasoning accuracy (pass@k) is discussed qualitatively; detailed performance tables are not included in the text this summary is based on

Reproducibility

The paper uses the open-source 'verl' framework. Implementation details such as batch sizes and learning rates are provided. The dataset (GSM8K) and models (Qwen, Llama) are public. A code URL is not explicitly provided in the text this summary is based on.

📊 Experiments & Results

Evaluation Setup

Analysis of policy entropy dynamics under RL training with random rewards

Benchmarks:

GSM8K (Prompts only) (Mathematical Reasoning)

Metrics:

Policy Entropy (Token-level Shannon entropy)
Statistical methodology: Not explicitly reported in the paper
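
Token-level Shannon entropy can be computed from the policy's next-token logits. The sketch below is a plain-Python illustration with hypothetical function names; it assumes natural logarithms (entropy in nats) and a simple mean over positions, neither of which is specified in the summary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_sequence_entropy(logits_per_position):
    """Average token entropy over all positions of one response."""
    entropies = [token_entropy(row) for row in logits_per_position]
    return sum(entropies) / len(entropies)

uniform = token_entropy([0.0, 0.0, 0.0, 0.0])   # maximal for 4 tokens
peaked = token_entropy([100.0, 0.0, 0.0])       # nearly deterministic
```

A uniform distribution over V tokens gives the maximum value log(V), while a sharply peaked distribution gives a value close to zero, which is the regime described as entropy collapse.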

Experiment Figures

Entropy dynamics during training as clipping hyperparameters epsilon-low and epsilon-high are varied.

Policy entropy trends for Qwen, Llama, and Olmo models trained with random rewards.

Main Takeaways

Clipping is not entropy-neutral: Standard symmetric clipping (epsilon=0.2) inherently reduces entropy because the 'clip-high' effect dominates the 'clip-low' effect.
Lowering epsilon-low (tightening the lower clip) increases policy entropy, while lowering epsilon-high (tightening the upper clip) decreases it.
Training with purely random rewards leads to entropy reduction across all tested model families (Qwen, Llama, Olmo), indicating that RLVR algorithms function as entropy minimizers in the absence of signal.
Entropy collapse can be prevented mechanistically by tuning clipping thresholds (e.g., using asymmetric clipping) without needing auxiliary loss functions.
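
To see which side of the clip is active for a given token, the ratio and the sign of the advantage can be checked against the two bounds. The helper below is a hypothetical diagnostic, not part of the paper or of any library; it relies only on the standard property that the clipped objective has zero gradient with respect to the ratio once the relevant bound is exceeded.

```python
from collections import Counter

def clip_status(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Label a token as 'clip-high', 'clip-low', or 'unclipped'."""
    if advantage > 0 and ratio > 1.0 + eps_high:
        return "clip-high"   # positive advantage, ratio above the upper bound
    if advantage < 0 and ratio < 1.0 - eps_low:
        return "clip-low"    # negative advantage, ratio below the lower bound
    return "unclipped"

def clip_counts(ratios, advantages, eps_low=0.2, eps_high=0.2):
    """Count how many tokens fall into each clipping category."""
    return Counter(
        clip_status(r, a, eps_low, eps_high)
        for r, a in zip(ratios, advantages)
    )

ratios = [1.5, 0.5, 1.5, 0.5, 0.85]
advantages = [1.0, -1.0, -1.0, 1.0, -1.0]
symmetric = clip_counts(ratios, advantages)
tight_low = clip_counts(ratios, advantages, eps_low=0.1)
```

In this toy batch, tightening `eps_low` from 0.2 to 0.1 moves the token with ratio 0.85 and negative advantage into the clip-low category, so more tokens are subject to the lower clip.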

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO/GRPO algorithms)
Policy Gradient methods
Shannon Entropy

Key Terms

RLVR: Reinforcement Learning with Verifiable Rewards—using RL to improve LLM reasoning where correctness can be automatically checked (e.g., math problems)

PPO: Proximal Policy Optimization—an RL algorithm that constrains policy updates using a clipping mechanism to prevent instability

GRPO: Group Relative Policy Optimization—a PPO variant that normalizes advantages within a group of outputs for the same prompt, often used for LLMs

Clip-low: The lower bound of the PPO clipping range (1 - epsilon_low); it takes effect when the advantage is negative and the policy ratio falls below this bound

Clip-high: The upper bound of the PPO clipping range (1 + epsilon_high); it takes effect when the advantage is positive and the policy ratio rises above this bound

Entropy collapse: The phenomenon where a policy becomes near-deterministic (entropy close to zero) too quickly during training, which halts exploration

Advantage: A value estimating how much better a specific action was compared to the average or expected action