
Entropy-Preserving Reinforcement Learning

Aleksei Petrenko, Ben Lipkin, Kevin Chen, Erik Wijmans, Marco Cusumano-Towner, Raja Giryes, Philipp Krähenbühl
arXiv (2026)
RL Reasoning Agent

📝 Paper Summary

Reinforcement Learning for Language Models · Policy Gradient Optimization
This paper identifies standard policy gradient objectives and numerical-precision issues as causes of premature entropy collapse in language models, and proposes entropy-preserving objectives (REPO, ADAPO) that maintain output diversity and improve performance.
Core Problem
Online policy gradient algorithms (like GRPO and PPO) often suffer from 'entropy collapse,' where the policy distribution narrows too quickly around a local optimum.
Why it matters:
  • Collapse degrades the diversity of generated solutions (lowering the pass@k metric), leaving the model brittle and unable to explore alternative correct paths
  • Premature convergence reduces the model's trainability for sequential learning in new environments
  • Implementation factors like BF16 precision and framework behaviors (FSDP2 casting) silently accelerate this collapse, causing training instability
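The precision effect in the last bullet can be illustrated without any deep-learning framework. The sketch below (an illustration, not the paper's implementation) emulates bfloat16 by truncating a float32 to its top 16 bits, then compares the entropy of a peaked next-token distribution computed in full precision versus bf16: the rounded probabilities yield a slightly different entropy estimate, and such small systematic errors can accumulate across millions of gradient steps.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by truncating a float32 to its high 16 bits
    (real hardware rounds to nearest; truncation suffices to illustrate
    the ~8-bit mantissa)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def entropy(ps):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in ps if p > 0)

# A peaked next-token distribution, typical late in RL training.
probs = [0.99, 0.005, 0.003, 0.002]

h_fp32 = entropy(probs)
h_bf16 = entropy([to_bf16(p) for p in probs])
print(h_fp32, h_bf16)  # the bf16 estimate differs from the fp32 one
```

The numbers themselves are tiny, which is the point: the bias is invisible per step but, per the paper's claim, compounds into an accelerated entropy collapse over training.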
Concrete Example: A base model might output five different valid Python scripts to solve a problem (high diversity/entropy). After standard GRPO training, it collapses to generating the exact same script five times. If that specific script fails an edge case, the model has no alternative solution (entropy collapse), whereas an entropy-preserving policy would retain the diverse options.
Key Novelty
Entropy-Preserving Policy Optimization & Numerical Stabilization
  • Argues that the 'trajectory' of entropy during training is more critical than the final value; preserving diversity throughout the 'journey' prevents local optima
  • Identifies that low-precision arithmetic (BF16) and framework casting (FSDP2) artificially accelerate entropy collapse, and proposes numerical fixes
  • Introduces REPO (modifies advantage function) and ADAPO (adaptive asymmetric clipping) to explicitly regulate entropy dynamics
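The summary does not give ADAPO's exact formulation, but the general idea of asymmetric clipping can be sketched as a PPO-style surrogate with different lower and upper clip bounds (all names and the default epsilons here are illustrative assumptions, not the paper's):

```python
def asym_clip_objective(ratio: float, advantage: float,
                        eps_low: float = 0.2,
                        eps_high: float = 0.3) -> float:
    """Illustrative asymmetric clipped surrogate (per token).

    ratio:     pi_new(a|s) / pi_old(a|s), the importance ratio
    advantage: estimated advantage of the sampled token

    A larger eps_high leaves more headroom to up-weight currently
    low-probability tokens, which counteracts the entropy collapse
    that a tight symmetric clip tends to encourage.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the up-weighting is capped at 1 + eps_high.
print(asym_clip_objective(1.5, 1.0))   # 1.3
# Negative advantage: the down-weighting is floored at 1 - eps_low.
print(asym_clip_objective(0.5, -1.0))  # -0.8
```

An adaptive variant, as the name ADAPO suggests, would adjust these bounds during training based on the observed entropy trajectory rather than fixing them in advance.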
Evaluation Highlights
  • Achieves 79% Test Normal accuracy on AppWorld benchmark using numerical fixes alone (claimed State-of-the-Art)
  • Achieves 71% Test Challenge accuracy on AppWorld
  • Entropy-preserving methods (REPO, ADAPO) close the performance gap to on-policy training while retaining trainability
Breakthrough Assessment
9/10
Identifies a critical, overlooked cause of RL failure (numerical precision in entropy dynamics) and provides both theoretical analysis of entropy in PPO/GRPO and practical SOTA results.