PPO: Proximal Policy Optimization—an RL algorithm that clips updates to prevent the policy from changing too drastically
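A minimal numpy sketch of PPO's clipped surrogate (function name and the default eps=0.2 are illustrative, not from this glossary): the importance ratio is clipped so a single update cannot move the policy too far.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-token PPO surrogate: clip the importance ratio to
    [1 - eps, 1 + eps] to cap how drastically the policy can change."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum keeps the objective pessimistic.
    return np.minimum(unclipped, clipped)

# A ratio of 2.0 with positive advantage is capped at (1 + eps) * A.
print(ppo_clip_objective(np.array([2.0]), np.array([1.0])))  # → [1.2]
```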
GRPO: Group Relative Policy Optimization—a critic-free variant of PPO that normalizes rewards within a group of outputs for the same prompt
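A sketch of the critic-free normalization GRPO uses (helper name and the 1e-8 stabilizer are illustrative): each response's reward is standardized against the group sampled for the same prompt.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward by the mean and std
    of the group of outputs for one prompt, so no learned critic is needed."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Four sampled answers to one prompt, scored 0/1 by a verifier.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get positive advantage, wrong ones negative
```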
PA&LP: Positive-Advantage Low-Probability tokens—tokens that are good actions (positive advantage) but currently unlikely; their gradients encourage exploration
NA&LP: Negative-Advantage Low-Probability tokens—tokens that are bad actions (negative advantage) and currently unlikely; their gradients encourage exploitation (convergence)
stop gradient: An operation that prevents error signals from backpropagating through a specific part of the computation graph, used here to decouple the clipping condition from the gradient value
importance sampling ratio: The ratio of the probability of an action under the new policy vs. the old policy; used to estimate the new policy's value using old data
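The ratio is typically computed from log-probabilities for numerical stability; a small sketch (function name illustrative):

```python
import numpy as np

def importance_ratio(logp_new, logp_old):
    """pi_new(a|s) / pi_old(a|s), computed via exp of the log-prob
    difference rather than dividing raw probabilities."""
    return np.exp(logp_new - logp_old)

# A token that became twice as likely under the new policy has ratio 2.
r = importance_ratio(np.log(0.2), np.log(0.1))  # → 2.0
```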
entropy collapse: A failure mode where the policy becomes deterministic too quickly, stopping exploration
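Entropy collapse can be made concrete by comparing Shannon entropy of two next-token distributions (a minimal numpy sketch; the helper name is illustrative):

```python
import numpy as np

def policy_entropy(probs):
    """Shannon entropy of a next-token distribution; values near zero
    mean the policy is nearly deterministic (collapsed)."""
    probs = np.asarray(probs, dtype=float)
    return -np.sum(probs * np.log(probs + 1e-12))

print(policy_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: still exploring
print(policy_entropy([0.999, 0.001, 0.0, 0.0]))  # near zero: collapsed
```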
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—a baseline method whose decoupled (Clip-Higher) clipping raises the upper clipping bound to encourage exploration
AIME: American Invitational Mathematics Examination—a challenging math competition benchmark
avg@32: Evaluation metric averaging the score over 32 sampled responses per prompt