RLVR: Reinforcement Learning with Verifiable Rewards—using binary correct/incorrect feedback based on the final answer to train reasoning models
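A minimal sketch of a verifiable reward in the RLVR sense; exact-match string comparison is an illustrative simplification (real verifiers may normalize or symbolically compare answers):

```python
def verifiable_reward(predicted_answer: str, reference_answer: str) -> float:
    # Binary RLVR reward: 1 if the final answer matches the reference, else 0.
    # No partial credit is given for intermediate reasoning steps.
    return 1.0 if predicted_answer == reference_answer else 0.0
```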
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled outputs for the same prompt, removing the need for a separate value network
Dr.GRPO: A variant of GRPO that modifies the advantage function by removing the standard deviation term in the denominator for better stability
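The two advantage computations can be sketched side by side; function names and the `eps` stabilizer are illustrative, not from a specific implementation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    # GRPO: center each reward on the group mean and divide by the
    # group standard deviation (eps avoids division by zero when all
    # rewards in the group are identical).
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    # Dr.GRPO: keep the mean-centering but drop the std denominator.
    mu = mean(rewards)
    return [r - mu for r in rewards]

# With binary RLVR rewards for a group of 4 sampled outputs:
rewards = [1.0, 0.0, 0.0, 1.0]
```

Because the group mean replaces a learned baseline, no separate value network is needed.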
Entropy Collapse: A phenomenon where a policy becomes deterministic too early in training, stopping exploration and producing identical outputs
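Collapse is typically monitored via the entropy of the next-token distribution; a small sketch of that diagnostic (in nats):

```python
import math

def token_entropy(probs):
    # Shannon entropy of a single next-token distribution (in nats).
    # A near-deterministic (peaked) distribution has near-zero entropy,
    # which is the signature of entropy collapse.
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = token_entropy([0.25, 0.25, 0.25, 0.25])   # maximal: log(4)
peaked = token_entropy([0.999, 0.001, 0.0, 0.0])    # close to 0
```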
Maj@k: Majority vote accuracy at k—generating k solutions and selecting the most frequent answer as the final prediction
Pass@k: The probability that at least one of k generated solutions is correct
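Both metrics can be computed from n samples per prompt. The Pass@k formula below is the standard unbiased estimator (1 - C(n-c, k)/C(n, k), popularized by the Codex/HumanEval evaluation); the Maj@k helper and its tie-breaking are illustrative:

```python
from collections import Counter
from math import comb

def maj_at_k(answers, correct):
    # Maj@k: is the most frequent of the k sampled answers correct?
    # Ties are broken by first occurrence, an illustrative choice.
    return Counter(answers).most_common(1)[0][0] == correct

def pass_at_k(n, c, k):
    # Unbiased Pass@k estimator from n samples of which c are correct.
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note the contrast: Maj@k rewards the answer the policy concentrates on, while Pass@k rewards diversity, so entropy collapse tends to hurt Pass@k first.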
Policy Nucleus: The smallest set of highest-probability tokens whose cumulative probability reaches the top-p threshold, representing the model's semantically meaningful options
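A minimal sketch of selecting the nucleus from a next-token distribution; the dict-based interface is illustrative:

```python
def nucleus(token_probs, p=0.9):
    # Greedily take tokens in descending probability order until the
    # cumulative mass reaches p; the result is the smallest such set.
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for tok, prob in ranked:
        chosen.append(tok)
        total += prob
        if total >= p:
            break
    return chosen
```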
Self-anchored Regularization: A regularization term that penalizes deviations from the model's initial aggregated entropy rather than maximizing entropy indiscriminately
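A sketch of the anchoring idea; the squared-deviation form and the coefficient are assumptions for illustration, not the definition's exact functional form:

```python
def self_anchor_penalty(current_entropy, anchor_entropy, coeff=0.01):
    # Penalize deviation of the policy's aggregated entropy from its
    # initial (anchor) value, in either direction, rather than pushing
    # entropy up indiscriminately as a plain entropy bonus would.
    return coeff * (current_entropy - anchor_entropy) ** 2
```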