RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using RL where the reward is a binary correctness check (e.g., math answer is correct).
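A minimal sketch of a verifiable reward in the RLVR sense: a binary correctness check on a model response. The `Answer:` extraction convention here is hypothetical, chosen only for illustration; real pipelines use task-specific parsers or symbolic checkers.

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the extracted final answer matches the gold answer, else 0.0."""
    # Hypothetical convention: the final answer follows an "Answer:" marker.
    match = re.search(r"Answer:\s*(\S+)", response)
    predicted = match.group(1) if match else ""
    return 1.0 if predicted == gold_answer else 0.0
```

Because the reward is a hard 0/1 signal rather than a learned score, there is no reward model to hack; the trade-off is that partially correct reasoning earns nothing.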
forking tokens: High-entropy tokens that act as decision points in a reasoning chain, branching into different potential logical paths.
DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization—an RLVR algorithm that modifies GRPO by removing the KL penalty, decoupling the clip ranges (clip-higher), dynamically resampling to filter out prompt groups whose rewards are all identical, and computing the loss at the token level.
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the group average, removing the need for a critic network.
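The group-relative advantage estimate at the core of GRPO can be sketched as follows. This is a simplified illustration of the normalization step only (reward minus group mean, divided by group standard deviation), not the full training objective.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Advantage for each sample in a group: (r_i - group mean) / group std.

    No critic network is needed; the group of sampled completions for the
    same prompt serves as the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All rewards identical: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

With binary RLVR rewards, a group of [1, 0, 1, 0] yields advantages [1, -1, 1, -1]: correct samples are reinforced, incorrect ones suppressed, relative to the group average.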
PPO: Proximal Policy Optimization—a standard RL algorithm that limits policy updates to a small region to ensure stability.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
entropy: A measure of uncertainty in the model's next-token prediction distribution; high entropy means the model considers many possible next tokens.
token entropy: The entropy of the model's next-token distribution at position t, computed from the full softmax over the logits; it is a property of the distribution itself, independent of which token is actually sampled.
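Token entropy as defined above can be computed directly from a logit vector. A minimal sketch, using plain floats rather than a tensor library:

```python
import math

def token_entropy(logits: list[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over the vocabulary.

    Depends only on the logits at position t, not on the sampled token.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Uniform logits over V tokens give the maximum entropy log(V); a sharply peaked distribution gives entropy near zero, i.e. the model is nearly certain of its next token.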
AIME: American Invitational Mathematics Examination—a challenging math competition used as a benchmark for reasoning models.