
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
University of Illinois Urbana-Champaign
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Unsupervised Fine-tuning · Reinforcement Learning (RL) for Reasoning · Inference-time Scaling
Minimizing the entropy of a pre-trained LLM's outputs, without any labeled data or external supervision, significantly improves its performance on complex math and coding reasoning tasks.
Core Problem
Standard post-training methods like Supervised Fine-Tuning (SFT) and RL require expensive labeled data or reward models, and it is unclear whether models can self-improve using only their pre-trained capabilities.
Why it matters:
  • Labeled data for complex reasoning tasks (e.g., scientific coding) is scarce, expensive to annotate, and often hard to verify automatically
  • Pre-trained models likely already possess latent reasoning capabilities that are underutilized by standard decoding strategies
  • Current self-improvement methods often rely on majority voting or outcome verification, which are inapplicable when answers cannot be easily extracted or verified (e.g., creative coding)
Concrete Example: In scientific coding tasks like SciCode where output verification is hard, a standard model might generate diverse but incorrect solutions due to high uncertainty. EM forces the model to 'commit' to its most confident path, often recovering the correct solution where exploration would fail.
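The intuition behind 'committing' can be made concrete with Shannon entropy over a next-token distribution: a model that spreads probability mass over many candidate tokens has high entropy, while one concentrated on its top choice has low entropy. The distributions below are illustrative toy values, not from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A 'diverse' model spreads mass across many candidate tokens...
diffuse = [0.25, 0.25, 0.25, 0.25]
# ...while a 'committed' model concentrates on its most confident choice.
peaked = [0.97, 0.01, 0.01, 0.01]

assert token_entropy(peaked) < token_entropy(diffuse)
```

Entropy minimization pushes each step's distribution from the diffuse shape toward the peaked one, which is why it suppresses exploration in favor of the model's single most confident path.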
Key Novelty
Entropy Minimization (EM) as a standalone objective
  • Treats high confidence as a proxy for correctness in capable pre-trained models, training them to simply be 'more sure' of their own generations
  • Introduces three unlabeled methods: EM-FT (fine-tuning on model samples to minimize token entropy), EM-RL (RL with negative entropy as the only reward), and EM-INF (inference-time logit adjustment)
  • Demonstrates that reducing uncertainty alone, without ground-truth labels or verifiers, can elicit strong reasoning behaviors
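The shared core of these methods, using entropy itself as the training signal, can be sketched in a few lines. The toy below performs gradient descent directly on a single logit vector to minimize the entropy of its softmax; the real EM-FT objective averages this over tokens of model-generated samples and updates network parameters, so this is a much-simplified stand-in.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

# Toy stand-in for one next-token distribution (illustrative, not from the paper).
rng = np.random.default_rng(0)
logits = rng.normal(size=10)

history = []
for _ in range(50):
    p = softmax(logits)
    h = entropy(p)
    history.append(h)
    # Analytic gradient of H(softmax(z)) w.r.t. z is -p * (log p + H);
    # stepping against it sharpens the distribution toward its argmax.
    grad = -p * (np.log(p + 1e-12) + h)
    logits -= 0.5 * grad

assert history[-1] < history[0]  # entropy has dropped: the model is 'more sure'
```

Note there is no label anywhere in the loop: the only quantity being optimized is the model's own uncertainty, which is exactly what makes the objective unsupervised.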
Evaluation Highlights
  • Qwen-32B with EM-INF matches or exceeds GPT-4o and Claude 3 Opus on the challenging SciCode benchmark
  • EM-RL on Qwen-7B outperforms strong labeled RL baselines (GRPO, RLOO trained on 60K labeled examples) on LeetCode and Minerva math tasks without seeing a single label
  • EM-FT improves base model performance by ~8% on average across math and coding tasks using only unlabeled prompts
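The EM-RL result above is notable because its reward needs no verifier at all. A hedged sketch of such a reward, scoring a sampled sequence by the negative mean entropy of its per-token distributions, is shown below; the paper's exact estimator may differ, and the logit arrays are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sequence_reward(step_logits):
    """Negative mean token entropy of a sampled sequence, used as the
    sole reward signal in place of any label or verifier (sketch)."""
    p = softmax(step_logits)                   # (T, vocab) per-step distributions
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)  # per-token entropy
    return float(-h.mean())                    # confident sequences score higher

confident = np.array([[4.0, 0.0, 0.0], [5.0, 1.0, 0.0]])  # peaked steps
uncertain = np.array([[0.1, 0.0, 0.2], [0.0, 0.1, 0.0]])  # near-uniform steps
assert sequence_reward(confident) > sequence_reward(uncertain)
```

Plugging this reward into a standard policy-gradient loop (e.g., RLOO-style baselines) is what distinguishes EM-RL from label-driven RL: the optimization machinery is unchanged, only the reward source is.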
Breakthrough Assessment
8/10
Surprisingly effective simple objective that challenges the assumption that external supervision is needed for reasoning improvements. Performance matching labeled baselines is a significant finding for unsupervised learning.