
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan, Jiawei Han, Hao Peng
University of Illinois Urbana-Champaign
arXiv.org (2025)
Reasoning RL Benchmark

📝 Paper Summary

Unsupervised Fine-tuning · Reinforcement Learning (RL) for Reasoning · Inference-time Scaling
Minimizing the entropy of a pre-trained LLM's outputs, without any labeled data or external supervision, significantly improves its performance on complex math and coding reasoning tasks.
Core Problem
Standard post-training methods like Supervised Fine-Tuning (SFT) and RL require expensive labeled data or reward models, and it is unclear whether models can self-improve using only their pre-trained capabilities.
Why it matters:
  • Labeled data for complex reasoning tasks (e.g., scientific coding) is scarce, expensive to annotate, and often hard to verify automatically
  • Pre-trained models likely already possess latent reasoning capabilities that are underutilized by standard decoding strategies
  • Current self-improvement methods often rely on majority voting or outcome verification, which are inapplicable when answers cannot be easily extracted or verified (e.g., creative coding)
Concrete Example: In scientific coding tasks like SciCode where output verification is hard, a standard model might generate diverse but incorrect solutions due to high uncertainty. EM forces the model to 'commit' to its most confident path, often recovering the correct solution where exploration would fail.
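The intuition behind 'committing' can be made concrete with Shannon entropy over a next-token distribution: a model that spreads probability mass over many candidate tokens has high entropy, while one concentrated on its top choice has low entropy. The distributions below are illustrative toy values, not from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A 'diverse' model spreads mass across many candidate tokens...
diffuse = [0.25, 0.25, 0.25, 0.25]
# ...while a 'committed' model concentrates on its most confident choice.
peaked = [0.97, 0.01, 0.01, 0.01]

assert token_entropy(peaked) < token_entropy(diffuse)
```

Entropy minimization pushes each step's distribution from the diffuse shape toward the peaked one, which is why it suppresses exploration in favor of the model's single most confident path.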
Key Novelty
Entropy Minimization (EM) as a standalone objective
  • Treats high confidence as a proxy for correctness in capable pre-trained models, training them to simply be 'more sure' of their own generations
  • Introduces three unlabeled methods: EM-FT (fine-tuning on model samples to minimize token entropy), EM-RL (RL with negative entropy as the only reward), and EM-INF (inference-time logit adjustment)
  • Demonstrates that reducing uncertainty alone, without ground-truth labels or verifiers, can elicit strong reasoning behaviors
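The shared core of these methods, using entropy itself as the training signal, can be sketched in a few lines. The toy below performs gradient descent directly on a single logit vector to minimize the entropy of its softmax; the real EM-FT objective averages this over tokens of model-generated samples and updates network parameters, so this is a much-simplified stand-in.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

# Toy stand-in for one next-token distribution (illustrative, not from the paper).
rng = np.random.default_rng(0)
logits = rng.normal(size=10)

history = []
for _ in range(50):
    p = softmax(logits)
    h = entropy(p)
    history.append(h)
    # Analytic gradient of H(softmax(z)) w.r.t. z is -p * (log p + H);
    # stepping against it sharpens the distribution toward its argmax.
    grad = -p * (np.log(p + 1e-12) + h)
    logits -= 0.5 * grad

assert history[-1] < history[0]  # entropy has dropped: the model is 'more sure'
```

Note there is no label anywhere in the loop: the only quantity being optimized is the model's own uncertainty, which is exactly what makes the objective unsupervised.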
Evaluation Highlights
  • Qwen-32B with EM-INF matches or exceeds GPT-4o and Claude 3 Opus on the challenging SciCode benchmark
  • EM-RL on Qwen-7B outperforms strong labeled RL baselines (GRPO, RLOO trained on 60K labeled examples) on LeetCode and Minerva math tasks without seeing a single label
  • EM-FT improves base model performance by ~8% on average across math and coding tasks using only unlabeled prompts
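The EM-RL result above is notable because its reward needs no verifier at all. A hedged sketch of such a reward, scoring a sampled sequence by the negative mean entropy of its per-token distributions, is shown below; the paper's exact estimator may differ, and the logit arrays are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sequence_reward(step_logits):
    """Negative mean token entropy of a sampled sequence, used as the
    sole reward signal in place of any label or verifier (sketch)."""
    p = softmax(step_logits)                   # (T, vocab) per-step distributions
    h = -(p * np.log(p + 1e-12)).sum(axis=-1)  # per-token entropy
    return float(-h.mean())                    # confident sequences score higher

confident = np.array([[4.0, 0.0, 0.0], [5.0, 1.0, 0.0]])  # peaked steps
uncertain = np.array([[0.1, 0.0, 0.2], [0.0, 0.1, 0.0]])  # near-uniform steps
assert sequence_reward(confident) > sequence_reward(uncertain)
```

Plugging this reward into a standard policy-gradient loop (e.g., RLOO-style baselines) is what distinguishes EM-RL from label-driven RL: the optimization machinery is unchanged, only the reward source is.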
Breakthrough Assessment
8/10
Surprisingly effective simple objective that challenges the assumption that external supervision is needed for reasoning improvements. Performance matching labeled baselines is a significant finding for unsupervised learning.