
Maximizing Confidence Alone Improves Reasoning

Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, Deepak Pathak
Carnegie Mellon University
arXiv.org (2025)
RL · Reasoning · Benchmark

📝 Paper Summary

Unsupervised Reinforcement Learning · Test-Time Adaptation · Reasoning
RENT improves language model reasoning without ground-truth labels by using reinforcement learning to minimize the entropy (uncertainty) of the model's generated reasoning steps.
Core Problem
Reinforcement learning for reasoning typically relies on ground-truth labels to define reward functions, which are often unavailable in real-world or open-ended scenarios.
Why it matters:
  • Reliance on labeled data restricts the applicability of RL to domains where external supervision is scarce or expensive
  • Existing test-time adaptation methods such as majority voting (TTRL) produce sparse reward signals and do not extend well to long-form free-response questions
  • Current reasoning models struggle to self-correct or improve in the absence of external feedback
Concrete Example: When a student takes an exam without an answer key, they cannot check if they are right (external reward), but they can refine their thinking until they feel certain (intrinsic confidence). Standard RL cannot do this; it requires the answer key.
Key Novelty
RENT (Reinforcement Learning via Entropy Minimization)
  • Uses the model's own output confidence (negative entropy) as the sole reward signal, requiring no ground-truth answers
  • Identifies that minimizing uncertainty in the 'last chunk' of the reasoning chain—rather than the beginning or the specific answer tokens alone—correlates best with accuracy
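The confidence reward described above can be sketched as follows. This is a minimal illustration, assuming access to each generated token's next-token probability distribution; the function names are illustrative and not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_reward(token_distributions):
    """Negative mean per-token entropy over a generated sequence.

    Higher reward means a more confident (lower-entropy) generation,
    so maximizing this reward minimizes the model's uncertainty.
    """
    entropies = [token_entropy(p) for p in token_distributions]
    return -sum(entropies) / len(entropies)

# A peaked (confident) generation scores higher than a uniform (uncertain) one.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 3
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
assert confidence_reward(peaked) > confidence_reward(uniform)
```

In an RL loop, this scalar would stand in for the ground-truth reward, which is what lets the method train without labeled answers.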
Evaluation Highlights
  • Outperforms format-based rewards and majority-voting (TTRL) baselines across GSM8K, MATH500, AMC, AIME, and GPQA benchmarks [Numeric values not in source text]
  • Demonstrates consistent accuracy gains across multiple model families (Qwen, Mistral, Llama) and sizes (1.5B to 8B) using only intrinsic rewards
  • Empirically validates that 'last chunk' token entropy correlates significantly better with accuracy than 'first chunk' or specific answer token entropy
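The 'last chunk' finding above can be probed with a simple diagnostic: given the per-token entropies of a reasoning trace, compare the mean entropy of the trailing fraction against the leading fraction. This is a hedged sketch with illustrative names, not the paper's analysis code.

```python
def chunk_entropy(token_entropies, chunk_frac=0.25, which="last"):
    """Mean per-token entropy over the first or last fraction of a trace.

    chunk_frac: fraction of the trace to average over (e.g. 0.25 = a quarter).
    which: "last" for the end of the reasoning chain, "first" for the start.
    """
    n = max(1, int(len(token_entropies) * chunk_frac))
    chunk = token_entropies[-n:] if which == "last" else token_entropies[:n]
    return sum(chunk) / len(chunk)

# Example trace whose uncertainty drops as reasoning converges on an answer.
trace = [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.5, 0.5]
last = chunk_entropy(trace, 0.25, "last")    # mean of [0.5, 0.5] = 0.5
first = chunk_entropy(trace, 0.25, "first")  # mean of [2.0, 2.0] = 2.0
```

Per the paper's correlation finding, the last-chunk value is the better proxy for accuracy, which motivates rewarding confidence late in the chain rather than early.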
Breakthrough Assessment
8/10
The proposed method improves reasoning using strictly unsupervised, intrinsic rewards, a significant step toward self-improving models that do not depend on labeled data.