Maximizing Confidence Alone Improves Reasoning

📝 Paper Summary

Unsupervised Reinforcement Learning, Test-Time Adaptation, Reasoning

RENT improves language model reasoning without ground-truth labels by using reinforcement learning to minimize the entropy (uncertainty) of the model's generated reasoning steps.

Core Problem

Reinforcement learning for reasoning typically relies on ground-truth labels to define reward functions, which are often unavailable in real-world or open-ended scenarios.

Why it matters:

Reliance on labeled data restricts the applicability of RL to domains where external supervision is scarce or expensive
Existing test-time adaptation methods such as TTRL rely on majority voting, which yields a sparse reward signal and does not apply well to long-form free-response questions
Current reasoning models struggle to self-correct or improve in the absence of external feedback

Concrete Example: When a student takes an exam without an answer key, they cannot check if they are right (external reward), but they can refine their thinking until they feel certain (intrinsic confidence). Standard RL cannot do this; it requires the answer key.

Key Novelty

RENT (Reinforcement Learning via Entropy Minimization)

Uses the model's own output confidence (negative entropy) as the sole reward signal, requiring no ground-truth answers
Identifies that minimizing uncertainty in the 'last chunk' of the reasoning chain—rather than the beginning or the specific answer tokens alone—correlates best with accuracy

Evaluation Highlights

Outperforms format-based rewards and majority-voting (TTRL) baselines across GSM8K, MATH500, AMC, AIME, and GPQA benchmarks [Numeric values not in source text]
Demonstrates consistent accuracy gains across multiple model families (Qwen, Mistral, Llama) and sizes (1.5B to 8B) using only intrinsic rewards
Empirically validates that 'last chunk' token entropy correlates significantly better with accuracy than 'first chunk' or specific answer token entropy

Breakthrough Assessment

8/10

The proposed method improves reasoning using strictly unsupervised intrinsic rewards, a significant step toward self-improving models that do not depend on labeled data.

⚙️ Technical Details

Problem Definition

Setting: Unsupervised Reinforcement Learning / Test-Time Adaptation where the model updates its policy on the test set without labels

Inputs: Reasoning question x (e.g., math problem)

Outputs: Chain-of-thought and final answer y_pred

Pipeline Flow

Policy Generation (LLM generates response)
Entropy Calculation (Compute uncertainty of specific tokens)
Reward Formulation (Reward = -Entropy)
Optimization (GRPO update)
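The first three pipeline steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the per-token logits of a sampled response are already available as a NumPy array, and the function names `token_entropies` and `entropy_reward` are hypothetical.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the next-token distribution at each position.

    logits: array of shape (num_tokens, vocab_size), one row per generated token.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def entropy_reward(logits: np.ndarray) -> float:
    """Reward = negative mean token entropy (assumed aggregation: plain mean)."""
    return float(-token_entropies(logits).mean())

# Usage: a confident (peaked) response earns a higher reward than a uniform one.
uniform_logits = np.zeros((3, 4))
peaked_logits = np.tile([10.0, 0.0, 0.0, 0.0], (3, 1))
print(entropy_reward(uniform_logits), entropy_reward(peaked_logits))
```

The resulting scalar is what the optimization step would receive as the reward for that response.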

System Modules

Policy Model

Generates the chain-of-thought and final answer given the input prompt

Model or implementation: Various (e.g., Qwen2.5-7B-Instruct)

Entropy Calculator (Reward Calculation)

Computes the entropy of the probability distribution for each generated token

Model or implementation: Mathematical formula

Reward Aggregator (Reward Calculation)

Aggregates token entropies into a single reward scalar, focusing on the 'last chunk' of the response

Model or implementation: Heuristic selection
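A rough sketch of the 'last chunk' aggregation is given below. The summary does not state how chunks are sized, so the `num_chunks` parameter and the equal-length split are assumptions made purely for illustration, and `last_chunk_reward` is a hypothetical name.

```python
from typing import Sequence

def last_chunk_reward(entropies: Sequence[float], num_chunks: int = 4) -> float:
    """Negative mean entropy over the final chunk of a response.

    Assumption: the response is split into `num_chunks` roughly equal
    contiguous chunks and only the last one contributes to the reward.
    """
    if not entropies:
        raise ValueError("response has no tokens")
    chunk_len = max(1, len(entropies) // num_chunks)
    last = entropies[-chunk_len:]
    return -sum(last) / len(last)

# Usage: early tokens are uncertain, the final tokens are confident.
per_token_entropy = [2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 0.5, 0.5]
print(last_chunk_reward(per_token_entropy, num_chunks=4))  # scores only the last two tokens
```

Setting `num_chunks=1` recovers the whole-response average, which makes it easy to compare the two selection strategies on the same sample.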

Optimizer

Updates the policy model to maximize the expected reward

Model or implementation: GRPO Algorithm
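For orientation, the group-relative part of GRPO can be sketched as below: several responses are sampled for the same prompt, and each response's reward is normalized against the group's mean and standard deviation, so no critic model is needed. This shows only the advantage computation, not the full clipped policy-gradient update; the function name and the `eps` constant are illustrative choices.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-6) -> np.ndarray:
    """Normalize each response's reward against its own sampling group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage: negative-entropy rewards for four responses to one prompt.
# The most confident response (reward closest to zero) gets the largest advantage.
group_rewards = [-0.9, -0.4, -1.3, -0.6]
print(group_relative_advantages(group_rewards))
```

Because the advantages are relative within a group, only differences in confidence between sampled responses matter, not the absolute entropy scale.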

Novel Architectural Elements

Use of 'last chunk' negative entropy as a dense intrinsic reward signal for unsupervised reasoning improvements

Modeling

Base Model: Qwen2.5-7B-Instruct (and variants: Mistral-7B, Llama3.1-8B, Qwen-Math)

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Minimize uncertainty in the generated response.

Formally: Maximize Reward r = -H(π(x)), where H is the entropy of the token distributions.
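Written out per token, the objective might look like the following. The averaging over positions and the symbols $p_t$, $\mathcal{V}$ (vocabulary), and $T$ (the set of scored token positions, e.g. the last chunk) are notation introduced here for illustration rather than taken from the summary.

```latex
p_t(v) = \pi(v \mid x, y_{<t}),
\qquad
H(p_t) = -\sum_{v \in \mathcal{V}} p_t(v) \log p_t(v),
\qquad
r = -\frac{1}{|T|} \sum_{t \in T} H(p_t)
```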

Adaptation: Test-time adaptation (training on the test set without labels)

Key Hyperparameters:

learning_rate: 1e-6

Compute: Not reported in the paper

Comparison to Prior Work

vs. TTRL: RENT uses a dense, continuous entropy signal rather than a sparse binary voting signal, enabling application to free-form generation
vs. Intuitor: RENT's entropy reward corresponds to a reverse KL divergence from the uniform distribution, which is mode-covering, whereas Intuitor's self-certainty reward is mode-seeking
vs. Tent: RENT applies entropy minimization within a Reinforcement Learning framework (GRPO) for sequential generation, rather than just optimizing prediction marginals
vs. Format Reward: RENT encourages confidence in the *content* of the reasoning, not just the structure
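To make the contrast with a voting signal concrete, here is a simplified sketch of a binary majority-vote reward. It is a stand-in written for illustration, not TTRL's actual code, and `majority_vote_rewards` is a hypothetical name. It needs a short, exactly comparable final answer per sample, which is why such a signal is hard to apply to free-form responses, whereas an entropy reward is defined for any generated token sequence.

```python
from collections import Counter
from typing import List

def majority_vote_rewards(answers: List[str]) -> List[float]:
    """Reward 1.0 for samples matching the most common answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority else 0.0 for answer in answers]

# Usage: four sampled final answers to the same question.
print(majority_vote_rewards(["42", "42", "7", "13"]))
```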

Limitations

No numeric results tables included in the provided text (qualitative descriptions only)
Requires the model to have some initial capability; if the base model cannot reason at all, confidence maximization may not help
Risk of overconfidence (collapsing to low-entropy but incorrect answers), though the paper claims 'last chunk' selection mitigates this
Performance depends heavily on the token selection strategy (e.g., 'last chunk' vs 'id_match')

Reproducibility

Code: https://rent-rl.github.io/

📊 Experiments & Results

Evaluation Setup

Unsupervised adaptation on test sets (using the test set for 'training' without labels, then evaluating)

Benchmarks:

GSM8K (Grade-school math word problems)
MATH500 (Competition math problems)
AMC (High school math competition, AMC12)
AIME24 (Advanced invitational math exam)
GPQA (PhD-level science QA)

Metrics:

Accuracy
Entropy / Confidence
Statistical methodology: Standard deviations reported over multiple samples (5, 32, 64, or 10, depending on the dataset)

Experiment Figures

Plots of accuracy and confidence throughout training for Qwen2.5-Math-7B (AMC) and Qwen2.5-7B-Instruct (MATH500)

Correlation between negative entropy and accuracy for different token selection strategies

Main Takeaways

Minimizing entropy via RL (RENT) consistently improves reasoning accuracy across diverse math and science benchmarks without ground-truth supervision.
The 'last chunk' of tokens in a chain-of-thought response contains the most valuable confidence signal; minimizing entropy here correlates strongly with accuracy.
Surprisingly, minimizing entropy specifically on the final answer tokens (id_match) is less effective, suggesting the model's token-level confidence on the final output symbol is not well-calibrated.
RENT outperforms baseline unsupervised methods including Format Reward (syntax only), Spurious (random) rewards, and Test-Time RL (majority voting), particularly on harder tasks like AIME.
Generalization holds across different model families (Qwen, Mistral, Llama) and sizes, indicating the method is model-agnostic.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (Policy Optimization)
Language Model decoding (token probabilities)
Information Theory (Entropy)

Key Terms

RENT: Reinforcement Learning via Entropy Minimization—the proposed method using negative entropy as a reward

Entropy: A measure of the uncertainty or 'spread' of a probability distribution; lower entropy means the model is more confident in its token choice
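As a small worked example with made-up distributions over a four-token vocabulary (natural logarithm, so the unit is nats): a uniform distribution has the maximum entropy, while a sharply peaked one is close to zero.

```latex
H\left(\tfrac14, \tfrac14, \tfrac14, \tfrac14\right)
  = -4 \cdot \tfrac14 \ln \tfrac14 = \ln 4 \approx 1.386
\\
H(0.97,\, 0.01,\, 0.01,\, 0.01)
  = -0.97 \ln 0.97 - 3 \cdot 0.01 \ln 0.01 \approx 0.030 + 0.138 = 0.168
```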

GRPO: Group Relative Policy Optimization, an RL algorithm that samples a group of outputs for each prompt and scores every output relative to the group's average reward, rather than using a critic model

TTRL: Test-Time Reinforcement Learning—a baseline method that typically uses majority voting as a sparse reward signal

Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

Last chunk: The strategy of calculating entropy only on the final segment of the model's response, which the paper finds correlates best with correctness