Improving reasoning at inference time via uncertainty minimisation

📝 Paper Summary

Inference-time scaling, Reasoning, Uncertainty estimation

A verifier-free inference method that selects intermediate reasoning steps by maximizing the model's internal self-certainty, improving performance by stabilizing early planning without external supervision.

Core Problem

Existing inference-time scaling methods are computationally expensive (requiring full rollouts) or unreliable (token-level uncertainty is noisy), while external verifiers require costly training.

Why it matters:

Token-level metrics often conflate epistemic and aleatoric uncertainty, leading to confident but incorrect hallucinations
Full-chain sampling (e.g., Best-of-N) wastes compute on dead-end paths that could be pruned earlier
Reasoning requires dynamic uncertainty resolution (planning), which static decoding strategies fail to capture

Concrete Example: When solving a math problem, a model's uncertainty may rise transiently while it formulates a plan. Token-level greedy decoding might then pick a high-probability generic phrase that leads to a dead end, whereas maximizing 'thought-level' self-certainty selects the specific sub-derivation the model is most committed to internally.

Key Novelty

Thought-Level Self-Certainty Maximization

Shift the unit of analysis from tokens to 'thoughts' (intermediate reasoning steps defined by delimiters) to capture semantic coherence
Select the next reasoning step from k sampled candidates by maximizing self-certainty: the KL divergence between the model's next-token predictive distribution and a uniform distribution, averaged over the tokens of the candidate step
Use internal signals exclusively, removing the need for trained verifiers or external reward models
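
The scoring rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names are hypothetical, and the KL direction KL(U ‖ p) is an assumption, since the summary only says the divergence is taken "between" the predictive and uniform distributions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def self_certainty(logits):
    """Mean KL(U || p_i) over the tokens of one candidate thought.

    logits: array of shape (n_tokens, vocab_size), one row per generated token.
    ASSUMPTION: the direction KL(U || p) is chosen for illustration.
    """
    logits = np.asarray(logits, dtype=np.float64)
    vocab_size = logits.shape[-1]
    log_p = log_softmax(logits)
    # KL(U || p_i) = -log V - (1/V) * sum_j log p_i(j)
    kl_per_token = -np.log(vocab_size) - log_p.mean(axis=-1)
    return float(kl_per_token.mean())

# A flat distribution scores 0; a peaked one scores higher.
flat = np.zeros((3, 5))
peaked = np.array([[10.0, 0.0, 0.0, 0.0, 0.0]] * 3)
print(self_certainty(flat), self_certainty(peaked))
```

Under this convention the score is zero when every next-token distribution is uniform and grows as the distributions become more peaked, so "maximizing self-certainty" favors candidates the model generated with concentrated probability mass.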

Evaluation Highlights

Up to 4× higher accuracy than greedy decoding on Danish GSM8K with Qwen2.5-Instruct (1.5B)
Matches or exceeds Self-Consistency (Majority Voting) baselines on MATH500 and GSM8K under comparable token budgets
Sampling only during the first 1–5 reasoning steps achieves peak accuracy and outperforms sampling at every step; accuracy follows an inverted U as sampling is extended to more steps

Breakthrough Assessment

7/10

Offers a principled, compute-efficient alternative to Majority Voting that relies purely on internal signals. The finding that early-step uncertainty minimization drives performance is a significant insight into LLM reasoning dynamics.

⚙️ Technical Details

Problem Definition

Setting: Step-by-step text generation for multi-step reasoning tasks

Inputs: Natural language question x and previously generated thoughts

Outputs: A sequence of reasoning steps leading to a final answer

Pipeline Flow

Input Question -> Step Sampler (Sample k candidates) -> Scorer (Calculate Self-Certainty) -> Selector (Pick max C) -> Append to Context -> Repeat
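
The flow above can be sketched as a short control loop. This is a hedged sketch rather than the authors' implementation: `sample_candidates`, `stop_marker`, and `max_steps` are hypothetical names, and the toy sampler stands in for an LLM that would return k candidate thoughts together with their self-certainty scores.

```python
from typing import Callable, List, Tuple

# (thought_text, self_certainty_score); the score is assumed precomputed by the sampler.
Candidate = Tuple[str, float]

def certainty_guided_decode(
    question: str,
    sample_candidates: Callable[[str, int], List[Candidate]],  # hypothetical interface
    k: int = 4,
    max_steps: int = 16,
    stop_marker: str = "Final answer",  # hypothetical termination signal
) -> List[str]:
    """Sample k thoughts, keep the most self-certain one, append, repeat."""
    context = question
    chain: List[str] = []
    for _ in range(max_steps):
        candidates = sample_candidates(context, k)
        if not candidates:
            break
        best_text, _ = max(candidates, key=lambda c: c[1])  # argmax over scores
        chain.append(best_text)
        context = context + "\n" + best_text
        if stop_marker in best_text:
            break
    return chain

def toy_sampler(context: str, k: int) -> List[Candidate]:
    """Stand-in for the LLM: fixed candidates with made-up scores."""
    step = context.count("\n")
    if step == 0:
        return [("Plan: add the numbers.", 2.0), ("Hmm, let me think.", 0.5)][:k]
    return [("Final answer: 5", 3.0), ("Maybe 6?", 1.0)][:k]

chain = certainty_guided_decode("What is 2 + 3?", toy_sampler, k=2)
print(chain)
```

Keeping the sampler behind a callable separates the selection logic from the model backend, so the same loop could wrap any model that exposes per-token distributions.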

System Modules

Step Sampler

Generate k candidate continuations for the current reasoning step

Model or implementation: Base LLM (Qwen/Llama)

Certainty Scorer

Compute the self-certainty score for each candidate step

Model or implementation: Statistical Calculation (KL Divergence)

Step Selector

Select the best candidate to append to the reasoning chain

Model or implementation: Argmax

Novel Architectural Elements

Step-wise control loop driven by an internal confidence signal (self-certainty, a KL-based measure) rather than external rewards or final-answer consistency

Modeling

Base Model: Qwen2.5-Instruct (0.5B, 1.5B, 3B) and Llama-3.2-Instruct (1B, 3B)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Self-Consistency: Operates online at the step level rather than aggregating full completed paths
vs. PRMs: Uses internal model signals (self-certainty) instead of requiring a separately trained verifier
vs. Token-level Entropy: Aggregates uncertainty over semantic 'thoughts' to reduce noise and aleatoric confusion
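
For contrast with the step-level selection described here, the Self-Consistency baseline can be sketched as a vote over the final answers of fully completed chains. This is a minimal sketch with a hypothetical helper name; answer extraction and normalization are assumed to have happened upstream.

```python
from collections import Counter
from typing import List

def majority_vote(final_answers: List[str]) -> str:
    """Return the most frequent final answer among completed reasoning chains.

    Ties resolve to the answer seen first, following Counter.most_common ordering.
    """
    counts = Counter(answer.strip() for answer in final_answers)
    return counts.most_common(1)[0][0]

# Every chain must be generated to completion before the vote can take place.
print(majority_vote(["5", "6", "5 ", "7"]))
```

The practical difference is where compute is spent: the vote needs every rollout to finish, while step-level selection discards weaker candidates before they are extended.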

Limitations

Computational overhead compared to simple greedy decoding due to sampling k candidates at each step
Performance degradation observed if sampling is applied to all steps rather than just early steps (over-optimization)
Relies on the assumption that model confidence correlates with correctness, which can fail in miscalibrated models

Reproducibility

Code: https://github.com/centre-for-humanities-computing/m-gsm-symbolic/

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning with Chain-of-Thought prompting

Benchmarks:

MATH500 (Competition-level mathematics)
GSM8K (Grade school math word problems)
Danish GSM8K (translated math word problems; low/mid-resource language) [New]

Metrics:

Accuracy
Statistical methodology: Variability estimated using 8 generations per problem for greedy baseline
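
One plausible way to turn repeated generations into a variability estimate is to compute accuracy per run and report the mean and sample standard deviation across runs. The aggregation below is an assumption made for illustration (the summary does not say how the 8 generations were combined), and the function name is hypothetical.

```python
import statistics
from typing import List, Tuple

def accuracy_mean_std(correct_by_run: List[List[bool]]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-run accuracy.

    correct_by_run[r][p] is True if problem p was solved in run r.
    ASSUMPTION: runs are aggregated by per-run accuracy; at least 2 runs are needed.
    """
    accuracies = [sum(run) / len(run) for run in correct_by_run]
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Two toy runs over two problems (made-up data, not results from the paper).
mean_acc, std_acc = accuracy_mean_std([[True, False], [True, True]])
print(mean_acc, std_acc)
```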

Key Results

Benchmark	Metric	Baseline (greedy)	This Paper	Δ
Danish GSM8K	Accuracy relative to greedy decoding (×, best case)	1.0	4.0	+3.0

Experiment Figures

Evolution of self-certainty gains along the reasoning trajectory.

Accuracy of Qwen-3B on MATH500 when sampling is restricted to the first k steps.

Main Takeaways

Self-certainty maximization consistently matches or exceeds Self-Consistency (Majority Voting) and Greedy decoding across Qwen and Llama model sizes.
The method transfers robustly to Danish (a low/mid-resource setting), suggesting the uncertainty signal is language-agnostic.
Analysis of dynamics shows correct trajectories converge to high certainty early (first ~20 steps), while incorrect ones maintain high uncertainty.
Strategic budget allocation: Sampling is most effective in the first 1–5 steps (planning phase); sampling at later steps yields diminishing or negative returns.
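
The budget-allocation point can be made concrete with a small schedule that samples k candidates only during an early window. This is an illustrative sketch: the function names are hypothetical, and generating a single continuation per step after the window is an assumption, since the summary does not state how later steps are decoded.

```python
from typing import List

def candidates_per_step(num_steps: int, k: int, early_steps: int) -> List[int]:
    """Number of candidate thoughts generated at each step.

    ASSUMPTION: after the first `early_steps` steps, one continuation is generated per step.
    """
    return [k if step < early_steps else 1 for step in range(num_steps)]

def total_sampled_thoughts(num_steps: int, k: int, early_steps: int) -> int:
    """Total thoughts generated along one reasoning chain."""
    return sum(candidates_per_step(num_steps, k, early_steps))

# Toy numbers: a 6-step chain with k = 4 candidates per sampled step.
print(candidates_per_step(6, 4, 2))      # sample in the first 2 steps only
print(total_sampled_thoughts(6, 4, 2))   # early-only budget
print(total_sampled_thoughts(6, 4, 6))   # sampling at every step
```

With these toy numbers the early-only schedule generates half as many thoughts as sampling at every step, which is the compute side of the trade-off the takeaway describes.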

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Kullback-Leibler (KL) Divergence
Language Model Sampling (Greedy, Top-k)

Key Terms

self-certainty: A metric quantifying model confidence, calculated as the average KL divergence between the model's predicted token distribution and a uniform distribution
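
Written out, one common form of this definition is shown below. The symbols are introduced here for illustration: y is a generated step of n tokens, V is the vocabulary size, U is the uniform distribution over the vocabulary, and p is the model's next-token distribution. The direction KL(U ‖ p) is an assumption, as the definition above does not specify it.

```latex
\mathrm{SC}(y \mid x)
  = \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( U \,\middle\|\, p(\cdot \mid x, y_{<i}) \right)
  = -\frac{1}{nV} \sum_{i=1}^{n} \sum_{j=1}^{V} \log\!\left( V \, p(j \mid x, y_{<i}) \right)
```

The second equality follows from expanding the KL divergence with U(j) = 1/V; the score is 0 when every next-token distribution is uniform and increases as the distributions become more peaked.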

thought level: Operating on coherent semantic units (intermediate reasoning steps) rather than individual tokens

inference-time scaling: Allocating more computation during the generation phase (e.g., via sampling or search) to improve performance without retraining

epistemic uncertainty: Uncertainty stemming from a lack of knowledge or model ambiguity, as opposed to inherent randomness in the data

greedy decoding: A generation strategy that always selects the highest-probability next token