Evaluation Setup
Multi-step reasoning tasks across arithmetic, commonsense, and logic domains.
Benchmarks:
- GSM8K (Arithmetic Reasoning)
- StrategyQA (Commonsense Reasoning)
- AQuA (Arithmetic Reasoning)
- Date Understanding (Commonsense Reasoning, BigBench)
- Object Tracking (Logical Reasoning, BigBench)
Metrics:
- Accuracy
- Token Consumption
- Statistical methodology: Not explicitly reported in the paper
Main Takeaways
- LeCo consistently improves reasoning accuracy across varying model sizes (7B to GPT-4) and tasks.
- The logit-based confidence metric effectively identifies error steps (approximately 65% detection rate), supporting the hypothesis that models are uncertain at the steps where they hallucinate.
- Unlike standard self-correction, which often increases token usage significantly, LeCo reduces consumption by truncating at the suspected error step and reusing the correct prefix.
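The prefix-reuse idea in the takeaways above can be sketched as follows. This is a simplified stand-in, not the paper's exact method: here step confidence is just the mean token log-probability (the paper combines several logit-derived signals), and the function names, example steps, and log-prob values are all hypothetical.

```python
def step_confidence(token_logprobs):
    # Mean token log-probability as a simple per-step confidence proxy
    # (a simplified stand-in for LeCo's logit-based score).
    return sum(token_logprobs) / len(token_logprobs)

def truncate_at_least_confident(steps, step_logprobs):
    # Keep only the steps before the least-confident one, so the model
    # regenerates from the suspected error step onward instead of
    # rewriting the whole chain of thought.
    confidences = [step_confidence(lp) for lp in step_logprobs]
    worst = min(range(len(confidences)), key=confidences.__getitem__)
    return steps[:worst]  # reused prefix; steps[worst:] are regenerated

# Hypothetical example: three reasoning steps with per-token log-probs.
steps = ["Step 1: there are 3 boxes of 4 apples",
         "Step 2: 3 * 4 = 13",
         "Step 3: the answer is 13"]
logprobs = [[-0.1, -0.2], [-1.5, -2.0], [-0.3, -0.4]]
print(truncate_at_least_confident(steps, logprobs))
```

Because only the suffix after the low-confidence step is regenerated, the per-round token cost stays below that of resampling the full solution, which is the source of the savings noted above.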