Large Language Models Cannot Self-Correct Reasoning Yet

📝 Paper Summary

LLM Reliability · Self-Correction · Reasoning

Current LLMs cannot inherently correct their own reasoning errors without external feedback, often degrading performance by altering correct answers to incorrect ones when prompted to self-correct.

Core Problem

Prior research claims LLMs can 'self-correct' to improve performance, but these studies often rely on oracle labels (ground truth) to guide the process or use weak initial prompts.

Why it matters:

Reliance on oracle labels makes 'self-correction' impractical for real-world applications where ground truth is unknown
Deploying self-correction loops that degrade performance wastes computation and lowers system reliability
Misconceptions about LLM capabilities hinder the development of methods that actually improve reasoning, such as proper verification

Concrete Example: In GSM8K math problems, when GPT-3.5 is asked to review its answer without knowing if it's right or wrong, it changes a correct answer to an incorrect one more often than it fixes an error. For instance, it might correctly solve a word problem, but upon review, 'hallucinate' a mistake and change the final value.

Key Novelty

Critical Refutation of Intrinsic Self-Correction

Defines 'intrinsic self-correction' as correction without external feedback or oracle labels, isolating the model's inherent ability to spot errors
Demonstrates that reported gains in prior work vanish or reverse when oracle labels are removed
Shows that 'Multi-Agent Debate' gains are attributable to consistency (majority voting) rather than actual constructive critique

Evaluation Highlights

Intrinsic self-correction lowers GPT-3.5 accuracy on GSM8K from 77.4% to 75.9% (-1.5 percentage points) relative to standard prompting
Intrinsic self-correction lowers GPT-3.5 accuracy on CommonSenseQA from 66.8% to 55.4% (-11.4 percentage points)
Multi-agent debate (84.7%) matches Self-Consistency (84.7%) when both use the same number of model calls (3), so debate offers no advantage over simple voting
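
The Self-Consistency baseline in the last highlight is majority voting over independently sampled answers. The following is a minimal sketch of that voting step, assuming the final answers have already been extracted as strings; the function name and the sample values are illustrative and do not come from the paper.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent answer among the sampled final answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common() keeps first-seen order among equal counts,
    # so a tie is resolved in favour of the earliest sample.
    return counts.most_common(1)[0][0]


# Three hypothetical samples for one question (made-up values).
samples = ["18", "20", "18"]
voted = self_consistency(samples)  # voted == "18"
```

Comparing debate against this baseline at an equal number of model calls is what separates the effect of extra samples from the effect of critique.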

Breakthrough Assessment

8/10

A significant 'negative result' paper that corrects the scientific record regarding LLM self-correction capabilities, exposing methodological flaws in highly cited prior work.

⚙️ Technical Details

Problem Definition

Setting: Intrinsic Self-Correction for Reasoning Tasks

Inputs: Natural language question Q, initial model response A_0

Outputs: Refined answer A_final

Pipeline Flow

Initial Generation (Standard Prompting)
Feedback Generation (Self-Correction Prompt)
Revision (Final Answer Generation)

System Modules

Initial Generator (Generation)

Produces the first attempt at an answer

Model or implementation: GPT-3.5-Turbo / GPT-4 / GPT-4-Turbo / Llama-2-70b-chat

Feedback Generator

Reviews the initial answer to identify potential errors without external labels

Model or implementation: Same as Initial Generator

Revision Module (Generation)

Generates the final answer based on the critique

Model or implementation: Same as Initial Generator
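
The three modules can be sketched as a single loop around one model. This is only an illustration under stated assumptions: `llm` is a hypothetical callable that maps a prompt string to a response string, the prompt wording below is a placeholder rather than the paper's exact text, and `stub_llm` is a fake model used so the example runs without any API.

```python
from typing import Callable

# Placeholder prompt wording (assumption); the paper lists its own prompts.
FEEDBACK_PROMPT = "Review your previous answer and find problems with it."
REVISE_PROMPT = "Based on the problems you found, improve your answer."


def intrinsic_self_correct(question: str, llm: Callable[[str], str], rounds: int = 1) -> str:
    """Generate, critique, and revise using the same model and no external labels."""
    answer = llm(question)  # Initial Generator
    for _ in range(rounds):
        # Feedback Generator: the model reviews its own answer.
        feedback = llm(f"{question}\nAnswer: {answer}\n{FEEDBACK_PROMPT}")
        # Revision Module: the model rewrites the answer using its own critique.
        answer = llm(f"{question}\nAnswer: {answer}\nFeedback: {feedback}\n{REVISE_PROMPT}")
    return answer


def stub_llm(prompt: str) -> str:
    """Fake model (assumption): answers "18", then talks itself into "20"."""
    if REVISE_PROMPT in prompt:
        return "20"
    if FEEDBACK_PROMPT in prompt:
        return "The arithmetic may be wrong."
    return "18"


final = intrinsic_self_correct("What is 9 + 9?", stub_llm)  # final == "20"
```

The stub deliberately turns a correct first answer into an incorrect one, which is the failure mode the paper reports when no external signal tells the loop to stop.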

Modeling

Base Model: Evaluated on GPT-3.5-Turbo (gpt-3.5-turbo-0613), GPT-4, GPT-4-Turbo (gpt-4-1106-preview), Llama-2-70b-chat

Compute: Not reported in the paper (Inference-only evaluation)

Comparison to Prior Work

vs. Reflexion/RCI: This paper removes the 'oracle' assumption, showing that without ground truth to stop the loop, performance drops.
vs. Self-Refine: This paper shows that optimizing the initial prompt (Standard Prompting) often outperforms the self-refine loop.
vs. Multi-Agent Debate: This paper demonstrates that debate offers no gain over Self-Consistency when controlling for sample count.
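
Why the oracle assumption matters can be seen in a toy comparison. In the sketch below, all answers are invented for illustration, and the two functions are hypothetical simplifications: the oracle-gated variant only accepts a revision when the label says the initial answer was wrong, while the intrinsic variant has no label and takes whatever the model produces after review.

```python
def revise_with_oracle(initial, revised, gold):
    """Oracle-gated: keep the initial answer whenever it already matches the label."""
    return [a if a == g else r for a, r, g in zip(initial, revised, gold)]


def revise_intrinsic(initial, revised):
    """Intrinsic: no label is available, so the post-review answer is always used."""
    return list(revised)


def accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


# Invented toy data: gold labels, first answers, and answers after self-review.
gold = ["4", "7", "9", "2"]
initial = ["4", "7", "8", "2"]   # 3 of 4 correct
revised = ["4", "6", "9", "3"]   # one error fixed, two correct answers broken

acc_initial = accuracy(initial, gold)                                       # 0.75
acc_oracle = accuracy(revise_with_oracle(initial, revised, gold), gold)     # 1.0
acc_intrinsic = accuracy(revise_intrinsic(initial, revised), gold)          # 0.5
```

With the oracle gate, accuracy can never fall below the initial value, because correct answers are never touched. Removing the gate exposes every correct answer to revision.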

Limitations

Analysis focuses primarily on reasoning tasks (Math, QA); self-correction might still work for style or safety alignment
Experiments limited to 2 rounds of self-correction
Did not evaluate fine-tuning models specifically to self-correct (focused on prompting pre-trained LLMs)

Reproducibility

Prompts for self-correction and standard generation are provided in the paper and its appendix. Specific model versions (e.g., gpt-3.5-turbo-0613) are specified. A code URL is not explicitly provided in the text.

📊 Experiments & Results

Evaluation Setup

Zero-shot or few-shot prompting on standard reasoning benchmarks

Benchmarks:

GSM8K (Grade school math reasoning)
CommonSenseQA (Multiple-choice commonsense reasoning)
HotpotQA (Multi-hop Question Answering)

Metrics:

Accuracy
Exact Match
Concept Coverage (for constrained generation)

Statistical methodology: Not explicitly reported in the paper

Key Results

Experiments on Intrinsic Self-Correction (no oracle) show performance degradation across multiple models and benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	77.4	75.9	-1.5
GSM8K	Accuracy	92.0	91.5	-0.5
CommonSenseQA	Accuracy	66.8	55.4	-11.4
CommonSenseQA	Accuracy	63.0	51.0	-12.0

Comparison of Multi-Agent Debate against Self-Consistency (controlling for compute cost).
Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy	84.7	84.7	0.0
GSM8K	Accuracy	86.8	84.7	-2.1

Re-evaluation of Constrained Generation (Madaan et al.) with improved initial prompts.
Benchmark	Metric	Baseline	This Paper	Δ
Constrained Generation	Concept Coverage	61.3	66.7	+5.4

Experiment Figures

Bar charts showing the transition of answer correctness after self-correction (Correct→Correct, Correct→Incorrect, Incorrect→Correct, Incorrect→Incorrect).
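
The four categories in these charts can be tallied from per-question answers before and after self-correction. The following is a minimal sketch, assuming answers are compared to the gold label by exact string match; the function name and the data are made up for illustration.

```python
from collections import Counter


def transition_counts(before, after, gold):
    """Count Correct/Incorrect transitions between two answer lists."""
    labels = {True: "Correct", False: "Incorrect"}
    counts = Counter()
    for b, a, g in zip(before, after, gold):
        counts[f"{labels[b == g]}→{labels[a == g]}"] += 1
    return counts


# Invented toy data for five questions.
gold = ["a", "b", "c", "d", "e"]
before = ["a", "b", "x", "d", "y"]
after = ["a", "x", "c", "x", "y"]

counts = transition_counts(before, after, gold)
# counts["Correct→Correct"] == 1, counts["Correct→Incorrect"] == 2,
# counts["Incorrect→Correct"] == 1, counts["Incorrect→Incorrect"] == 1
```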

Main Takeaways

Oracle labels create an illusion of self-correction capability; without them, models struggle to identify their own errors.
Models are more likely to change a correct answer to an incorrect one than to fix an incorrect answer (e.g., in GSM8K).
Multi-agent debate essentially functions as a consistency check (majority vote) rather than a reasoning refinement process.
Claims of self-correction success in prior work often stem from sub-optimal initial prompts; fixing the prompt is more effective than adding a correction loop.
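
The second takeaway maps directly onto the accuracy change. As a short derivation (the notation is introduced here, not taken from the paper), let N be the number of evaluated questions and let n with a subscript denote the number of questions in each transition category:

```latex
\begin{align*}
\mathrm{Acc}_{\text{before}} &= \frac{n_{C\to C} + n_{C\to I}}{N}, \qquad
\mathrm{Acc}_{\text{after}} = \frac{n_{C\to C} + n_{I\to C}}{N}, \\
\Delta &= \mathrm{Acc}_{\text{after}} - \mathrm{Acc}_{\text{before}}
        = \frac{n_{I\to C} - n_{C\to I}}{N}.
\end{align*}
```

Accuracy therefore drops exactly when more correct answers are broken than incorrect ones are fixed. For the reported GPT-3.5 GSM8K result, Δ = 75.9 - 77.4 = -1.5 points, meaning Correct→Incorrect flips outnumber Incorrect→Correct fixes by 1.5% of the evaluated questions.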

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs)
Standard Prompting vs. Chain-of-Thought (CoT)
Self-Consistency decoding

Key Terms

intrinsic self-correction: A setting where the LLM attempts to correct its initial response based solely on its own capabilities, without external feedback or ground truth labels

oracle labels: Ground-truth correctness information (e.g., from a test set) used to decide whether to stop or continue the self-correction loop

Self-Consistency: A decoding strategy that samples multiple reasoning paths and selects the most consistent answer (majority voting)

Multi-Agent Debate: A method where multiple LLM instances critique each other's responses to reach a consensus

CommonSenseQA: A multiple-choice question answering dataset requiring commonsense knowledge

GSM8K: A dataset of grade school math word problems requiring multi-step reasoning

HotpotQA: A dataset for multi-hop question answering