
Large Language Models Cannot Self-Correct Reasoning Yet

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou
Google DeepMind, University of Illinois at Urbana-Champaign
International Conference on Learning Representations (2024)
Topics: Reasoning · Factuality · Benchmark

📝 Paper Summary

Tags: LLM Reliability · Self-Correction · Reasoning
Current LLMs cannot inherently correct their own reasoning errors without external feedback; prompting them to self-correct often degrades performance by turning correct answers into incorrect ones.
Core Problem
Prior research claims that LLMs can 'self-correct' to improve performance, but these studies often rely on oracle labels (ground-truth answers) to decide when to stop correcting, or compare against weak initial prompts.
Why it matters:
  • Reliance on oracle labels makes 'self-correction' impractical for real-world applications where ground truth is unknown
  • Deploying self-correction loops that degrade performance wastes computation and lowers system reliability
  • Misconceptions about LLM capabilities hinder the development of methods that actually improve reasoning, such as proper verification
Concrete Example: In GSM8K math problems, when GPT-3.5 is asked to review its answer without knowing if it's right or wrong, it changes a correct answer to an incorrect one more often than it fixes an error. For instance, it might correctly solve a word problem, but upon review, 'hallucinate' a mistake and change the final value.
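The three-stage review loop the paper evaluates can be sketched as below. This is a minimal illustration, not the authors' code: `call_model` is a hypothetical stand-in for a real LLM API, stubbed here with a toy model that exhibits the failure mode described above (it answers correctly at first, then "hallucinates" a mistake when asked to improve its answer).

```python
def call_model(prompt: str) -> str:
    """Hypothetical LLM call, stubbed so the sketch runs without an API.

    The stub mimics the paper's observed failure mode: the correct initial
    answer gets revised to a wrong one during intrinsic self-correction.
    """
    if "improve your answer" in prompt:
        return "18"  # flips a correct answer to an incorrect one
    if "find problems" in prompt:
        return "One item may have been double-counted."
    return "20"      # the correct initial answer


def intrinsic_self_correct(question: str) -> tuple[str, str]:
    """Run the initial-answer -> critique -> revision loop with no
    external feedback or oracle labels (i.e., intrinsic self-correction)."""
    initial = call_model(question)
    critique = call_model(
        f"Q: {question}\nA: {initial}\n"
        "Review your previous answer and find problems with it."
    )
    final = call_model(
        f"Q: {question}\nA: {initial}\nCritique: {critique}\n"
        "Based on the problems you found, improve your answer."
    )
    return initial, final


initial, final = intrinsic_self_correct(
    "Tom has 4 boxes of 5 apples. How many apples does he have?"
)
print(initial, "->", final)  # prints: 20 -> 18
```

The key point is that the loop never consults ground truth; whether the final answer is better or worse than the initial one depends entirely on the model's own judgment during review.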
Key Novelty
Critical Refutation of Intrinsic Self-Correction
  • Defines 'intrinsic self-correction' as correction without external feedback or oracle labels, isolating the model's inherent ability to spot errors
  • Demonstrates that reported gains in prior work vanish or reverse when oracle labels are removed
  • Shows that 'Multi-Agent Debate' gains are attributable to consistency (majority voting) rather than actual constructive critique
Evaluation Highlights
  • Intrinsic self-correction lowers GPT-3.5 accuracy on GSM8K from 77.4% to 75.9% (-1.5 points) relative to standard prompting
  • Intrinsic self-correction lowers GPT-3.5 accuracy on CommonSenseQA from 66.8% to 55.4% (-11.4 points)
  • Multi-agent debate (84.7%) performs identically to Self-Consistency (84.7%) when using the same number of model calls (3), offering no advantage over simple voting
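The self-consistency baseline referenced above is just majority voting over independently sampled answers. A minimal sketch (with a hypothetical `sample_answer` callable standing in for repeated LLM calls):

```python
from collections import Counter


def self_consistency(sample_answer, question: str, k: int = 3) -> str:
    """Majority vote over k independently sampled answers.

    `sample_answer` is a hypothetical stand-in for one stochastic LLM call.
    The paper's point: multi-agent debate with the same budget of model
    calls offers no advantage over this simple voting baseline.
    """
    votes = Counter(sample_answer(question) for _ in range(k))
    return votes.most_common(1)[0][0]


# Toy sampler returning 20, 20, 18 in turn; the majority answer wins.
samples = iter(["20", "20", "18"])
print(self_consistency(lambda q: next(samples), "toy question"))  # prints: 20
```

With three calls each, debate and voting consume the same budget, which is why the paper attributes debate's reported gains to consistency rather than constructive critique.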
Breakthrough Assessment
8/10
A significant 'negative result' paper that corrects the scientific record regarding LLM self-correction capabilities, exposing methodological flaws in highly cited prior work.