On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks

📝 Paper Summary

LLM Reasoning, Self-Correction/Self-Verification, Automated Planning

LLMs struggle to self-verify or self-correct in formal reasoning tasks, often performing worse than standard prompting, but external sound verifiers can significantly boost performance even without complex critique.

Core Problem

There is a widespread belief that LLMs can self-critique and improve their own solutions iteratively, assuming verification is easier than generation, but this has not been systematically tested on formal reasoning tasks with ground truth.

Why it matters:

Current optimism about self-reflection agents (e.g., Reflexion) relies on the assumption that models can accurately spot their own errors, which may be flawed
Misattributing performance gains to 'self-critique' rather than just iterative guessing obscures the actual source of improvement (often just having a verifier)
Reliability in reasoning domains (planning, math) is critical, and false confidence in self-correction can lead to compounding errors

Concrete Example: In Graph Coloring, an LLM might propose a coloring in which two connected nodes share a color. When asked to verify a candidate, it often hallucinates that the constraint is satisfied or cites non-existent edges, so it accepts invalid solutions and rejects valid ones. The result is lower final accuracy than its initial guess.
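
By contrast, checking a coloring soundly takes only a few lines of ordinary code. The sketch below is a minimal illustration, not the paper's actual verifier script; the function name `coloring_violations`, the edge-list representation, and the example graph are all assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int]


def coloring_violations(edges: Iterable[Edge], coloring: Dict[int, str]) -> List[str]:
    """Return human-readable violations; an empty list means every edge is satisfied."""
    problems = []
    for u, v in edges:
        if u not in coloring or v not in coloring:
            missing = u if u not in coloring else v
            problems.append(f"vertex {missing} is uncolored")
        elif coloring[u] == coloring[v]:
            problems.append(f"edge ({u}, {v}): both endpoints are {coloring[u]}")
    return problems


# Hypothetical instance: a triangle (0, 1, 2) plus one extra vertex attached to 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
good = {0: "red", 1: "green", 2: "blue", 3: "red"}
bad = {0: "red", 1: "red", 2: "blue", 3: "blue"}

print(coloring_violations(edges, good))  # no violations
print(coloring_violations(edges, bad))   # edges (0, 1) and (2, 3) clash
```

Because the check only iterates over the edges actually present in the instance, it cannot "identify" an edge that does not exist, which is exactly the failure mode described for the LLM verifier.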

Key Novelty

Systematic Ablation of Self-Verification

Separates the roles of the LLM into generator, verifier, and critiquer to isolate where failures occur in iterative loops
Demonstrates that 'self-correction' often degrades performance due to high false positive/negative rates in the LLM's verification step
Shows that performance gains in iterative systems come primarily from the presence of a sound external verifier and repeated guessing, not from the semantic content of the critique

Architecture

The iterative prompting architecture used for evaluation, showing the loop between the LLM and the Verifier/Critique modules.

Evaluation Highlights

In Graph Coloring, self-verification (LLM+LLM) degraded accuracy from 16% (standard prompting) to 2%, while a sound verifier boosted it to 38%
In Mystery Blocksworld, self-verification collapsed performance to 0%, whereas a sound verifier achieved 10%
Mere sampling (re-prompting the LLM 15 times with a sound verifier but NO feedback/critique) matched or exceeded the performance of complex feedback loops (e.g., 40% vs 37% in Graph Coloring)

Breakthrough Assessment

7/10

Strong negative result that challenges the prevailing narrative of 'emergent self-reflection'. Crucial for grounding agentic AI research, though it primarily evaluates GPT-4 on specific formal tasks.

⚙️ Technical Details

Problem Definition

Setting: Iterative solution generation and verification on tasks with formally verifiable ground truths

Inputs: Natural language descriptions of reasoning problems (Game of 24, Graph Coloring, STRIPS Planning)

Outputs: A valid solution (expression, coloring assignment, or plan) verified by either the LLM itself or an external tool

Pipeline Flow

Generator (Propose Solution)
Verifier (Check Solution)
Critique Generator (If wrong, explain why)
Backprompter (Feed history + critique back to Generator)
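
The four stages above can be sketched as a single loop. This is a hedged illustration of the control flow only: `backprompt_loop` and the toy `generate`, `verify`, and `critique` callables are hypothetical stand-ins (a real setup would call GPT-4 or an external tool at those points), and the default cap of 15 iterations is an assumption borrowed from the sampling budget mentioned in the Evaluation Highlights.

```python
from typing import Any, Callable, List, Optional, Tuple

History = List[Tuple[Any, str]]  # (candidate, critique) pairs from earlier rounds


def backprompt_loop(
    problem: Any,
    generate: Callable[[Any, History], Any],
    verify: Callable[[Any, Any], bool],
    critique: Callable[[Any, Any], str],
    max_iters: int = 15,
) -> Tuple[Optional[Any], int]:
    """Generate, verify, critique, and backprompt until the verifier accepts."""
    history: History = []
    for _ in range(max_iters):
        candidate = generate(problem, history)      # Generator
        if verify(problem, candidate):              # Verifier (the stop signal)
            return candidate, len(history) + 1
        feedback = critique(problem, candidate)     # Critique Generator
        history.append((candidate, feedback))       # Backprompter input for next round
    return None, max_iters


# Toy stand-ins: find x with x * x == problem. The "generator" ignores the
# critique text and simply tries the next integer.
def toy_generate(problem: int, history: History) -> int:
    return len(history) + 1


def toy_verify(problem: int, candidate: int) -> bool:
    return candidate * candidate == problem


def toy_critique(problem: int, candidate: int) -> str:
    return "too small" if candidate * candidate < problem else "too large"


solution, attempts = backprompt_loop(49, toy_generate, toy_verify, toy_critique)
print(solution, attempts)  # 7 7
```

Swapping `verify` between an LLM call and a sound checker, while holding the rest of the loop fixed, is the kind of substitution the ablation relies on.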

System Modules

Generator

Generate candidate solutions for the problem instance

Model or implementation: GPT-4

Verifier

Determine if the proposed solution is correct

Model or implementation: GPT-4 OR External Sound Verifier (SymPy, Python script, VAL)
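
As an illustration of what a sound verifier can look like for Game of 24, the sketch below checks an expression with exact rational arithmetic. It deliberately uses only the Python standard library (`ast` and `fractions`) as a stand-in for the SymPy-based checker named above; the function name `verify_24` and the rule that each given number must be used exactly once are assumptions of this example.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction
from typing import List, Sequence

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate a parsed expression exactly, recording every integer literal used."""
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, int)
        and not isinstance(node.value, bool)
    ):
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("only +, -, *, / on integer literals are allowed")


def verify_24(numbers: Sequence[int], expression: str) -> bool:
    """True iff the expression uses exactly the given numbers and equals 24."""
    used: List[int] = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _eval(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == 24


print(verify_24([1, 2, 3, 4], "(1 + 2 + 3) * 4"))   # True
print(verify_24([3, 3, 8, 8], "8 / (3 - 8 / 3)"))   # True, thanks to exact fractions
print(verify_24([1, 2, 3, 4], "4 * 6"))             # False: uses a number not given
```

Using `Fraction` rather than floats avoids rounding errors on puzzles whose solutions pass through non-integer intermediate values.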

Critique Generator

Generate textual feedback explaining the error

Model or implementation: GPT-4 OR External Sound Verifier
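
When the critique comes from a sound tool, its level of detail can be varied. The sketch below illustrates the three feedback levels named in the Main Takeaways (binary, first error, all errors) for Graph Coloring. The function name `make_critique`, the level labels, and the message wording are assumptions for illustration, not the paper's prompts; the coloring is assumed to assign a color to every vertex.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]
LEVELS = ("binary", "first", "all")


def make_critique(edges: List[Edge], coloring: Dict[int, str], level: str = "binary") -> str:
    """Produce feedback of the requested granularity from a sound edge check."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")
    clashes = [(u, v) for u, v in edges if coloring[u] == coloring[v]]
    if not clashes:
        return "correct"
    if level == "binary":
        return "incorrect"
    described = [f"vertices {u} and {v} are both {coloring[u]}" for u, v in clashes]
    if level == "first":
        return described[0]
    return "; ".join(described)


# Hypothetical instance: a triangle colored with a single color.
edges = [(0, 1), (1, 2), (0, 2)]
coloring = {0: "red", 1: "red", 2: "red"}

for level in LEVELS:
    print(level, "->", make_critique(edges, coloring, level))
```

All three levels share the same underlying check; only the amount of text passed back to the generator differs.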

Novel Architectural Elements

Comparative ablation framework: Systematically swapping the Verifier and Critique modules between 'LLM' and 'Sound External Tool' to measure the contribution of each component
Sampling baseline: Removing critique entirely to test if history/feedback matters or if improvement is just due to repeated trials
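
The sampling baseline can be sketched as follows. Unlike a backprompting loop, the generator receives no history and no critique; the sound verifier serves only as a stop signal. The name `sample_until_verified`, the toy generator, and the default `k=15` (taken from the 15 re-prompts mentioned in the Evaluation Highlights) are assumptions of this illustration.

```python
from typing import Any, Callable, Optional, Tuple


def sample_until_verified(
    problem: Any,
    generate: Callable[[Any], Any],
    verify: Callable[[Any, Any], bool],
    k: int = 15,
) -> Tuple[Optional[Any], int]:
    """Draw up to k independent candidates; return the first one the verifier accepts."""
    for attempt in range(1, k + 1):
        candidate = generate(problem)   # same prompt every time: no history, no critique
        if verify(problem, candidate):
            return candidate, attempt
    return None, k


# Toy stand-in for an LLM: a fixed stream of guesses for "x * x == 49".
guesses = iter([3, 9, 7, 5])
solution, attempts = sample_until_verified(
    49,
    generate=lambda problem: next(guesses),
    verify=lambda problem, x: x * x == problem,
)
print(solution, attempts)  # 7 3
```

If this loop performs about as well as a loop with rich feedback, the critique text itself is contributing little, which is the comparison the ablation is designed to expose.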

Modeling

Base Model: GPT-4

Compute: Not reported in the paper

Comparison to Prior Work

vs. Reflexion: Shows that in formal reasoning (unlike creative writing), self-critique often hurts performance due to hallucinations in verification
vs. Self-Refine: Finds that the 'semantic' content of the feedback is largely irrelevant; gains come from the verifier's soundness and re-sampling
vs. CRITIC: Confirms CRITIC's observation of performance drops but extends analysis to ablating the critique content entirely
vs. Tree of Thoughts: Focuses specifically on the verification/critique loop rather than tree search structures

Limitations

Evaluated only on GPT-4; other models (Claude 3, etc.) might have better verification capabilities
Limited to three specific domains (Game of 24, Graph Coloring, STRIPS planning)
Does not explore fine-tuning the LLM specifically for verification tasks
Relies on specific prompt engineering, which is known to be brittle

📊 Experiments & Results

Evaluation Setup

Formal reasoning tasks with boolean correctness

Benchmarks:

Game of 24 (Arithmetic puzzle)
Graph Coloring (Constraint Satisfaction Problem) [New]
Blocksworld / Mystery Blocksworld (STRIPS Planning)

Metrics:

Accuracy (percentage of solved instances)
False Positive Rate (verifier)
False Negative Rate (verifier)
Statistical methodology: Not explicitly reported in the paper
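
The two verifier metrics can be computed by comparing the verifier's verdicts against ground truth from a sound checker. The sketch below is a minimal illustration; the function name `verifier_error_rates` and the sample records are made up for this example and are not data from the paper.

```python
from typing import Iterable, Optional, Tuple


def verifier_error_rates(
    records: Iterable[Tuple[bool, bool]],
) -> Tuple[Optional[float], Optional[float]]:
    """records holds (is_actually_correct, verifier_accepted) pairs.

    Returns (false_positive_rate, false_negative_rate); a rate is None when
    its denominator is zero.
    """
    false_pos = false_neg = n_correct = n_incorrect = 0
    for is_correct, accepted in records:
        if is_correct:
            n_correct += 1
            if not accepted:
                false_neg += 1      # valid solution wrongly rejected
        else:
            n_incorrect += 1
            if accepted:
                false_pos += 1      # invalid solution wrongly accepted
    fpr = false_pos / n_incorrect if n_incorrect else None
    fnr = false_neg / n_correct if n_correct else None
    return fpr, fnr


# Made-up verdicts: 4 correct candidates (3 rejected), 6 incorrect ones (3 accepted).
records = [(True, True), (True, False), (True, False), (True, False),
           (False, True), (False, True), (False, True),
           (False, False), (False, False), (False, False)]
print(verifier_error_rates(records))  # (0.5, 0.75)
```

A sound verifier scores 0 on both rates by definition, so any nonzero value measures how far the LLM verifier departs from it.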

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Self-Verification (LLM+LLM) performance collapses compared to Standard Prompting across most domains due to poor verification accuracy.
Graph Coloring	Accuracy	16	2	-14
Game of 24	Accuracy	5	3	-2
Mystery Blocksworld	Accuracy	4	0	-4
Use of a Sound Verifier (LLM+Sound) significantly improves performance, confirming the generator is capable if correctly guided.
Graph Coloring	Accuracy	16	38	+22
Game of 24	Accuracy	5	36	+31
Sampling (repeated guessing with sound verification but NO critique text) achieves similar performance to complex feedback loops, suggesting critique content is not the main driver.
Graph Coloring	Accuracy	34	40	+6
Blocksworld	Accuracy	83	68	-15
LLM Verifier reliability analysis shows high error rates, explaining the collapse of the LLM+LLM method.
Graph Coloring	False Negative Rate	0	95.8	+95.8

Experiment Figures

Performance (number of correct instances) vs. Number of Iterations (backprompts) for both Sound Verifiers and LLM Verifiers.

Main Takeaways

LLMs are poor verifiers in formal reasoning domains, often exhibiting high false negative rates (rejecting correct answers) which causes performance collapse in self-correction loops.
The perceived benefits of 'self-correction' in the literature may be largely misattributed; in these experiments, simple re-sampling with a sound verifier (guessing until correct) matched or outperformed complex feedback loops.
Content of critique (Binary vs. First Error vs. All Errors) matters surprisingly little; the existence of a stop signal from a sound verifier is the primary driver of performance.
Future reasoning systems should adopt LLM-Modulo frameworks: use LLMs for hypothesis generation but rely strictly on external sound verifiers for checking.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model prompting (standard vs. iterative)
Basic knowledge of NP-complete problems (Graph Coloring)
Familiarity with automated planning (STRIPS/PDDL)

Key Terms

STRIPS: Stanford Research Institute Problem Solver, a formal language for describing automated planning problems using initial states, goals, and actions with preconditions/effects

PDDL: Planning Domain Definition Language, a standard encoding format for automated planning problems

Graph Coloring: An NP-complete problem where the goal is to assign colors to graph vertices such that no two connected vertices share the same color

Game of 24: A math puzzle where four given numbers, each used exactly once, must be combined using arithmetic operations to equal 24

Sound Verifier: An external, non-LLM program (e.g., Python script, VAL) that is guaranteed to correctly determine whether a proposed solution is valid

LLM-Modulo: A framework where LLMs are used for generation but are checked and guided by external symbolic verifiers/solvers

False Negative Rate: The rate at which the verifier (LLM) incorrectly rejects a valid solution

Backprompting: The process of feeding the LLM's previous output (and potentially critique) back into it to request a correction
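
To make the STRIPS and Sound Verifier terms concrete, the sketch below validates a plan by simulating it: each action's preconditions must hold in the current state, its delete and add effects are then applied, and the goal must hold at the end. This is a toy stand-in and not the interface of VAL; it handles only ground (variable-free) actions and does no PDDL parsing. The `Action` type, `validate_plan`, and the two-block instance are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple, Sequence, Tuple


class Action(NamedTuple):
    pre: FrozenSet[str]      # facts that must hold before the action
    add: FrozenSet[str]      # facts made true by the action
    delete: FrozenSet[str]   # facts made false by the action


def validate_plan(
    init: Iterable[str],
    goal: Iterable[str],
    actions: Dict[str, Action],
    plan: Sequence[str],
) -> Tuple[bool, str]:
    """Simulate the plan from the initial state and report the first failure, if any."""
    state = set(init)
    for step, name in enumerate(plan, start=1):
        if name not in actions:
            return False, f"step {step}: unknown action {name}"
        action = actions[name]
        missing = action.pre - state
        if missing:
            return False, f"step {step}: {name} lacks {sorted(missing)}"
        state = (state - action.delete) | action.add
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: {sorted(unmet)}"
    return True, "plan is valid"


# Hypothetical two-block instance: put A on top of B.
actions = {
    "pickup-A": Action(
        pre=frozenset({"ontable A", "clear A", "handempty"}),
        add=frozenset({"holding A"}),
        delete=frozenset({"ontable A", "clear A", "handempty"}),
    ),
    "stack-A-B": Action(
        pre=frozenset({"holding A", "clear B"}),
        add=frozenset({"on A B", "clear A", "handempty"}),
        delete=frozenset({"holding A", "clear B"}),
    ),
}
init = {"ontable A", "ontable B", "clear A", "clear B", "handempty"}
goal = {"on A B"}

print(validate_plan(init, goal, actions, ["pickup-A", "stack-A-B"]))
print(validate_plan(init, goal, actions, ["stack-A-B"]))
```

The second call fails at step 1 because "holding A" is not true in the initial state, which is the sort of first-error feedback a sound planner-side verifier can return.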