By formatting reasoning as a 'Natural Program' and decomposing verification into step-by-step checks that isolate only necessary premises, LLMs can rigorously self-verify their own deductive logic.
Core Problem
Standard Chain-of-Thought prompting often introduces hallucinations and accumulated logic errors because LLMs struggle to verify entire reasoning chains accurately when distracted by irrelevant context.
Why it matters:
Current LLMs like ChatGPT fail to identify reasoning mistakes when verifying full chains (approx. 50% accuracy), limiting reliability in complex tasks
Errors in early reasoning steps cause a snowball effect, compounding mistakes in subsequent steps
Existing solutions like code-based reasoning cannot capture nuances of natural language (e.g., moral reasoning or quantifiers like 'likely')
Concrete Example: When asked to verify a long reasoning chain in a single pass, ChatGPT typically outputs 'Correct' regardless of validity because it must consider all of the text (including irrelevant premises) simultaneously; the paper reports roughly 50% accuracy on this task.
Key Novelty
Natural Program-based Step-by-Step Verification
Decomposes verification into individual subprocesses where the model only sees the specific step and its explicitly cited necessary premises, removing distracting context
Introduces 'Natural Program', a strict format where premises are numbered and every reasoning step explicitly cites the premise numbers it derives from (e.g., 'Step 3 (by #1, #2)')
Uses Unanimity-Plurality Voting: samples multiple chains, filters for fully valid ones (unanimity of valid steps), and votes on the final answer (plurality)
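To make the format concrete, here is a small illustrative chain in the Natural Program style; the content and exact wording are invented for illustration, not taken from the paper:

```python
# Illustrative Natural Program-style chain. Premises are numbered, and each
# reasoning step explicitly cites the premise numbers it derives from, so a
# verifier can check any single step against only its cited premises.
chain = """\
#1. Alice has 3 apples.
#2. Bob gives Alice 2 more apples.
#3. (by #1, #2) Alice now has 3 + 2 = 5 apples.
#4. (by #3) The final answer is 5 apples.
"""
```

Because every step names its premises, "verify step #3" becomes a self-contained query over #1 and #2 alone.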
Architecture
Overview of the Natural Program-based deductive reasoning and verification process (inferred from text description)
Breakthrough Assessment
7/10
Proposes a logical, structured approach to fixing CoT hallucinations without external solvers. While the provided text lacks final performance numbers, the methodology addresses a fundamental flaw in current self-verification techniques.
⚙️ Technical Details
Problem Definition
Setting: Reasoning-based Question Answering (QA) where a model generates a reasoning chain S = (s_1, ..., s_m) to reach an answer
Inputs: Question Q and Context C
Outputs: A deductively valid reasoning chain S and a final answer A
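The setting can be captured with a minimal data structure; the class and field names below are my own, chosen to match the notation above:

```python
from dataclasses import dataclass

@dataclass
class ReasoningInstance:
    question: str      # Q
    context: str       # C
    chain: list[str]   # S = (s_1, ..., s_m), the generated reasoning steps
    answer: str        # A, the final answer
```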
Pipeline Flow
Premise Extraction (LLM lists question-related premises with labels)
Natural Program Generation (LLM generates steps citing premises)
Step-by-Step Verification (LLM verifies each step using only cited premises)
Filtering & Voting (Discard chains with invalid steps; vote on remaining)
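A minimal sketch of this four-stage flow, assuming a hypothetical `llm(prompt) -> str` wrapper around GPT-3.5-turbo; the prompt wording, step parsing, and answer convention are all simplified stand-ins, not the paper's actual prompts:

```python
from collections import Counter

def deductive_pipeline(question, context, llm, n=5):
    """Sketch of the pipeline; `llm` is a hypothetical text-completion wrapper."""
    # 1. Premise extraction: numbered, labeled premises.
    premises = llm("List numbered premises:\n" + context + "\n" + question)
    answers = []
    for _ in range(n):
        # 2. Natural Program generation: steps like "Step 3 (by #1, #2): ...".
        chain = llm("Reason step by step, citing premises:\n" + premises)
        # 3. Step-by-step verification: one isolated query per step.
        steps = [s for s in chain.splitlines() if s.startswith("Step")]
        ok = all("yes" in llm("Verify this step against only its cited premises:\n" + s).lower()
                 for s in steps)
        # 4. Unanimity filter: discard chains with any invalid step.
        if ok:
            answers.append(chain.splitlines()[-1])  # final line = answer (assumed convention)
    # Plurality vote among the fully valid chains.
    return Counter(answers).most_common(1)[0][0] if answers else None
```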
System Modules
Premise Extractor (Generation)
Extracts and labels all statements/relationships from the question and context
Model or implementation: GPT-3.5-turbo (implied)
Reasoning Generator (Generation)
Generates reasoning steps in Natural Program format
Model or implementation: GPT-3.5-turbo (implied)
Step Verifier
Verifies the deductive validity of a single step
Model or implementation: GPT-3.5-turbo (implied)
Novel Architectural Elements
Strict 'Natural Program' syntax enforcement via in-context learning to enable algorithmic decomposition of the verification task
Context-isolated verification pipeline (verifying s_i using only p_i rather than full history)
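The context isolation can be sketched as a prompt builder that exposes only the cited premises p_i to the verifier; the prompt wording here is illustrative, not the paper's:

```python
def verification_prompt(step_text, cited_ids, premises):
    """Build a context-isolated verification query: show only the premises the
    step cites, hiding the rest of the chain and any distracting context."""
    shown = "\n".join(f"#{i}. {premises[i]}" for i in sorted(cited_ids))
    return (f"Premises:\n{shown}\n\nStep:\n{step_text}\n\n"
            "Does the step follow deductively from these premises alone? (yes/no)")
```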
Modeling
Base Model: OpenAI's GPT-3.5-turbo
Compute: Not reported in the paper
Comparison to Prior Work
vs. Self-Consistency: Focuses on the deductive validity of the *process*, not just the answer consensus
vs. Program-aided CoT: Uses natural language instead of Python/SQL, allowing for qualitative reasoning (e.g., 'likely', moral reasoning)
vs. Reflexion: Decomposes verification into isolated steps with strict premise scoping to prevent context distraction
Limitations
Relies on the model's ability to adhere to the strict 'Natural Program' format via in-context learning
Requires explicit extraction of premises, which might be difficult for highly ambiguous contexts
Evaluation
Task: Verification of reasoning chains generated by LLMs on arithmetic and commonsense tasks
Benchmarks:
Arithmetic datasets (Reasoning QA)
Commonsense datasets (Reasoning QA)
Metrics:
Verification Accuracy
Deductive Validity
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Various reasoning datasets | Verification Accuracy | 50 | 100 | +50 |
Main Takeaways
Naïve 'Let's think step by step' verification fails because models get distracted by irrelevant premises in the full context.
Most reasoning chains that pass the proposed strict verification are found to be rigorous, while rejected chains typically contain imprecise elements even if the final answer is correct.
Reliable self-verification is possible if the process is decomposed into steps containing *only* necessary premises.
📚 Prerequisite Knowledge
Prerequisites
Chain-of-Thought (CoT) prompting
Deductive reasoning (premises and conclusions)
In-context learning
Key Terms
Natural Program: A structured natural language format proposed by the authors where premises are numbered and reasoning steps explicitly cite the premises they use
Deductive Validity: A binary metric indicating whether a specific reasoning step logically follows from its cited premises, regardless of the final answer's correctness
Unanimity-Plurality Voting: A generation strategy where the model discards any reasoning chain with even a single invalid step (unanimity), then votes among the remaining valid chains (plurality) for the answer
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps
Snowball effect: The phenomenon where a single mistake in an early reasoning step causes subsequent steps to become incorrect, leading to a wrong final answer
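Putting the Unanimity-Plurality term into code, a minimal sketch of the voting rule; the data layout is my own:

```python
from collections import Counter

def unanimity_plurality_vote(chains):
    """chains: list of (step_validities, answer) pairs.
    Unanimity: a chain survives only if every one of its steps was judged valid.
    Plurality: the most common answer among surviving chains wins."""
    survivors = [answer for validities, answer in chains if all(validities)]
    if not survivors:
        return None  # no fully valid chain, so no trusted answer
    return Counter(survivors).most_common(1)[0][0]
```

Note the strictness of the unanimity filter: one invalid step discards the whole chain, even if its final answer happens to be correct.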