By formatting reasoning as a 'Natural Program' and decomposing verification into step-by-step checks that isolate only necessary premises, LLMs can rigorously self-verify their own deductive logic.
Core Problem
Standard Chain-of-Thought prompting often introduces hallucinations and accumulated logic errors because LLMs struggle to verify entire reasoning chains accurately when distracted by irrelevant context.
Why it matters:
Current LLMs like ChatGPT fail to identify reasoning mistakes when verifying full chains (approx. 50% accuracy), limiting reliability in complex tasks
Errors in early reasoning steps cause a snowball effect, compounding mistakes in subsequent steps
Existing solutions like code-based reasoning cannot capture nuances of natural language (e.g., moral reasoning or quantifiers like 'likely')
Concrete Example: When asked to verify a long reasoning chain in a single pass, ChatGPT typically outputs 'Correct' regardless of validity because it must consider all of the text (including irrelevant premises) simultaneously; the paper reports roughly 50% accuracy on this task.
Key Novelty
Natural Program-based Step-by-Step Verification
Decomposes verification into individual subprocesses where the model only sees the specific step and its explicitly cited necessary premises, removing distracting context
Introduces 'Natural Program', a strict format where premises are numbered and every reasoning step explicitly cites the premise numbers it derives from (e.g., 'Step 3 (by #1, #2)')
Uses Unanimity-Plurality Voting: samples multiple chains, filters for fully valid ones (unanimity of valid steps), and votes on the final answer (plurality)
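To make the format concrete, here is a small illustrative chain in the Natural Program style; the content and exact wording are invented for illustration, not taken from the paper:

```python
# Illustrative Natural Program-style chain. Premises are numbered, and each
# reasoning step explicitly cites the premise numbers it derives from, so a
# verifier can check any single step against only its cited premises.
chain = """\
#1. Alice has 3 apples.
#2. Bob gives Alice 2 more apples.
#3. (by #1, #2) Alice now has 3 + 2 = 5 apples.
#4. (by #3) The final answer is 5 apples.
"""
```

Because every step names its premises, "verify step #3" becomes a self-contained query over #1 and #2 alone.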
Architecture
Overview of the Natural Program-based deductive reasoning and verification process (inferred from text description)
Breakthrough Assessment
7/10
Proposes a logical, structured approach to fixing CoT hallucinations without external solvers. While the provided text lacks final performance numbers, the methodology addresses a fundamental flaw in current self-verification techniques.
⚙️ Technical Details
Problem Definition
Setting: Reasoning-based Question Answering (QA) where a model generates a reasoning chain S = (s_1, ..., s_m) to reach an answer
Inputs: Question Q and Context C
Outputs: A deductively valid reasoning chain S and a final answer A
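The setting can be captured with a minimal data structure; the class and field names below are my own, chosen to match the notation above:

```python
from dataclasses import dataclass

@dataclass
class ReasoningInstance:
    question: str      # Q
    context: str       # C
    chain: list[str]   # S = (s_1, ..., s_m), the generated reasoning steps
    answer: str        # A, the final answer
```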
Pipeline Flow
Premise Extraction (LLM lists question-related premises with labels)
Natural Program Generation (LLM generates steps citing premises)
Step-by-Step Verification (LLM verifies each step using only cited premises)
Filtering & Voting (Discard chains with invalid steps; vote on remaining)
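A minimal sketch of this four-stage flow, assuming a hypothetical `llm(prompt) -> str` wrapper around GPT-3.5-turbo; the prompt wording, step parsing, and answer convention are all simplified stand-ins, not the paper's actual prompts:

```python
from collections import Counter

def deductive_pipeline(question, context, llm, n=5):
    """Sketch of the pipeline; `llm` is a hypothetical text-completion wrapper."""
    # 1. Premise extraction: numbered, labeled premises.
    premises = llm("List numbered premises:\n" + context + "\n" + question)
    answers = []
    for _ in range(n):
        # 2. Natural Program generation: steps like "Step 3 (by #1, #2): ...".
        chain = llm("Reason step by step, citing premises:\n" + premises)
        # 3. Step-by-step verification: one isolated query per step.
        steps = [s for s in chain.splitlines() if s.startswith("Step")]
        ok = all("yes" in llm("Verify this step against only its cited premises:\n" + s).lower()
                 for s in steps)
        # 4. Unanimity filter: discard chains with any invalid step.
        if ok:
            answers.append(chain.splitlines()[-1])  # final line = answer (assumed convention)
    # Plurality vote among the fully valid chains.
    return Counter(answers).most_common(1)[0][0] if answers else None
```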
System Modules
Premise Extractor (Generation)
Extracts and labels all statements/relationships from the question and context
Model or implementation: GPT-3.5-turbo (implied)
Reasoning Generator (Generation)
Generates reasoning steps in Natural Program format
Model or implementation: GPT-3.5-turbo (implied)
Step Verifier
Verifies the deductive validity of a single step
Model or implementation: GPT-3.5-turbo (implied)
Novel Architectural Elements
Strict 'Natural Program' syntax enforcement via in-context learning to enable algorithmic decomposition of the verification task
Context-isolated verification pipeline (verifying s_i using only p_i rather than full history)
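The context isolation can be sketched as a prompt builder that exposes only the cited premises p_i to the verifier; the prompt wording here is illustrative, not the paper's:

```python
def verification_prompt(step_text, cited_ids, premises):
    """Build a context-isolated verification query: show only the premises the
    step cites, hiding the rest of the chain and any distracting context."""
    shown = "\n".join(f"#{i}. {premises[i]}" for i in sorted(cited_ids))
    return (f"Premises:\n{shown}\n\nStep:\n{step_text}\n\n"
            "Does the step follow deductively from these premises alone? (yes/no)")
```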
Modeling
Base Model: OpenAI's GPT-3.5-turbo
Compute: Not reported in the paper
Comparison to Prior Work
vs. Self-Consistency: Focuses on the deductive validity of the *process*, not just the answer consensus
vs. Program-aided CoT: Uses natural language instead of Python/SQL, allowing for qualitative reasoning (e.g., 'likely', moral reasoning)
vs. Reflexion: Decomposes verification into isolated steps with strict premise scoping to prevent context distraction
Limitations
Relies on the model's ability to adhere to the strict 'Natural Program' format via in-context learning
Requires explicit extraction of premises, which might be difficult for highly ambiguous contexts
Evaluation
Task: Verification of reasoning chains generated by LLMs on arithmetic and commonsense tasks
Benchmarks:
Arithmetic datasets (Reasoning QA)
Commonsense datasets (Reasoning QA)
Metrics:
Verification Accuracy
Deductive Validity
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Various reasoning datasets | Verification Accuracy | 50 | 100 | +50 |
Main Takeaways
Naïve 'Let's think step by step' verification fails because models get distracted by irrelevant premises in the full context.
Most reasoning chains that pass the proposed strict verification are found to be rigorous, while rejected chains typically contain imprecise elements even if the final answer is correct.
Reliable self-verification is possible if the process is decomposed into steps containing *only* necessary premises.
📚 Prerequisite Knowledge
Prerequisites
Chain-of-Thought (CoT) prompting
Deductive reasoning (premises and conclusions)
In-context learning
Key Terms
Natural Program: A structured natural language format proposed by the authors where premises are numbered and reasoning steps explicitly cite the premises they use
Deductive Validity: A binary metric indicating whether a specific reasoning step logically follows from its cited premises, regardless of the final answer's correctness
Unanimity-Plurality Voting: A generation strategy where the model discards any reasoning chain with even a single invalid step (unanimity), then votes among the remaining valid chains (plurality) for the answer
CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps
Snowball effect: The phenomenon where a single mistake in an early reasoning step causes subsequent steps to become incorrect, leading to a wrong final answer
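Putting the Unanimity-Plurality term into code, a minimal sketch of the voting rule; the data layout is my own:

```python
from collections import Counter

def unanimity_plurality_vote(chains):
    """chains: list of (step_validities, answer) pairs.
    Unanimity: a chain survives only if every one of its steps was judged valid.
    Plurality: the most common answer among surviving chains wins."""
    survivors = [answer for validities, answer in chains if all(validities)]
    if not survivors:
        return None  # no fully valid chain, so no trusted answer
    return Counter(survivors).most_common(1)[0][0]
```

Note the strictness of the unanimity filter: one invalid step discards the whole chain, even if its final answer happens to be correct.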