Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection

📝 Paper Summary

Hallucination suppression, Mathematical reasoning verification, Self-consistency

Structured Self-Consistency extends majority voting by verifying the logical and structural coherence of intermediate reasoning steps in mathematical derivations, rather than just checking final answers.

Core Problem

Standard self-consistency methods in LLMs focus on final answer agreement, neglecting the logical validity of intermediate steps in complex multi-step mathematical reasoning.

Why it matters:

Mathematical hallucinations are binary and propagating; a single incorrect intermediate step invalidates the entire chain even if the final answer appears correct
Existing verification methods (fine-tuning, external verifiers) are computationally expensive or require domain-specific architectural changes
Current self-consistency approaches fail to detect 'cascading errors' where plausible but unsound logic leads to answers that coincidentally align with the majority

Concrete Example: In a proof, an LLM might claim 'P implies Q' from premises that only support 'P implies R'. Standard self-consistency might miss this if the final result 'Q' is popular, whereas structured verification would detect the invalid logical link.

Key Novelty

Structured Self-Consistency (SSC)

Hierarchical verification: Validates reasoning at three levels—atomic statements (embedding similarity), logical dependencies (validity of transitions), and global structure (graph isomorphism)
Probabilistic structural modeling: Treats mathematical derivations as directed acyclic graphs (DAGs) and computes consistency scores based on how often specific structural patterns appear across sampled responses
Adaptive sampling: Dynamically increases sample count only when structural consistency is low, terminating early for high-agreement cases to save compute
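The probabilistic structural modeling idea can be illustrated with a minimal sketch. It assumes each sampled derivation has already been parsed into a set of (premise, conclusion) edges and scores an edge by the fraction of samples that contain it. The name `edge_consistency` and the mean-support scoring rule are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def edge_consistency(graphs):
    """graphs: list of sets of (src, dst) edges, one set per sampled derivation."""
    k = len(graphs)
    counts = Counter(edge for g in graphs for edge in g)
    # Support of an edge = fraction of sampled derivations that contain it.
    support = {edge: c / k for edge, c in counts.items()}
    # Score of a derivation = mean support of its edges (assumed scoring rule).
    scores = [sum(support[e] for e in g) / len(g) if g else 0.0 for g in graphs]
    return support, scores

# Toy data: two samples derive Q via R; one jumps straight from P to Q.
samples = [
    {("P", "R"), ("R", "Q")},
    {("P", "R"), ("R", "Q")},
    {("P", "Q")},
]
support, scores = edge_consistency(samples)
```

In this toy case the unsupported shortcut `("P", "Q")` appears in only one of three samples, so the derivation that relies on it receives the lowest score even though all three reach the same final statement.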

Evaluation Highlights

Proof validity improved by 8.3% (p < 0.01) in formal theorem proving tasks compared to baseline approaches
Numerical stability increased by 42.8% in computation tasks, significantly reducing arithmetic hallucinations
Computational overhead reduced by 56.3% via adaptive sampling while maintaining accuracy comparable to fixed large-sample methods

Breakthrough Assessment

8/10

Significant efficiency gains and a theoretically grounded approach to intermediate step verification make this a strong contribution to reliable mathematical reasoning, though bounded by the need for multiple samples.

⚙️ Technical Details

Problem Definition

Setting: Verification of multi-step mathematical reasoning chains generated by LLMs

Inputs: Mathematical problem prompt (theorem, symbolic expression, or numerical problem)

Outputs: Verified reasoning chain and final answer, with hallucinated paths filtered out

Pipeline Flow

Sampling: Generate k reasoning paths
Graph Construction: Parse paths into dependency graphs
Hierarchical Verification: Compute statement, edge, and graph consistency scores
Adaptive Control: Determine if more samples are needed
Selection/Repair: Output best graph or repair hallucinated nodes

System Modules

Sampler

Generate initial set of reasoning paths using the base LLM

Model or implementation: Base LLM (e.g., GPT-4, Claude 3)

Graph Parser

Convert text reasoning into structured representations (DAGs for proofs, ASTs for symbols)

Model or implementation: Algorithmic parser

Hierarchical Verifier

Compute consistency scores at atomic, dependency, and global levels

Model or implementation: Statistical/Embedding modules
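As a rough illustration of the atomic level, the sketch below measures agreement between statement embeddings with mean pairwise cosine similarity. The embeddings are assumed to be given (here they are toy 2-D vectors), and `atomic_consistency` is a hypothetical helper; the paper's actual embedding model and aggregation rule are not specified in this summary.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def atomic_consistency(embeddings):
    """Mean pairwise cosine similarity across the sampled statements."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 1.0  # a single statement trivially agrees with itself
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings: three samples agree; in the second set one sample diverges.
agree = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score_agree = atomic_consistency(agree)
score_mixed = atomic_consistency(mixed)
```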

Adaptive Controller

Decide whether to stop or sample more based on current consistency

Model or implementation: Threshold-based logic
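A threshold-based controller of this kind might look like the following sketch. It assumes each sample has been reduced to a hashable structural signature, and the parameters `k_init`, `k_step`, `k_max`, and `threshold` are hypothetical defaults chosen for illustration, not values from the paper.

```python
from collections import Counter

def adaptive_sample(sample_fn, k_init=4, k_step=2, k_max=16, threshold=0.7):
    """Draw samples until the most common signature is dominant or k_max is hit."""
    samples = [sample_fn() for _ in range(k_init)]
    while True:
        top, count = Counter(samples).most_common(1)[0]
        agreement = count / len(samples)
        # Stop early on high agreement, or when the sampling budget is spent.
        if agreement >= threshold or len(samples) >= k_max:
            return top, agreement, len(samples)
        samples.extend(sample_fn() for _ in range(k_step))

# Easy case: every sample has the same signature, so the loop stops at k_init.
easy_result = adaptive_sample(lambda: "A")

# Harder case: a deterministic stream standing in for a stochastic LLM sampler.
stream = iter(["A", "B", "C", "A", "A", "A", "A", "A"])
hard_result = adaptive_sample(lambda: next(stream))
```

The easy case terminates after the initial 4 samples, while the harder case requests two extra batches before agreement reaches the threshold, which is the mechanism behind the reported compute savings.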

Novel Architectural Elements

Hierarchical verification pipeline integrating atomic, dependency, and global structural checks
Graph-isomorphism-based consistency metric for mathematical derivations
Feedback loop for adaptive sampling based on structural consistency scores

Modeling

Base Model: Evaluated on GPT-4, Claude 3, Gemini Ultra, Mixtral 8x22B

Comparison to Prior Work

vs. Majority Voting: SSC verifies the structure of the derivation path, not just agreement on the final answer
vs. CoT: SSC adds a post-generation verification layer that filters hallucinations
vs. ToT: SSC uses ensemble statistics (consistency) rather than self-evaluation prompts or heuristics to judge path validity
vs. Self-Refine [not cited in paper]: SSC uses parallel consistency rather than iterative self-correction prompted by the same model

Limitations

Relies on the assumption that correct reasoning is more probable/consistent than hallucinations (the 'consistency assumption')
Computational cost is higher than single-sample generation, though reduced by adaptive sampling
Graph parsing from natural language math proofs can be noisy or ambiguous

Reproducibility

Not provided (code availability not mentioned in text). Detailed algorithms for scoring and sampling are described mathematically.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning across three domains: Theorem Proving, Symbolic Manipulation, Numerical Computation

Benchmarks:

Formal Theorem Proving (Logical deduction)
Symbolic Transformation (Algebraic manipulation)
Numerical Computation (Arithmetic calculation)

Metrics:

Proof Validity (%)
Symbolic Reasoning Accuracy (%)
Numerical Stability (Statistical Dispersion)
Computational Overhead (Cost reduction)
Statistical methodology: p-values reported for proof validity improvements (< 0.01)

Key Results

Structured Self-Consistency (SSC) demonstrates significant improvements over baseline methods across all three mathematical domains.

Benchmark	Metric	Baseline	This Paper	Δ
Formal Theorem Proving	Proof Validity improvement	0.0	8.3	+8.3
Symbolic Transformation	Reasoning Accuracy improvement	0.0	9.6	+9.6
Numerical Computation	Numerical Stability improvement	0.0	42.8	+42.8
All Tasks	Computational Overhead Reduction	0.0	56.3	+56.3
Human Evaluation	Pearson Correlation (ρ)	1.0	0.87	-0.13

Main Takeaways

Structured verification significantly outperforms simple answer-based verification in complex mathematical tasks
The 'cascading error' phenomenon in math (where one error invalidates the chain) requires global structural checks (graph isomorphism) rather than just local checks
Adaptive sampling effectively mitigates the computational cost of self-consistency by allocating more compute only to uncertain/inconsistent cases
The method generalizes across logic (proofs), algebra (symbolic), and arithmetic (numerical) domains

📚 Prerequisite Knowledge

Prerequisites

Self-Consistency (SC) in LLMs
Directed Acyclic Graphs (DAGs)
Graph Isomorphism
Tree Edit Distance

Key Terms

Self-Consistency (SC): A decoding strategy that samples multiple reasoning paths and selects the most consistent answer (usually via majority vote)
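For reference, plain answer-level self-consistency reduces to a majority vote over final answers, as in this minimal sketch (the answer strings are made-up toy data):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer and its share of the samples."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

# Final answers extracted from five sampled reasoning paths (toy data).
final_answers = ["42", "42", "41", "42", "40"]
winner, agreement = majority_vote(final_answers)
```

Note that the vote never inspects how each "42" was derived, which is the gap the structured approach targets.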

Hallucination: Generated content that appears plausible but contains factual inaccuracies or logical inconsistencies

Reasoning Graph: A Directed Acyclic Graph (DAG) representation of a mathematical derivation where nodes are statements and edges are logical dependencies
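A reasoning graph can be sketched with the standard-library `graphlib` module (Python 3.9+). The input format, a list of (statement, dependencies) pairs, and the helper names are assumptions made for illustration; the summary does not describe the paper's parser.

```python
from graphlib import TopologicalSorter, CycleError

def build_reasoning_graph(steps):
    """steps: list of (statement, [statements it depends on])."""
    return {statement: set(deps) for statement, deps in steps}

def derivation_order(graph):
    """Return a valid derivation order, or None if the reasoning is circular."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None

# A valid derivation: R follows from P and "P -> R".
valid = [("P", []), ("P -> R", []), ("R", ["P", "P -> R"])]
# A circular argument: A is justified by B and B by A.
circular = [("A", ["B"]), ("B", ["A"])]

order = derivation_order(build_reasoning_graph(valid))
circular_order = derivation_order(build_reasoning_graph(circular))
```

A derivation that fails the acyclicity check cannot be a DAG, so it can be rejected before any consistency scoring.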

Graph Isomorphism: A condition where a one-to-one mapping exists between the vertices of two graphs that preserves every edge, so the graphs are connected in the same way; used here to check whether two reasoning chains have the same logical structure
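For small reasoning graphs, the isomorphism check can be sketched by brute force over vertex mappings. This is an illustrative stand-in with factorial cost, not the algorithm used in the paper; it takes directed edge sets as input and ignores isolated nodes.

```python
from itertools import permutations

def is_isomorphic(edges_a, edges_b):
    """Check whether two small directed graphs (edge sets) share a structure."""
    nodes_a = sorted({n for edge in edges_a for n in edge})
    nodes_b = sorted({n for edge in edges_b for n in edge})
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = set(edges_b)
    # Try every bijection from nodes_a to nodes_b and test edge preservation.
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if {(mapping[u], mapping[v]) for u, v in edges_a} == target:
            return True
    return False

chain_1 = {("a", "b"), ("b", "c")}   # a -> b -> c
chain_2 = {("x", "y"), ("y", "z")}   # same shape, different statement labels
fork = {("a", "b"), ("a", "c")}      # one premise feeding two conclusions

same_shape = is_isomorphic(chain_1, chain_2)
different_shape = is_isomorphic(chain_1, fork)
```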

Tree Edit Distance: A metric counting the minimum number of operations (insert, delete, rename) required to transform one tree (e.g., an algebraic syntax tree) into another
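The definition can be made concrete with a small memoized recursion over ordered forests. This is a naive sketch suitable only for tiny expression trees, assuming unit cost for every operation and a tree encoding of `(label, children_tuple)`; it is not an optimized algorithm and not necessarily the variant used in the paper.

```python
from functools import lru_cache

def size(forest):
    """Number of nodes in a forest (a tuple of (label, children) trees)."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)          # insert everything in f2
    if not f2:
        return size(f1)          # delete everything in f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the rightmost root of f1; its children take its place.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the rightmost root of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, renaming if their labels differ.
    rename = (forest_distance(kids1, kids2)
              + forest_distance(f1[:-1], f2[:-1])
              + (label1 != label2))
    return min(delete, insert, rename)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def leaf(label):
    return (label, ())

x_plus_1 = ("+", (leaf("x"), leaf("1")))   # x + 1
x_plus_2 = ("+", (leaf("x"), leaf("2")))   # x + 2

d_rename = tree_edit_distance(x_plus_1, x_plus_2)   # rename 1 -> 2
d_delete = tree_edit_distance(x_plus_1, leaf("x"))  # delete "+" and "1"
```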

Adaptive Sampling: A strategy that dynamically adjusts the number of model outputs generated based on the current confidence or consistency level

Chain-of-Thought (CoT): A prompting technique that encourages the model to generate intermediate reasoning steps before the final answer