
Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection

Mingshan Liu, Shi Bo, Jialing Fang
The Hong Kong University of Science and Technology, Fudan University
arXiv.org (2025)
Factuality, Reasoning, Benchmark

📝 Paper Summary

Hallucination suppression, Mathematical reasoning verification, Self-consistency
Structured Self-Consistency extends majority voting by verifying the logical and structural coherence of intermediate reasoning steps in mathematical derivations, rather than just checking final answers.
Core Problem
Standard self-consistency methods in LLMs focus on final answer agreement, neglecting the logical validity of intermediate steps in complex multi-step mathematical reasoning.
Why it matters:
  • Mathematical hallucinations are all-or-nothing and they propagate: a single incorrect intermediate step invalidates the entire chain, even if the final answer appears correct
  • Existing verification methods (fine-tuning, external verifiers) are computationally expensive or require domain-specific architectural changes
  • Current self-consistency approaches fail to detect 'cascading errors' where plausible but unsound logic leads to answers that coincidentally align with the majority
Concrete Example: In a proof, an LLM might claim 'P implies Q' from premises that only support 'P implies R'. Standard self-consistency might miss this if the final result 'Q' is popular, whereas structured verification would detect the invalid logical link.
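The gap described in this example can be illustrated with a minimal sketch. It is not the paper's implementation: the sample data, the (premise, conclusion) step encoding, the helper names, and the 0.5 support cutoff are all assumptions made for illustration. Cross-sample step support is used here only as a rough proxy for soundness; it does not check logical validity.

```python
from collections import Counter

# Hypothetical sampled derivations: each is a list of (premise, conclusion)
# steps plus a final answer. The data is invented for illustration.
samples = [
    {"steps": [("P", "R"), ("R", "Q")], "answer": "Q"},
    {"steps": [("P", "R"), ("R", "Q")], "answer": "Q"},
    {"steps": [("P", "Q")], "answer": "Q"},  # jumps straight from P to Q
    {"steps": [("P", "R"), ("R", "S")], "answer": "S"},
]

def majority_answer(samples):
    """Standard self-consistency: vote on final answers only."""
    counts = Counter(s["answer"] for s in samples)
    return counts.most_common(1)[0][0]

def step_support(samples):
    """Fraction of samples that contain each individual step."""
    counts = Counter(step for s in samples for step in set(s["steps"]))
    return {step: c / len(samples) for step, c in counts.items()}

def weakest_step(sample, support):
    """The step in this sample that the fewest other samples share."""
    return min(sample["steps"], key=lambda step: support[step])

winner = majority_answer(samples)
support = step_support(samples)

# Flag samples that agree with the majority answer but rely on a step
# that fewer than half of the samples contain (0.5 is an assumed cutoff).
flagged = [
    i for i, s in enumerate(samples)
    if s["answer"] == winner and support[weakest_step(s, support)] < 0.5
]

print(winner, flagged)
```

In this toy data, answer-only voting counts the third sample as a vote for Q, while the step-level check singles it out because its only step, P to Q, is shared by no other sample.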
Key Novelty
Structured Self-Consistency (SSC)
  • Hierarchical verification: Validates reasoning at three levels—atomic statements (embedding similarity), logical dependencies (validity of transitions), and global structure (graph isomorphism)
  • Probabilistic structural modeling: Treats mathematical derivations as directed acyclic graphs (DAGs) and computes consistency scores based on how often specific structural patterns appear across sampled responses
  • Adaptive sampling: Dynamically increases sample count only when structural consistency is low, terminating early for high-agreement cases to save compute
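The interplay between a structural consistency score and adaptive sampling can be sketched roughly as follows. This is a simplified stand-in, not the method from the paper: mean pairwise Jaccard overlap of edge sets replaces the embedding-similarity and graph-isomorphism checks, and the function names, `min_samples`, `max_samples`, and `threshold` values are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap between two derivations, each given as a collection of DAG edges."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def structural_consistency(graphs):
    """Mean pairwise edge overlap; 1.0 when there are fewer than two graphs."""
    pairs = list(combinations(graphs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def adaptive_sample(sample_fn, min_samples=3, max_samples=12, threshold=0.8):
    """Draw min_samples derivations, then keep sampling only while the
    structural consistency stays below the (assumed) threshold."""
    graphs = [sample_fn() for _ in range(min_samples)]
    while len(graphs) < max_samples and structural_consistency(graphs) < threshold:
        graphs.append(sample_fn())
    return graphs, structural_consistency(graphs)

# Stand-in samplers; a real sample_fn would query an LLM and parse the
# response into edges, which is outside the scope of this sketch.
def stable_sampler():
    return [("P", "R"), ("R", "Q")]

_counter = iter(range(1000))
def unstable_sampler():
    return [("P", f"X{next(_counter)}")]  # a different edge on every call

stable_graphs, stable_score = adaptive_sample(stable_sampler)
unstable_graphs, unstable_score = adaptive_sample(unstable_sampler)
print(len(stable_graphs), stable_score)
print(len(unstable_graphs), unstable_score)
```

With the stable sampler the loop stops at the minimum sample count because all derivations share the same edges; with the unstable sampler no two derivations overlap, so sampling continues until `max_samples` is reached.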
Evaluation Highlights
  • Proof validity improved by 8.3% (p < 0.01) in formal theorem proving tasks compared to baseline approaches
  • Numerical stability increased by 42.8% in computation tasks, significantly reducing arithmetic hallucinations
  • Computational overhead reduced by 56.3% via adaptive sampling while maintaining accuracy comparable to fixed large-sample methods
Breakthrough Assessment
8/10
Significant efficiency gains and a theoretically grounded approach to intermediate step verification make this a strong contribution to reliable mathematical reasoning, though bounded by the need for multiple samples.