
Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification

Fernando Gabriela García, Qiyang Shi, Zilin Feng
arXiv (2025)
Factuality, Reasoning, RAG, QA

📝 Paper Summary

Hallucination suppression, Chain-of-Thought (CoT) Enhancement
VeriFact-CoT enhances LLM reliability by embedding a self-reflective loop into the reasoning process that identifies factual claims, generates verification queries, simulates evidence retrieval, and integrates citations without external tools.
Core Problem
LLMs frequently hallucinate and rarely provide verifiable citations, while standard Chain-of-Thought (CoT) improves logical coherence but not necessarily factual correctness.
Why it matters:
  • Deployment in critical domains (legal, medical, scientific) is restricted by the risk of fabricated information
  • Existing RAG methods depend on the quality and availability of external retrieval, which is not always accessible or easy to integrate
  • CoT alone guides reasoning steps but does not inherently verify the truthfulness of the statements made within those steps
Concrete Example: In complex QA, a standard CoT model might correctly reason through a logical sequence but hallucinate a specific date or name within that sequence. VeriFact-CoT catches this by pausing to ask 'is this claim factual?', simulating a check, and correcting the date before final output.
Key Novelty
Internal Simulated RAG within CoT
  • Replaces external retrieval with a 'simulated' verification step where the LLM queries its own parametric knowledge as if it were an external database
  • Integrates a four-stage pipeline (Reason → Claim Extraction → Simulated Verification → Refinement) purely through prompt engineering, without model fine-tuning
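The four-stage loop can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable interface, the function name `verifact_cot`, the prompt wording, and the stage tags (REASON, EXTRACT, VERIFY, REFINE) are all assumptions made for the example, and `stub_llm` returns invented placeholder text so the sketch runs without a model.

```python
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def verifact_cot(question: str, llm: LLM) -> str:
    """Run the four stages in order using prompting only (no fine-tuning)."""
    # Stage 1: ordinary chain-of-thought draft.
    draft = llm(f"REASON: Think step by step and answer.\nQuestion: {question}")

    # Stage 2: extract the checkable factual claims, one per line.
    raw_claims = llm(f"EXTRACT: List each factual claim on its own line.\nText: {draft}")
    claims: List[str] = [c.strip() for c in raw_claims.splitlines() if c.strip()]

    # Stage 3: simulated verification. The model answers each verification
    # query from its own parametric knowledge; nothing external is retrieved.
    evidence = [
        llm(f"VERIFY: State what you know that supports or refutes this claim.\nClaim: {c}")
        for c in claims
    ]

    # Stage 4: revise the draft against the simulated evidence and cite it.
    notes = "\n".join(
        f"[{i}] {c} -> {e}"
        for i, (c, e) in enumerate(zip(claims, evidence), start=1)
    )
    return llm(
        "REFINE: Correct the draft using the notes and add [n] citations.\n"
        f"Draft: {draft}\nNotes:\n{notes}"
    )


def stub_llm(prompt: str) -> str:
    """Canned replies keyed on the stage tag (placeholder content, not real facts)."""
    if prompt.startswith("REASON"):
        return "The treaty was signed in 1820."
    if prompt.startswith("EXTRACT"):
        return "The treaty was signed in 1820."
    if prompt.startswith("VERIFY"):
        return "Refuted: recalled signing year is 1815."
    return "The treaty was signed in 1815 [1]."


print(verifact_cot("When was the treaty signed?", stub_llm))
```

Swapping `stub_llm` for a real model client would leave the control flow unchanged. Note that stage 3 never leaves the model, which is why the method cannot recover facts absent from pre-training.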
Evaluation Highlights
  • In Complex Factual QA, improves factual accuracy to 83% compared to 72% for Standard CoT and 78% for CoT + Basic RAG
  • Reduces hallucination rate to 12% in QA tasks, down from 25% (Standard CoT) and 18% (CoT + Basic RAG)
  • Significantly improves citation quality (precision, relevance, verifiability) compared to baselines across summarization and explanatory tasks
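To put the reported QA figures side by side, the short script below computes the gap between VeriFact-CoT and each baseline, both in absolute percentage points and relative to the baseline value. Only the six percentages come from the summary above; the helper name `gaps` and the dictionary layout are illustrative choices.

```python
# Percentages as reported for Complex Factual QA.
accuracy = {"Standard CoT": 72, "CoT + Basic RAG": 78, "VeriFact-CoT": 83}
hallucination = {"Standard CoT": 25, "CoT + Basic RAG": 18, "VeriFact-CoT": 12}


def gaps(metric, ours="VeriFact-CoT"):
    """Return {baseline: (absolute gap in points, relative change in %)}."""
    out = {}
    for name, value in metric.items():
        if name != ours:
            diff = metric[ours] - value
            out[name] = (diff, round(100 * diff / value, 1))
    return out


print(gaps(accuracy))
# {'Standard CoT': (11, 15.3), 'CoT + Basic RAG': (5, 6.4)}
print(gaps(hallucination))
# {'Standard CoT': (-13, -52.0), 'CoT + Basic RAG': (-6, -33.3)}
```

Read this way, the hallucination rate falls by 13 points against Standard CoT, roughly a 52% relative reduction, while accuracy rises by 11 points.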
Breakthrough Assessment
7/10
Offers a clever, fine-tuning-free prompting strategy that boosts accuracy by simulating RAG behavior (83% vs. 72% for Standard CoT on Complex Factual QA). However, relying on simulated verification limits the model to its pre-trained knowledge, unlike true RAG.