
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability

Ninad Naik
arXiv.org (2024)
Factuality QA Reasoning

📝 Paper Summary

Hallucination suppression, Factuality verification
A framework that validates LLM output by requiring probabilistic consensus across an ensemble of models, raising factual precision from 73.1% to 95.6% in the reported evaluation without relying on external knowledge bases.
Core Problem
LLMs operate probabilistically, leading to precision errors (hallucinations) and accuracy errors (bias) that make them unreliable for high-stakes domains like healthcare and law.
Why it matters:
  • Errors compound across multi-step reasoning: a 26.9% per-step error rate grows to a 99.8% chance of at least one error over 20 steps
  • Existing solutions like RAG are limited by non-deterministic retrieval and the currency of external sources
  • Human-in-the-loop verification introduces latency and limits scalability
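
The compounding figure in the first bullet can be reproduced with a short calculation. The sketch below assumes every step succeeds independently with the same precision, which is a simplification; the function name `compounded_error` is illustrative and not taken from the paper.

```python
def compounded_error(per_step_precision: float, steps: int) -> float:
    """Probability of at least one error across `steps` steps.

    Assumes each step is correct independently with the same probability
    (a simplifying assumption, not a claim about real reasoning chains).
    """
    if not 0.0 <= per_step_precision <= 1.0:
        raise ValueError("per_step_precision must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return 1.0 - per_step_precision ** steps


# Single-model baseline: 73.1% precision, i.e. a 26.9% per-step error rate.
single_model = compounded_error(0.731, 20)
print(f"{single_model:.1%}")  # 99.8%

# Three-model consensus: 95.6% precision as reported (rounded).
consensus = compounded_error(0.956, 20)
print(f"{consensus:.1%}")  # 59.3%
```

With the rounded 95.6% input the formula gives roughly 59.3%, close to the 59.5% quoted in the summary; the small gap is plausibly due to rounding of the precision figure.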
Concrete Example: When asked about the Cabinet Secretariat in India, a single model (Claude 3.5 Sonnet) struggled with ambiguities regarding its establishment date and structure, leading to incorrect responses. The ensemble framework filtered these by requiring consensus, as different models disagreed on the ambiguous points.
Key Novelty
Ensemble Validation Framework
  • Repurposes ensemble methods (typically used for performance boosting) for content validation by intersecting the probability distributions of multiple independent models
  • Relies on the statistical principle that while individual models may hallucinate, they are unlikely to hallucinate the exact same error independently (independence of failure modes)
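
The two points above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: each validator is a hypothetical callable standing in for an LLM call, a claim is accepted only on unanimous agreement (one possible reading of "consensus"), and `joint_false_accept_rate` shows the idealized case where failure modes are fully independent.

```python
import math
from typing import Callable, FrozenSet, Sequence

# Hypothetical interface: a validator returns True if it judges the claim valid.
Validator = Callable[[str], bool]


def consensus_validate(claim: str, validators: Sequence[Validator]) -> bool:
    """Accept a claim only if every validator accepts it (unanimity assumed)."""
    if not validators:
        raise ValueError("at least one validator is required")
    return all(validator(claim) for validator in validators)


def joint_false_accept_rate(error_rates: Sequence[float]) -> float:
    """Chance that all validators wrongly accept the same claim.

    Holds only if their errors are fully independent (an idealization).
    """
    return math.prod(error_rates)


def make_stub(accepted: FrozenSet[str]) -> Validator:
    """Toy stand-in for a model: accepts only claims in a fixed set."""
    return lambda claim: claim in accepted


# Three stubs that agree on "claim A" but disagree on "claim B".
models = [
    make_stub(frozenset({"claim A", "claim B"})),
    make_stub(frozenset({"claim A"})),
    make_stub(frozenset({"claim A", "claim B"})),
]

print(consensus_validate("claim A", models))  # True
print(consensus_validate("claim B", models))  # False
```

Under the independence idealization, three validators that each err 26.9% of the time would all err together with probability `joint_false_accept_rate([0.269] * 3)`, about 1.9%. Real models can share failure modes, so measured gains may be smaller than this bound suggests.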
Evaluation Highlights
  • Improved precision from 73.1% (single model baseline) to 95.6% (3-model consensus) on complex reasoning cases
  • Reduced projected error compounding: the 20-step error rate drops from 99.8% to 59.5%
  • Achieved strong inter-model agreement (Cohen's Kappa > 0.76) while maintaining sufficient disagreement to catch errors
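
For readers unfamiliar with the agreement statistic in the last bullet, Cohen's Kappa compares observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two sequences of verdicts; the verdict data is made up for illustration and is not from the paper, and how the paper pairs models when computing Kappa is not stated in this summary.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy verdicts (1 = valid, 0 = invalid) from two hypothetical models.
model_x = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
model_y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

print(f"{cohens_kappa(model_x, model_y):.2f}")  # 0.60
```

In this toy case the two models agree on 8 of 10 items, chance agreement is 0.5, and Kappa is 0.6, below the > 0.76 level reported in the summary.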
Breakthrough Assessment
7/10
Simple but highly effective application of ensemble theory to validation. While the method (voting) is standard, the application to source-independent LLM fact-checking with strong empirical results is valuable.