Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework

📝 Paper Summary

Hallucination suppression, Multi-agent

This paper proposes a hallucination detection framework that uses multiple LLM agents (Trust, Skeptic, Leader) engaging in a Markov Chain-based debate to verify claims against retrieved evidence.

Core Problem

Existing hallucination detection methods often rely on simple prompting or static decomposition, neglecting the crucial verification step where errors persist even after accurate claim extraction.

Why it matters:

LLMs frequently generate inaccurate content (hallucinations), including powerful but opaque models such as ChatGPT and GPT-4, so reliable detection mechanisms are needed.
Training-based interventions are expensive and complex, while existing post-processing methods lack the nuance of human-like debate, leading to lower verification accuracy.

Concrete Example: In a fact-checking procedure, a system might correctly extract a claim and retrieve evidence, but the final verdict stage fails because a single agent simply accepts a plausible-sounding but false claim without rigorous scrutiny or counter-argument.

Key Novelty

Markov Chain-based Multi-agent Debate

Simulates human debate dynamics where the current discussion state depends on the immediate previous outcome, allowing agents to switch between 'trusting' and 'skeptical' modes.
Deploys three distinct agent personas (Trust, Skeptic, Leader) that dynamically transition between verifying credibility and challenging inconsistencies until a consensus is reached.

Architecture

The workflow of the Markov Chain-based multi-agent debate verification framework.

Evaluation Highlights

Achieves 89.2% accuracy on the HaluEval-Dialogue benchmark, outperforming the standard ChatGPT baseline (84.4%) by 4.8 percentage points.
Outperforms FacTool by 8.6 percentage points in accuracy on the HaluEval-QA task (85.6% vs. 77.0%).
Demonstrates superior performance across three distinct generative tasks: Question Answering, Summarization, and Dialogue.

Breakthrough Assessment

7/10

Novel application of Markov Chains to structure multi-agent debates for hallucination detection. Shows consistent improvements over strong baselines such as FacTool, though it relies on existing LLMs without architectural changes.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc hallucination detection in text generated by Large Language Models.

Inputs: A generated response $R$ from an LLM and the corresponding prompt/context.

Outputs: A binary verification result (Factual/Non-factual) for extracted claims and the overall response.

Pipeline Flow

Claim Detection: Extract atomic claims from the response.
Evidence Retrieval: Generate queries and retrieve evidence via Google API or local knowledge.
Multi-agent Verification: Agents debate the validity of claims using a Markov Chain structure.
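The three stages above can be sketched as a small orchestration function. This is a minimal illustration, not the paper's implementation: `detect_hallucination`, `ClaimResult`, and the three callables are hypothetical names, and the rule that a response counts as factual only when every extracted claim is judged factual is an assumption about how per-claim verdicts are aggregated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ClaimResult:
    """Verdict for one atomic claim (hypothetical container)."""
    claim: str
    evidence: List[str]
    factual: bool


def detect_hallucination(
    response: str,
    extract_claims: Callable[[str], List[str]],
    retrieve_evidence: Callable[[str], List[str]],
    verify_claim: Callable[[str, List[str]], bool],
) -> Tuple[bool, List[ClaimResult]]:
    """Run claim detection, evidence retrieval, and verification in order."""
    results: List[ClaimResult] = []
    for claim in extract_claims(response):          # stage 1: claim detection
        evidence = retrieve_evidence(claim)         # stage 2: evidence retrieval
        verdict = verify_claim(claim, evidence)     # stage 3: verification (the debate)
        results.append(ClaimResult(claim, evidence, verdict))
    # Assumption: the response is factual only if every claim is factual.
    response_is_factual = all(r.factual for r in results)
    return response_is_factual, results
```

In a real system the three callables would wrap LLM prompts and a search backend; passing them in as parameters keeps the sketch runnable with simple stand-ins.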

System Modules

Claim Extractor

Decomposes complex model responses into atomic, verifiable claims using ChatGPT prompts.

Model or implementation: ChatGPT (gpt-3.5-turbo-0613)

Evidence Retriever

Retrieves external information to validate claims.

Model or implementation: Google API / Local Search

Debate Agents (Trust, Skeptic, Leader)

Engage in a structured debate to verify claims based on retrieved evidence.

Model or implementation: ChatGPT (gpt-3.5-turbo-0613)

Novel Architectural Elements

Markov Chain state machine governing agent interactions: Transitions between 'Trust-initiated' and 'Skeptic-initiated' discussion modes based on previous verdict.
Three-agent role system (Trust, Skeptic, Leader) specifically designed for alternating verification rigor.
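One way to picture the state machine described above is the sketch below. It is a hedged illustration only: the summary does not give the exact transition rule, initial state, or stopping criterion, so `NEXT_STATE`, the `trust_initiated` starting mode, the consensus test, and `max_rounds` are all assumptions, and the agent signature is hypothetical. What it does preserve is the Markov property: the next discussion mode is chosen from the most recent Leader verdict alone.

```python
from typing import Callable, List, Tuple

# Transcript entries are (agent_name, opinion); True means "claim is supported".
Transcript = List[Tuple[str, bool]]
# Hypothetical agent signature: (claim, evidence, transcript) -> opinion.
Agent = Callable[[str, List[str], Transcript], bool]

# Assumed transition rule: a "supported" verdict is stress-tested by letting the
# Skeptic open the next round; an "unsupported" verdict lets Trust open it.
NEXT_STATE = {True: "skeptic_initiated", False: "trust_initiated"}


def markov_debate(
    claim: str,
    evidence: List[str],
    trust: Agent,
    skeptic: Agent,
    leader: Agent,
    max_rounds: int = 3,
) -> Tuple[bool, List[str]]:
    """Return the final verdict and the sequence of discussion modes visited."""
    state = "trust_initiated"  # assumed initial mode
    visited: List[str] = []
    transcript: Transcript = []
    verdict = False
    for _ in range(max_rounds):
        visited.append(state)
        order = [("trust", trust), ("skeptic", skeptic)]
        if state == "skeptic_initiated":
            order.reverse()  # the Skeptic speaks first in this mode
        opinions = []
        for name, agent in order:
            opinion = agent(claim, evidence, transcript)
            transcript.append((name, opinion))
            opinions.append(opinion)
        verdict = leader(claim, evidence, transcript)
        transcript.append(("leader", verdict))
        if opinions[0] == opinions[1] == verdict:
            break  # assumed consensus criterion: all three agents agree
        # Markov step: the next mode depends only on the latest verdict.
        state = NEXT_STATE[verdict]
    return verdict, visited
```

Capping the loop with `max_rounds` is a design choice made here so the sketch always terminates even when the agents never agree.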

Modeling

Base Model: ChatGPT (gpt-3.5-turbo-0613)

Compute: Not reported in the paper

Comparison to Prior Work

vs. FacTool: Replaces the simple verdict prediction stage with a complex multi-agent debate system.
vs. CoVe: Uses external agents and evidence rather than relying solely on the generating model's internal knowledge.
vs. MAD (Multi-Agent Debate) [not cited in paper]: Uses a Markov Chain to structure the flow rather than free-form or round-robin debate.

Limitations

Relies on the performance of the underlying LLM (ChatGPT) for debate and reasoning.
Latency and cost increase due to multiple agent calls per claim compared to single-pass verification.
Retrieval accuracy heavily influences the debate quality; poor evidence leads to poor verdicts.

Reproducibility

Prompt templates for agents (Trust, Skeptic, Leader) and claim extraction are provided in Appendix A. Code availability is not mentioned.

📊 Experiments & Results

Evaluation Setup

Tested on three generative tasks: Knowledge-based QA, Text Summarization, and Dialogue Generation.

Benchmarks:

HaluEval (Hallucination Evaluation): QA, Dialogue, and Summarization subsets

Metrics:

Accuracy
Precision
Recall
F1 Score
Statistical methodology: Not explicitly reported in the paper
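For reference, the four metrics listed above can be computed from binary labels as in the sketch below. The function name `binary_metrics` is hypothetical, and treating "hallucinated" as the positive class is an assumption; the summary does not state which class the paper treats as positive.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary detector.

    Assumption: True marks the positive class (here, "hallucinated").
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    total = tp + fp + fn + tn
    # Guard each ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Under this convention, higher precision means fewer responses wrongly flagged as hallucinated, and higher recall means fewer hallucinations missed.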

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison on HaluEval-QA task showing improvement over baselines.
HaluEval-QA	Accuracy	77.0	85.6	+8.6
HaluEval-QA	Precision	82.9	95.6	+12.7
Main comparison on HaluEval-Dialogue task.
HaluEval-Dialogue	Accuracy	84.4	89.2	+4.8
Main comparison on HaluEval-Summarization task.
HaluEval-Summarization	Accuracy	86.6	89.8	+3.2
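The Δ column above is an absolute difference in percentage points, not a relative improvement. The short sketch below recomputes it from the table rows and also derives the relative gain for comparison; the helper names are hypothetical and the numbers are copied from the table.

```python
# (benchmark, metric, baseline, this_paper, reported_delta), copied from the table.
ROWS = [
    ("HaluEval-QA", "Accuracy", 77.0, 85.6, 8.6),
    ("HaluEval-QA", "Precision", 82.9, 95.6, 12.7),
    ("HaluEval-Dialogue", "Accuracy", 84.4, 89.2, 4.8),
    ("HaluEval-Summarization", "Accuracy", 86.6, 89.8, 3.2),
]


def absolute_gain(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(ours - baseline, 1)


def relative_gain(baseline: float, ours: float) -> float:
    """Improvement as a percentage of the baseline score, rounded to one decimal."""
    return round(100.0 * (ours - baseline) / baseline, 1)


for benchmark, metric, baseline, ours, reported in ROWS:
    print(
        f"{benchmark:<24} {metric:<10} "
        f"abs={absolute_gain(baseline, ours):+.1f} pts "
        f"rel={relative_gain(baseline, ours):+.1f}% "
        f"(reported {reported:+.1f})"
    )
```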

Main Takeaways

The Markov Chain-based debate framework consistently outperforms single-agent baselines (ChatGPT) and previous frameworks (FacTool) across diverse tasks.
The method is particularly effective at improving Precision (95.6 vs. 82.9 on HaluEval-QA), which corresponds to fewer false positives in hallucination detection.
Ablation numbers are not reproduced in this summary; they are located in the paper's appendix/analysis, which supports the contribution of the multi-agent structure.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and hallucination
Knowledge of Multi-Agent Systems
Understanding of Markov Chains (states and transitions)

Key Terms

Markov Chain: A stochastic model describing a sequence of possible events where the probability of each event depends only on the state attained in the previous event.

Hallucination: Generated content from an LLM that is nonsensical or unfaithful to the provided source or real-world facts.

KBQA: Knowledge-Based Question Answering: answering questions that require factual knowledge; one of the three generative tasks evaluated here.

SFT: Supervised Fine-Tuning: training a model on a labeled dataset to adapt it to a specific task.

FacTool: A baseline fact-checking framework utilized for comparison and claim extraction methodology.
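
In symbols, the Markov property from the Markov Chain entry above can be written as follows, where $X_t$ denotes the state at step $t$ and $s, s_0, \dots, s_t$ are possible states (notation chosen here for illustration, not taken from the paper):

$$
P(X_{t+1} = s \mid X_t = s_t, X_{t-1} = s_{t-1}, \dots, X_0 = s_0) = P(X_{t+1} = s \mid X_t = s_t)
$$

In the debate setting, $X_t$ would plausibly correspond to the discussion mode (Trust-initiated or Skeptic-initiated) in round $t$, so the mode of the next round depends only on the current round's outcome.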