MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs

📝 Paper Summary

Long-form generation evaluation, Hallucination detection, Factuality verification

The paper introduces a Chinese long-form factuality dataset (LongHalluQA) and a multi-agent debate verification system (MAD-Fact) that weighs facts by importance to better evaluate LLM hallucinations.

Core Problem

Existing factuality benchmarks focus on short-form English content, while single-model evaluators often hallucinate or fail to distinguish between crucial errors and minor details in long texts.

Why it matters:

Long-form generation is critical for real-world applications (biomedicine, law) where hallucinations can spread misinformation
Current metrics treat all claims equally, penalizing minor auxiliary errors as heavily as central factual contradictions
A severe lack of comprehensive Chinese long-form benchmarks hinders the development and safety assessment of domestic (Chinese) LLMs

Concrete Example: In a text about the Zhuang ethnic group, a model might correctly state their location (a central fact) but err on a minor detail such as a specific brocade pattern (an auxiliary fact). Current metrics weight these two kinds of error identically, misrepresenting the text's overall reliability.

Key Novelty

Hierarchical Multi-Agent Debate for Weighted Factuality (MAD-Fact + LongHalluQA)

Constructs a large-scale Chinese dataset by expanding short QA pairs into complex long-form queries using a knowledge-base-driven pipeline
Employs a multi-agent debate system (Clerk, Jury, Judge) to verify claims, reducing the bias and inconsistency found in single-model evaluators
Introduces a 'fact importance hierarchy' (inspired by the Pyramid Method) to assign different weights to claims, ensuring critical errors impact scores more than minor ones

Architecture

The MAD-Fact system architecture, detailing the multi-agent debate process.

Evaluation Highlights

MAD-Fact metrics correlate strongly with human judgments (Pearson r=0.701), significantly better than unweighted approaches
The constructed LongHalluQA dataset contains 2,746 samples with an average response length 9.4 times longer than original short-text datasets
MAD-Fact consistently outperforms strong baselines like SAFE and FIRE on multiple long-form benchmarks

Breakthrough Assessment

8/10

Addresses a significant gap in non-English long-form evaluation. The combination of a new large-scale dataset, multi-agent verification, and importance-weighted metrics offers a comprehensive solution to hallucination measurement.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of long-form text generated by LLMs against factual reality

Inputs: A question q and a generated long-form response a

Outputs: A factuality evaluation score s reflecting the accuracy and importance of claims in a

Pipeline Flow

Clerk Agent: Decompose Response → Atomic Claims
Jury Agents: Debate & Verify Claims (using Retrieval)
Judge Agent: Aggregate Verdicts → Final Prediction
Importance Scoring: Assign Weights → Calculate Weighted Metric
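
The first three stages above can be sketched as a small Python pipeline. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `clerk_split`, `judge_majority`, and `run_pipeline`, the sentence-splitting rule, the juror callables, and the majority-vote judge are all hypothetical. In the described system each role would be LLM-backed, and the jurors would consult retrieved evidence and debate before voting. Importance weighting is left out here.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical juror interface: takes a claim, returns True if it judges it supported.
Juror = Callable[[str], bool]


def clerk_split(response: str) -> List[str]:
    """Stand-in for the Clerk: split a response into sentence-level claims.

    Assumption: a real Clerk would use an LLM to extract atomic claims.
    """
    return [s.strip() for s in response.split(".") if s.strip()]


def judge_majority(votes: List[bool]) -> bool:
    """Stand-in for the Judge: strict majority vote (ties count as unsupported)."""
    counts = Counter(votes)
    return counts[True] > counts[False]


def run_pipeline(response: str, jurors: List[Juror]) -> List[Tuple[str, bool]]:
    """Clerk -> Jury -> Judge, returning one (claim, verdict) pair per claim."""
    results = []
    for claim in clerk_split(response):
        votes = [juror(claim) for juror in jurors]
        results.append((claim, judge_majority(votes)))
    return results


if __name__ == "__main__":
    # Toy jurors checking against a tiny hand-written fact set (illustrative only).
    known = {"Paris is the capital of France"}
    jurors = [lambda c: c in known, lambda c: True, lambda c: c in known]
    text = "Paris is the capital of France. The Seine flows through Berlin."
    for claim, verdict in run_pipeline(text, jurors):
        print(verdict, "|", claim)
```

Passing the jurors in as plain callables keeps the sketch independent of any particular LLM or retrieval backend.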

System Modules

Clerk Agent

Decomposes the long-form response into multiple atomic claims

Model or implementation: Not explicitly specified (implied LLM)

Jury (Evaluator Agents)

Assess factuality of atomic claims through external retrieval and multi-agent debate

Model or implementation: Diverse roles (assumed LLM-based)

Judge Agent

Aggregates individual jury evaluations to produce a final factuality prediction

Model or implementation: LLM

Novel Architectural Elements

Integration of a 'fact importance hierarchy' model directly into the evaluation metric calculation
Structured multi-agent debate specifically designed for claim verification (Clerk-Jury-Judge topology)

Modeling

Base Model: DeepSeek-V3 (used for dataset construction and verification tasks)

Training Method: Not reported in the paper (Evaluation system focus)

Adaptation: None

Trainable Parameters: None

Compute: Not reported in the paper

Comparison to Prior Work

vs. SAFE: SAFE uses a single model and treats all facts equally; MAD-Fact uses multi-agent debate and weights facts by importance
vs. FActScore: MAD-Fact adds importance weighting and multi-agent verification to mitigate single-verifier hallucinations
vs. FactBench: FactBench considers verifiability but not the hierarchical importance of facts [not cited in paper]

Limitations

Dependency on the performance of the underlying LLMs used for the agents (Clerk, Jury, Judge)
Computational cost of multi-agent debate is higher than single-model verification
Retrieval reliability (Google Search API) impacts the knowledge base quality
Evaluation is currently focused on Chinese and English; other languages not tested

Reproducibility

The paper describes the dataset construction pipeline and the MAD-Fact system architecture in detail. However, code URLs and specific prompt templates are not provided in the text. The dataset (LongHalluQA) is described but no download link is explicitly provided in the excerpt.

📊 Experiments & Results

Evaluation Setup

Benchmarking mainstream LLMs on long-form factuality tasks

Benchmarks:

LongHalluQA (Chinese long-form factuality generation) [New]
LongFact (long-form factuality generation; English focus implied)

Metrics:

Weighted F1 score (incorporating fact importance)
Pearson correlation (r) with human judgment
Statistical methodology: Pearson correlation coefficient reported with p-value
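
The summary does not give the exact formula for the weighted F1, so the following is only one plausible way to fold importance weights into precision and recall. The function name `weighted_f1`, the `(weight, supported)` claim representation, and the `reference_weight` parameter (the total weight of facts a complete answer is expected to cover) are assumptions made for illustration.

```python
from typing import List, Tuple


def weighted_f1(claims: List[Tuple[float, bool]], reference_weight: float) -> float:
    """Importance-weighted F1 over (weight, supported) claim pairs.

    Assumptions (not taken from the paper):
      - precision = supported weight / total weight of claims in the response
      - recall    = supported weight / reference_weight, capped at 1.0
    """
    total = sum(w for w, _ in claims)
    supported = sum(w for w, ok in claims if ok)
    if total == 0 or reference_weight <= 0 or supported == 0:
        return 0.0
    precision = supported / total
    recall = min(1.0, supported / reference_weight)
    # Harmonic mean of weighted precision and weighted recall.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative weights: central fact = 3, auxiliary fact = 1.
    print(weighted_f1([(3.0, True), (1.0, False)], reference_weight=4.0))
    print(weighted_f1([(3.0, False), (1.0, True)], reference_weight=4.0))
```

With these illustrative weights, a response that gets only the auxiliary fact wrong scores 0.75, while one that gets only the central fact wrong scores 0.25. With equal weights both responses would score 0.5, which is the behavior the weighted metric is meant to avoid.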

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Correlation analysis demonstrating the effectiveness of the proposed weighted metric compared to human judgment.
Human Judgment Correlation	Pearson r	Not reported in the paper	0.701	Not reported in the paper
Dataset statistics highlighting the scale and complexity of the new benchmark.
LongHalluQA vs Original Datasets	Average Response Length (relative to original short-text datasets)	1.0×	9.4×	+8.4×
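
For readers unfamiliar with the correlation figure in the table, Pearson r measures the linear agreement between two score lists, here metric scores and human ratings. The sketch below computes it from first principles; the function name `pearson_r` and the sample scores are made up for illustration and are not data from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    metric_scores = [0.20, 0.55, 0.90, 0.40]  # hypothetical metric outputs
    human_ratings = [1.0, 3.0, 4.0, 2.5]      # hypothetical human ratings
    print(round(pearson_r(metric_scores, human_ratings), 3))
```

A value near 1 means the metric ranks and spaces responses much as the human raters do; a value near 0 means little linear relationship.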

Experiment Figures

Illustration of the Fact Importance Hierarchy.

Construction pipeline for LongHalluQA.

Main Takeaways

Larger LLMs generally maintain higher factual consistency in long-form generation.
Domestic (Chinese) models perform better on Chinese-specific content compared to general non-Chinese models.
The proposed weighted evaluation metric aligns better with human perception of quality than unweighted metrics, as it distinguishes between central and auxiliary errors.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with atomic claim decomposition
Knowledge of Multi-Agent Systems (MAS)

Key Terms

LongHalluQA: The proposed Chinese long-form factuality dataset constructed via knowledge-base expansion

MAD-Fact: Multi-Agent Debate for Factual verification—the proposed evaluation system

Atomic Claim: The smallest indivisible unit of information in a text that can be independently verified

RAG: Retrieval-Augmented Generation—using external data retrieval to ground LLM outputs

Pyramid Method: A content evaluation method that organizes information hierarchically based on importance

F1 score: The harmonic mean of precision and recall

Hallucination: Generated content that is nonsensical or unfaithful to the provided source or factual reality