FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

📝 Paper Summary

Fact-checking evaluation, Multi-agent simulation, Hallucination detection

Fact-Audit is a multi-agent framework that dynamically generates and updates fact-checking test cases to adaptively uncover LLM weaknesses in both verdict prediction and justification reasoning.

Core Problem

Existing automated fact-checking evaluations rely on static datasets that risk data leakage, fail to adapt to model-specific weaknesses, and often oversimplify evaluation to binary classification accuracy without assessing reasoning.

Why it matters:

Static benchmarks (like FactCheckGP) quickly become obsolete or contaminated in pre-training data, leading to inflated performance scores (leaderboard swamping)
Measuring only accuracy ignores whether the model arrived at the correct verdict through valid reasoning or hallucinated logic
Manual annotation of new fact-checking scenarios is labor-intensive and cannot scale to cover the long-tail knowledge distribution of diverse LLMs

Concrete Example: When verifying a complex claim, a model might guess the correct verdict 'Refuted' but provide a nonsensical justification. Standard accuracy metrics count this as a success, whereas Fact-Audit detects the flaw by evaluating the justification and generating new, similar test cases to probe this specific reasoning failure.

Key Novelty

Importance Sampling for Adaptive Auditing

Treats dataset generation as an importance sampling process, where a proposal distribution is iteratively updated to focus on areas where the target model performs poorly (high error rate regions)
Uses a multi-agent loop (Appraiser, Inquirer, Quality Inspector, Evaluator) to autonomously discover new fact-checking scenarios and generate challenging synthetic test cases based on feedback from previous failures
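
The adaptive-sampling idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the `update_proposal` and `sample_scenarios` helpers, the smoothing constant, and the audit counts are hypothetical, and the update rule (proposal weight proportional to a smoothed observed error rate) is just one simple assumption about how a proposal distribution could be shifted toward high-error regions.

```python
import random

def update_proposal(errors, trials, smoothing=1.0):
    """Return a proposal distribution over scenarios, weighted by smoothed error rate.

    errors[s] / trials[s] is the observed failure rate of the target model on
    scenario s. Smoothing keeps rarely tested scenarios reachable.
    """
    rates = {
        s: (errors.get(s, 0) + smoothing) / (trials[s] + 2 * smoothing)
        for s in trials
    }
    total = sum(rates.values())
    return {s: r / total for s, r in rates.items()}

def sample_scenarios(proposal, k, rng):
    """Draw k scenario names according to the proposal distribution."""
    names = list(proposal)
    weights = [proposal[s] for s in names]
    return rng.choices(names, weights=weights, k=k)

# Hypothetical audit statistics (not taken from the paper).
trials = {"numerical": 10, "temporal": 10, "multihop": 10}
errors = {"numerical": 1, "temporal": 5, "multihop": 8}

proposal = update_proposal(errors, trials)
batch = sample_scenarios(proposal, k=5, rng=random.Random(0))
```

Under these assumed counts, the scenario with the most observed failures receives the largest share of the next batch, while the smoothing term prevents any scenario from dropping to zero probability.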

Architecture

The Fact-Audit framework workflow, detailing the interaction between the four agents (Appraiser, Inquirer, Quality Inspector, Evaluator) and the iterative feedback loop.

Evaluation Highlights

Fact-Audit reveals that GPT-4o's verdict accuracy drops on adaptively generated cases compared to static prototypes (by about 9 and 17 percentage points in the two reported comparisons: 0.7600 → 0.6667 and 0.9385 → 0.7727), exposing hidden weaknesses
Identifies that open-source models like Llama-3-70B-Instruct lag behind GPT-4o by notable margins in justification quality, even when verdict accuracy is comparable
Demonstrates that iterative probing uncovers twice as many failure cases in 'wisdom of crowds' scenarios compared to standard random sampling

Breakthrough Assessment

8/10

Significant shift from static to dynamic evaluation for fact-checking. The formalization as importance sampling and the focus on justification quality address major gaps in current benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Automated auditing of an LLM's fact-checking capabilities via dynamic test case generation

Inputs: A target LLM to be evaluated and an initial set of fact-checking scenarios (complex claims, fake news, rumors)

Outputs: An adaptive dataset of test cases, evaluation scores (verdict + justification), and updated scenario taxonomy focusing on model weaknesses

Pipeline Flow

Group 1: Prototype Generation (Appraiser → Inquirer → Quality Inspector)
Group 2: Evaluation & Probing (Evaluator → Prober)
Group 3: Adaptive Update (Appraiser updates Taxonomy)
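
The three groups above can be sketched as a single loop. This is a toy illustration under stated assumptions, not the released code: every agent is passed in as a plain Python callable, the `FactCase` record and `run_audit` function are hypothetical names, the Evaluator is reduced to a pass/fail check on the verdict, and the "Supported" label is assumed alongside the "Refuted" label mentioned in the summary.

```python
from dataclasses import dataclass

@dataclass
class FactCase:
    scenario: str
    claim: str
    evidence: str
    label: str

def run_audit(appraiser, inquirer, inspector, target, evaluator, prober,
              taxonomy, rounds=2):
    """Run the three pipeline groups for a fixed number of rounds."""
    history = []
    for _ in range(rounds):
        # Group 1: prototype generation, filtered by the quality inspector.
        prototypes = [c for s in taxonomy for c in inquirer(s) if inspector(c)]
        # Group 2: evaluate prototypes, then probe around the failures.
        results = [(c, evaluator(c, target(c))) for c in prototypes]
        failures = [c for c, ok in results if not ok]
        probed = [p for c in failures for p in prober(c) if inspector(p)]
        results += [(p, evaluator(p, target(p))) for p in probed]
        history.extend(results)
        # Group 3: the appraiser updates the taxonomy from this round's results.
        taxonomy = appraiser(taxonomy, results)
    return taxonomy, history

# Toy stand-ins for the agents (all hypothetical).
def inquirer(s): return [FactCase(s, f"claim about {s}", "evidence", "Refuted")]
def inspector(c): return bool(c.evidence)
def target(c): return "Refuted" if c.scenario == "numerical" else "Supported"
def evaluator(c, verdict): return verdict == c.label
def prober(c): return [FactCase(c.scenario, c.claim + " (variant)", c.evidence, c.label)]
def appraiser(taxonomy, results):
    failed = {c.scenario for c, ok in results if not ok}
    return [s for s in taxonomy if s in failed] or taxonomy

final_taxonomy, history = run_audit(
    appraiser, inquirer, inspector, target, evaluator, prober,
    taxonomy=["numerical", "temporal"],
)
```

In this toy run the target model only fails on the "temporal" scenario, so the stub appraiser narrows the taxonomy to that scenario for the second round.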

System Modules

Appraiser (Prototype Generation)

Develops and updates the taxonomy of fact-checking scenarios based on model performance

Model or implementation: GPT-4o (implied; not stated explicitly)

Inquirer (Prototype Generation)

Generates specific prototype test cases (claim + evidence) based on the taxonomy

Model or implementation: GPT-4o (implied; not stated explicitly)

Quality Inspector (Prototype Generation)

Validates the quality and reliability of generated test cases using tools

Model or implementation: GPT-4o (implied; not stated explicitly) + Wikipedia API

Evaluator (Evaluation & Probing)

Scores the target LLM's verdict and justification quality

Model or implementation: GPT-4o (LLM-as-a-Judge)

Prober (Evaluation & Probing)

Generates new, harder test cases by perturbing instances where the model failed

Model or implementation: GPT-4o (implied)

Novel Architectural Elements

Iterative loop where the 'Appraiser' agent modifies the input taxonomy based on 'Evaluator' feedback to guide the 'Inquirer' toward model weaknesses (implementing importance sampling)
Integration of a 'Quality Inspector' that uses external tools (Wiki API) to validate synthetic evidence before evaluation

Modeling

Base Model: The framework is model-agnostic; experiments use GPT-4o as the agent backbone

Comparison to Prior Work

vs. FactCheckGP: Fact-Audit is dynamic and iterative, updating test cases based on model failure, whereas FactCheckGP generates a fixed dataset once.
vs. Static Benchmarks (USB): Fact-Audit evaluates justification quality, not just binary verdict accuracy.
vs. Adversarial Attacks (general) [not cited in paper]: Unlike general adversarial attacks that add noise, Fact-Audit generates semantically valid, harder fact-checking scenarios.

Limitations

Relies heavily on the capability of the backbone agent (GPT-4o) to generate high-quality adversarial cases; if the backbone is weak, the audit is weak.
The 'oracle' distribution p(x) is approximated, and the upper bound calculation relies on this approximation.
Computational cost is high due to the multi-agent iterative process compared to static evaluation.
Requires external tool access (Wikipedia API) for quality inspection, which may limit offline usage.

Reproducibility

Code: https://github.com/DanielLin97/FACT-AUDIT

The framework uses GPT-4o as the backbone for agents. Specific prompt templates are not detailed in the main text but are likely available in the repository.

📊 Experiments & Results

Evaluation Setup

Evaluated 13 LLMs across 3 settings: [claim] (internal knowledge), [evidence] (provided wiki evidence), and [wisdom of crowds] (social media comments).

Benchmarks:

Fact-Audit Generated Data (Fact Verification & Justification) [New]

Metrics:

Fact-checking Accuracy (Verdict)
Justification Score (1-10 scale via LLM-Judge)
Audit Score (Combined metric)
Statistical methodology: Not explicitly reported in the paper
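
As a rough illustration of how the first two metrics relate, the sketch below computes verdict accuracy and mean justification score from per-case records, and counts correct verdicts that come with weak justifications. The record layout, the `summarize` helper, the example scores, and the weakness threshold of 5.0 are all assumptions for illustration. The combined Audit Score is not reproduced because its formula is not given in this summary.

```python
from statistics import mean

def summarize(records, weak_threshold=5.0):
    """Summarize (gold_label, predicted_label, justification_score) records.

    justification_score is assumed to be a 1-10 rating from an LLM judge.
    """
    records = list(records)
    correct = [gold == pred for gold, pred, _ in records]
    accuracy = sum(correct) / len(records)
    mean_justification = mean(score for _, _, score in records)
    # Correct verdicts whose justification the judge rated as weak.
    correct_but_weak = sum(
        1 for ok, (_, _, score) in zip(correct, records)
        if ok and score < weak_threshold
    )
    return {
        "accuracy": accuracy,
        "mean_justification": mean_justification,
        "correct_but_weak": correct_but_weak,
    }

# Hypothetical records, not data from the paper.
records = [
    ("Refuted", "Refuted", 9.0),
    ("Refuted", "Refuted", 3.0),    # right verdict, weak justification
    ("Supported", "Refuted", 2.0),
    ("Supported", "Supported", 8.0),
]
summary = summarize(records)
```

The second record is the kind of case that accuracy alone would count as a success, which is why the summary reports justification quality separately.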

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on the 'Wisdom of Crowds' setting, comparing Verdict Accuracy vs. Justification Quality.
Fact-Audit (Wisdom of Crowds)	Verdict Accuracy	0.7636	0.7727	+0.0091
Fact-Audit (Wisdom of Crowds)	Justification Score	7.73	8.55	+0.82
Impact of Adaptive Auditing: Comparing performance on initial 'Prototype' data vs. harder 'Probed' data.
Fact-Audit (Probing Effect)	Verdict Accuracy (GPT-4o)	0.7600	0.6667	-0.0933
Fact-Audit (Probing Effect)	Verdict Accuracy (GPT-4o)	0.9385	0.7727	-0.1658
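
The Δ column can be checked directly from the two value columns. The short script below recomputes it from the figures in the table. The row names are shortened labels, and the "row 1" / "row 2" suffixes are added here only to tell apart the two identically labeled probing rows.

```python
# (row name, first value column, second value column), copied from the table.
rows = [
    ("Wisdom of Crowds, Verdict Accuracy", 0.7636, 0.7727),
    ("Wisdom of Crowds, Justification Score", 7.73, 8.55),
    ("Probing Effect, Verdict Accuracy (GPT-4o), row 1", 0.7600, 0.6667),
    ("Probing Effect, Verdict Accuracy (GPT-4o), row 2", 0.9385, 0.7727),
]

# Delta is the second value minus the first, rounded to four decimals.
deltas = [round(second - first, 4) for _, first, second in rows]

for (name, first, second), delta in zip(rows, deltas):
    print(f"{name}: {first} -> {second} ({delta:+.4f})")
```

The recomputed differences agree with the reported Δ values, so the two probing rows correspond to drops of roughly 9.3 and 16.6 percentage points in verdict accuracy.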

Experiment Figures

Radar charts comparing 5 major LLMs across 6 dimensions of fact-checking capability (e.g., Numerical Reasoning, Temporal Reasoning, Multihop).

Main Takeaways

Models often achieve high verdict accuracy but low justification scores, indicating they may be guessing or using shallow heuristics rather than robust reasoning.
The iterative probing mechanism yields lower model scores than the static prototypes, which suggests it can surface 'blind spots' in the model's capabilities.
Open-source models (like Llama-3) compete well on verdict prediction but consistently lag behind proprietary models (GPT-4o) in generating coherent justifications.
The framework effectively differentiates between models that look similar on static leaderboards by exposing reasoning flaws in dynamic scenarios.

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs) and hallucination
Familiarity with fact-checking tasks (veracity prediction)
Knowledge of Monte Carlo methods and Importance Sampling

Key Terms

Importance Sampling: A statistical technique used here to generate test data that focuses on the 'long tail' of failure cases rather than random samples from the general distribution

Wisdom of Crowds: A fact-checking setting where user comments or social media threads serve as auxiliary information/evidence to verify a claim

Justification Production: The ability of a model to provide a logical explanation or evidence summary supporting its verdict, not just the label (True/False)

Prototype Emulation: The initial phase where agents generate a seed set of test cases based on a taxonomy before adaptive probing begins

LLM-as-a-Judge: Using a strong LLM (like GPT-4) to score the outputs of other models, used here by the Evaluator agent to grade justifications
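
For readers less familiar with the Importance Sampling entry above, the standard identity is shown below. The notation is an assumption for illustration rather than the paper's own: p(x) is the target ("oracle") distribution over test cases, q(x) is the proposal distribution that the framework shifts toward failure regions, and f(x) is any quantity of interest, such as an indicator that the model fails on case x.

```latex
\mathbb{E}_{x \sim p}\bigl[f(x)\bigr]
  = \int f(x)\, p(x)\, dx
  = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \mathbb{E}_{x \sim q}\!\left[f(x)\, \frac{p(x)}{q(x)}\right]
  \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i)\, \frac{p(x_i)}{q(x_i)},
  \qquad x_i \sim q .
```

The identity requires q(x) > 0 wherever f(x) p(x) is nonzero. Sampling from a q that concentrates on high-error regions reduces the number of draws needed to observe failures, while the weights p(x)/q(x) correct for the shifted sampling.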