AutoHall: Automated Factuality Hallucination Dataset Generation for Large Language Models

📝 Paper Summary

Hallucination detection · Benchmark dataset creation

AutoHall automatically constructs model-specific hallucination datasets by prompting LLMs to generate references for known claims and then detecting contradictions between those references and ground-truth labels.

Core Problem

Existing hallucination datasets require expensive manual annotation and become obsolete quickly because hallucinations are model-specific; a claim hallucinated by Llama-2 might be answered correctly by GPT-4.

Why it matters:

Manual annotation is laborious, expensive, and difficult to scale across the rapid release of new LLMs
Datasets are time-sensitive; model upgrades (e.g., GPT-3.5 to GPT-4) change hallucination patterns, rendering old static benchmarks ineffective
Different models exhibit distinct types and rates of hallucination, necessitating model-specific evaluation rather than a 'one-size-fits-all' dataset

Concrete Example: When describing Jo Nesbø’s novel 'The Leopard', ChatGPT fabricates plot details not present in the book. A static dataset might not capture this specific error for a newer model, or might penalize a model that describes the book correctly.

Key Novelty

Automated Model-Specific Dataset Construction & Self-Contradiction Detection

Leverages existing fact-checking datasets (claims + labels) to prompt LLMs to generate 'references'; if a generated reference leads to an incorrect factuality classification, it is labeled as a hallucination
Introduces a zero-resource detection method that checks for 'self-contradiction' by comparing the original response against multiple new responses generated from functionally similar prompts

Architecture

The three-step pipeline for AutoHall dataset construction.

Evaluation Highlights

Proposed method outperforms SelfCheckGPT variants (BERTScore, NLI, Prompt) by margins ranging from +1.3% to +17% in AUC-ROC on Llama-2-70b
Estimated prevalence of factuality hallucination in current LLMs (ChatGPT, Llama 2) is between 20% and 30%
Demonstrates that LLMs are particularly susceptible to hallucinations in domain-specific topics like history, technology, and geography

Breakthrough Assessment

7/10

Offers a practical, scalable solution to the 'static benchmark' problem by automating dataset creation. The method is logical and effective, though primarily an engineering integration of existing concepts (self-consistency/contradiction) rather than a fundamental theoretical shift.

⚙️ Technical Details

Problem Definition

Setting: Binary classification of LLM outputs as either Factual or Hallucinatory

Inputs: User query Q=(P, X) containing prompt P and claim X, and the model-generated response Y

Outputs: Binary label: Hallucination (Y ∈ H) or Factual (Y ∈ H_bar)

Pipeline Flow

Reference Generation: Prompt LLM to generate references for claims
Claim Classification: Prompt LLM to classify claim as True/False based on generated reference
Hallucination Collection: Compare classification to ground truth; mismatches = hallucination
Detection Phase: At inference time, check the original response for self-contradiction against newly generated responses to functionally similar prompts
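
The construction steps above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a response string, the two prompt templates are placeholders (the paper gives its exact prompts in figures), and `parse_label` assumes the classifier's answer begins with "True" or "False".

```python
from typing import Callable, Dict, List

# Placeholder templates (assumptions); the paper's real prompts are in its figures.
REFERENCE_PROMPT = "Provide a reference for the following claim: {claim}"
CLASSIFY_PROMPT = (
    "Claim: {claim}\nReference: {reference}\n"
    "Based only on the reference, is the claim true or false? Answer True or False."
)


def parse_label(answer: str) -> bool:
    """Map a free-text answer to a boolean; assumes it starts with True/False."""
    return answer.strip().lower().startswith("true")


def build_dataset(
    claims: List[Dict[str, object]], llm: Callable[[str], str]
) -> List[Dict[str, object]]:
    """Run reference generation, claim classification, and mismatch labelling.

    Each input item is expected to look like {"claim": str, "label": bool},
    where "label" is the ground truth from a fact-checking dataset.
    """
    records = []
    for item in claims:
        claim = str(item["claim"])
        # Step 1: the target model writes a reference for the claim.
        reference = llm(REFERENCE_PROMPT.format(claim=claim))
        # Step 2: the same model classifies the claim from its own reference.
        answer = llm(CLASSIFY_PROMPT.format(claim=claim, reference=reference))
        predicted = parse_label(answer)
        # Step 3: a mismatch with ground truth marks the reference as hallucinated.
        records.append(
            {
                "claim": claim,
                "reference": reference,
                "hallucination": predicted != bool(item["label"]),
            }
        )
    return records
```

Because Step 2 relies on the model's own reasoning, a mismatch can also come from a classification error rather than a fabricated reference, which is the label-noise caveat listed under Limitations.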

System Modules

Reference Generator (Dataset Construction)

Generate supporting text (references) for a given claim from a fact-checking dataset

Model or implementation: Target LLM (e.g., ChatGPT, Llama-2)

Claim Classifier (Dataset Construction)

Determine if the claim is True or False based *only* on the generated reference

Model or implementation: Target LLM (same as generator)

Contradiction Detector

Compare original response against K alternatives to find inconsistencies

Model or implementation: Target LLM

Novel Architectural Elements

Automated loop using the model itself to validate its own hallucinations against ground truth labels from fact-checking datasets
Self-contradiction detection using functionally similar prompt perturbations rather than just sampling temperature variations
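
The self-contradiction check can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: `llm` is a hypothetical prompt-to-text callable, the prompt variants and the contradiction prompt are placeholders, and the aggregation rule (fraction of contradicting alternatives compared with a threshold) is one plausible choice that the summary above does not specify.

```python
from typing import Callable, Sequence

# Placeholder "functionally similar" prompts (assumptions, not the paper's wording).
PROMPT_VARIANTS = (
    "Provide a reference for the following claim: {claim}",
    "Give background evidence relevant to this claim: {claim}",
    "Cite supporting material for the claim below.\n{claim}",
)
CONTRADICTION_PROMPT = (
    "Text A: {a}\nText B: {b}\n"
    "Do these two texts contradict each other? Answer Yes or No."
)


def contradiction_score(
    claim: str,
    original: str,
    llm: Callable[[str], str],
    variants: Sequence[str] = PROMPT_VARIANTS,
) -> float:
    """Fraction of K alternative responses that the model says contradict `original`."""
    votes = 0
    for template in variants:
        # Sample one alternative response per perturbed prompt.
        alternative = llm(template.format(claim=claim))
        # Ask the same model whether the pair is contradictory.
        verdict = llm(CONTRADICTION_PROMPT.format(a=original, b=alternative))
        votes += int(verdict.strip().lower().startswith("yes"))
    return votes / len(variants)


def is_hallucination(score: float, threshold: float = 0.0) -> bool:
    """Flag the original response if the score exceeds the (assumed) threshold."""
    return score > threshold
```

With the default threshold of 0.0, a single contradicting alternative is enough to flag the response. The continuous score can also be used directly when computing a threshold-free metric such as AUC-ROC.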

Modeling

Base Model: Evaluated on ChatGPT (gpt-3.5-turbo-0613), Llama-2-7b-chat, Llama-2-13b-chat, Llama-2-70b-chat

Training Method: Inference-only evaluation and dataset generation

Compute: Not reported in the paper

Comparison to Prior Work

vs. SelfCheckGPT: AutoHall constructs the evaluation dataset automatically rather than using fixed WikiBio data; detection uses prompt perturbations rather than just sampling variations
vs. FACTOOL/CRITIC: AutoHall is zero-resource (no external search/tools required) and focuses on internal consistency
vs. HaluEval [not cited in paper]: HaluEval uses a separate model (ChatGPT) to generate synthetic hallucinations that are then used to evaluate other models; AutoHall makes the *target* model generate its own hallucinations to ensure model-specificity

Limitations

Relies on the LLM's ability to correctly classify claims based on references (Step 2); if the model fails this reasoning step, labels may be noisy
The 'self-contradiction' method assumes that self-consistency implies truth, which does not always hold: a model can repeat the same hallucination consistently across prompts, and such cases go undetected
Focuses primarily on 'factuality' hallucination, potentially missing other types like faithfulness to user instructions
Requires existing fact-checking datasets with ground truth labels as seeds

Reproducibility

Code: https://github.com/zouyingcao/AutoHall

Code and datasets are publicly available at https://github.com/zouyingcao/AutoHall. The paper utilizes public fact-checking datasets as seeds. Exact prompts for generation and classification are provided in figures.

📊 Experiments & Results

Evaluation Setup

Detection of hallucinations in references generated by LLMs for fact-checking claims

Benchmarks:

AutoHall-Generated Dataset (Hallucination Detection) [New]

Metrics:

AUC-ROC (Area Under the Receiver Operating Characteristic Curve)
Statistical methodology: Not explicitly reported in the paper
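
For readers unfamiliar with the metric, AUC-ROC equals the probability that a randomly chosen positive example (here, a hallucinated response) receives a higher detector score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that pairwise definition; the function name and the toy inputs are illustrative and are not data from the paper.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: 1 for hallucination, 0 for factual.
    scores: detector outputs, higher meaning "more likely hallucinated".
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

This quadratic version is meant for clarity on small inputs; a rank-based implementation would be the usual choice for large evaluation sets.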

Experiment Figures

An example of ChatGPT generating hallucinations about Jo Nesbø’s novel 'The Leopard', illustrating the problem context.

The inference-time hallucination detection method using self-contradiction.

Main Takeaways

Hallucination rates are significant (20-30%) across both open-source (Llama-2) and closed-source (ChatGPT) models
The proposed self-contradiction detection method consistently outperforms baselines (SelfCheckGPT variants) on the constructed datasets
Specific domains like history, technology, and geography trigger higher hallucination rates compared to others

📚 Prerequisite Knowledge

Prerequisites

Basic understanding of Large Language Models (LLMs)
Familiarity with the concept of Hallucination in NLP (specifically factuality)
Knowledge of Zero-Resource / Black-Box detection methods

Key Terms

AutoHall: The proposed automated pipeline for constructing model-specific hallucination datasets using existing fact-checking claims

Factuality Hallucination: The phenomenon where an LLM produces seemingly plausible but factually inaccurate or fabricated information

Self-contradiction: A detection method where the model is queried multiple times with similar prompts; conflicts among responses suggest the model is hallucinating

Zero-resource detection: Hallucination detection methods that rely only on the model's own outputs or internal states, without external knowledge bases (like Google Search)

Hallucinatory Reference: Supporting information generated by the LLM to substantiate a claim, which is factually incorrect or fabricated

AUC-ROC: Area Under the Receiver Operating Characteristic Curve, a threshold-independent metric for binary classifiers that summarizes performance across all decision thresholds (0.5 corresponds to random ranking, 1.0 to perfect ranking)