HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

📝 Paper Summary

Hallucination evaluation, Factuality benchmarking

HALOGEN is a comprehensive benchmark evaluating 14 LLMs across 9 domains using automated atomic fact verification, revealing high hallucination rates and classifying errors based on their presence in pretraining data.

Core Problem

Measuring LLM hallucinations is difficult due to the open-ended nature of generation and the high cost of human verification, leaving the root causes (training data vs. fabrication) largely unknown.

Why it matters:

Hallucinations in critical domains like code generation and scientific attribution pose security risks and spread misinformation
Current benchmarks often lack diversity, focusing on single domains or failing to distinguish between incorrect recollection and pure fabrication
Understanding whether hallucinations stem from incorrect training data or model over-generalization is essential for building trustworthy models

Concrete Example: When asked to write a Python program to 'stack columns to rows', a model might import a non-existent library (e.g., 'pandas_reshaper'). HALOGEN detects this by extracting the import statements from the generated code and checking each imported package against the PyPI index.
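The check described in this example can be sketched in a few lines. This is a minimal illustration, not the benchmark's verifier: it parses imports with Python's standard `ast` module (the summary below mentions regex for code imports), and `KNOWN_PACKAGES` is a hypothetical hard-coded stand-in for a real PyPI index lookup. The function names are also illustrative.

```python
import ast

# Hypothetical stand-in for a package index lookup; a real verifier would
# consult the PyPI index rather than a hard-coded set.
KNOWN_PACKAGES = {"pandas", "numpy", "requests"}


def extract_imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a Python source string."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0), which refer to local modules.
            if node.module and node.level == 0:
                packages.add(node.module.split(".")[0])
    return packages


def find_hallucinated_packages(source: str, known: set[str]) -> set[str]:
    """Return imported packages that are absent from the known-package set."""
    return extract_imported_packages(source) - known


generated = "import pandas as pd\nimport pandas_reshaper\n\ndf = pd.DataFrame()\n"
print(find_hallucinated_packages(generated, KNOWN_PACKAGES))
```

The sketch ignores two complications a production verifier would need to handle: standard-library modules, and packages whose import name differs from their distribution name on the index.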

Key Novelty

HALOGEN (Evaluating Hallucinations of Generative Models)

A multi-domain benchmark (10k+ prompts) covering both response-based tasks (e.g., coding) and refusal-based tasks (e.g., historical meetings that never happened)
Automated high-precision verifiers that decompose text into atomic units (e.g., specific citations, code imports) and check them against external knowledge sources or entailment models
A novel taxonomy of hallucination causes: Type A (failed recall of correct training data), Type B (recollection of incorrect training data), and Type C (fabrication not in training data)

Architecture

The HALOGEN evaluation workflow for two example domains: Code Generation and Scientific Attribution.

Evaluation Highlights

Even the best-performing models, such as GPT-4, hallucinate substantially, with hallucination rates of up to 86% in some domains
Models frequently fail to abstain when they should: GPT-4 answers 29% of refusal-based prompts (like false historical meetings) instead of declining
Hallucinated code packages often appear in pretraining data (Type B errors), accounting for up to 72% of package hallucinations for Llama-3-70B, whereas historical hallucinations are often fabrications (Type C)

Breakthrough Assessment

9/10

Extensive scale (150k generations), diverse domains (code, science, history), and a novel causal analysis framework linking hallucinations back to pretraining corpora make this a significant resource.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of generative LLM factuality across diverse open-ended and grounded tasks

Inputs: Prompt x belonging to a task set X (either response-based or refusal-based)

Outputs: Model response y composed of atomic facts P_y, and a Utility Score combining factuality and appropriate refusal

Pipeline Flow

Input Prompt Generation (spanning 9 domains)
Model Generation
Decomposition Engine (breaks response into atomic units)
Verification (checks each unit against external source/tool)
Metric Calculation (Hallucination Score, Response Ratio, Utility Score)
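The decompose, verify, and score steps above can be sketched as a small pipeline. This is an illustrative skeleton under stated assumptions: the `Verdict` class, the function names, the sentence-splitting decomposer, and the `toy_facts` lookup are all hypothetical placeholders for the task-specific components the benchmark uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Verdict:
    unit: str        # one atomic unit extracted from the response
    supported: bool  # whether the verifier found support for it


def evaluate_response(
    response: str,
    decompose: Callable[[str], Iterable[str]],
    verify: Callable[[str], bool],
) -> list[Verdict]:
    """Decompose a response into atomic units and verify each one."""
    return [Verdict(unit, verify(unit)) for unit in decompose(response)]


def hallucination_score(verdicts: list[Verdict]) -> float:
    """Fraction of atomic units that are unsupported.

    Returning 0.0 for an empty list is a choice made for this sketch.
    """
    if not verdicts:
        return 0.0
    return sum(not v.supported for v in verdicts) / len(verdicts)


# Toy components: split on periods, verify by exact lookup in a small fact set.
toy_facts = {"Paris is in France"}
response = "Paris is in France. Paris is the capital of Spain."
verdicts = evaluate_response(
    response,
    decompose=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    verify=lambda unit: unit in toy_facts,
)
print(hallucination_score(verdicts))  # 0.5
```

Passing the decomposer and verifier as callables mirrors the task-specific design: the same scoring code can serve a code-import checker and a text entailment checker.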

System Modules

Prompt Generator

Generate 10,923 prompts across 9 scenarios (Code, Summarization, Biography, etc.)

Model or implementation: Various (curated from datasets like StackOverflow, CNN/DM, SciFact)

Decomposition Engine (Evaluation)

Break down model response into verifiable atomic units

Model or implementation: Task-specific (e.g., GPT-3.5 for text, regex for code imports)

Verifier (Evaluation)

Verify the factuality of each atomic unit against a ground truth source

Model or implementation: Task-specific (e.g., PyPI index for code, Semantic Scholar for citations, Entailment model for text)

Novel Architectural Elements

Dual-track evaluation framework supporting both 'Response-Based' (must answer) and 'Refusal-Based' (must abstain) tasks within a single benchmark
Integration of WIMBD-based attribution to classify hallucinations into Type A/B/C errors based on pretraining data presence
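The Type A/B/C distinction can be read as a decision over two corpus lookups. The sketch below is an assumption-laden illustration: `appears_in_corpus` is a hypothetical substring search over an in-memory list, standing in for a real corpus index, and the precedence used when both the correct and the hallucinated fact appear in the corpus (Type B wins) is a choice made here, not something the summary specifies.

```python
from enum import Enum


class ErrorType(Enum):
    TYPE_A = "correct fact present in pretraining data (failed recall)"
    TYPE_B = "incorrect fact present in pretraining data"
    TYPE_C = "neither present in pretraining data (fabrication)"


def appears_in_corpus(text: str, corpus: list[str]) -> bool:
    """Hypothetical lookup: case-insensitive substring match over documents."""
    needle = text.lower()
    return any(needle in doc.lower() for doc in corpus)


def classify_hallucination(
    correct_fact: str, hallucinated_fact: str, corpus: list[str]
) -> ErrorType:
    """Assign an error type from corpus presence of the two facts.

    Assumption: if the hallucinated fact is in the corpus, label Type B
    even when the correct fact is also present.
    """
    if appears_in_corpus(hallucinated_fact, corpus):
        return ErrorType.TYPE_B
    if appears_in_corpus(correct_fact, corpus):
        return ErrorType.TYPE_A
    return ErrorType.TYPE_C


toy_corpus = [
    "Try the old pandas_reshaper helper for wide-to-long tables.",
    "The Eiffel Tower is in Paris.",
]
print(classify_hallucination("pandas.melt", "pandas_reshaper", toy_corpus).name)
print(classify_hallucination(
    "The Eiffel Tower is in Paris", "The Eiffel Tower is in Rome", toy_corpus
).name)
```

Exact substring matching is far cruder than a real attribution search; it is used only to keep the example self-contained.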

Modeling

Base Model: 14 LLMs are evaluated, namely Alpaca-7B, Falcon-40B, GPT-3.5, GPT-4, Llama-2 (7B/13B/70B), Llama-3 (8B/70B), Mistral-7B, Mixtral-8x7B, OLMo-7B, and RedPajama (3B/7B)

Comparison to Prior Work

vs. FactScore: HALOGEN extends beyond biographies to 9 domains including code and science, and includes refusal tasks
vs. TruthfulQA: HALOGEN focuses on open-ended generation and grounded tasks rather than multiple choice
vs. HaluEval: HALOGEN uses external tools (PyPI, Semantic Scholar) for verification rather than relying solely on LLM judgment
vs. FActCheck [not cited in paper]: HALOGEN incorporates a causal analysis of pretraining data to categorize error types (Type A/B/C)

Limitations

Reliability of benchmark scores depends on the accuracy of automated verifiers (though human agreement was high at >83%)
Attribution analysis is limited to models with public training data or identifiable open corpora (C4, OpenWebText)
Metrics do not account for coverage/recall (whether the model included all necessary information), only precision and refusal

Reproducibility

Code: https://halogen-hallucinations.github.io

Publicly available: HALOGEN benchmark dataset (prompts), verifier logic, and evaluation scripts. Pretraining data analysis relies on WIMBD and public corpora (C4, Dolma, OpenWebText).

📊 Experiments & Results

Evaluation Setup

Generative evaluation across 9 tasks: Code Packages, Summarization, Simplification, Biographies, Rationalization (Binary/Numerical), Scientific Attribution, Historical Events, False Presuppositions

Benchmarks:

HALOGEN (Multi-domain hallucination benchmark) [New]

Metrics:

Hallucination Score (percentage of generated facts that are unsupported)
Response Ratio (percentage of prompts the model attempts to answer)
Utility Score (composite metric rewarding factual answers and appropriate refusals)
Statistical methodology: Not explicitly reported in the paper
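To make the three metrics concrete, here is a small worked sketch. The aggregation choices are assumptions for illustration, not the paper's confirmed formulas: the Hallucination Score is pooled over all atomic units in answered prompts, and the per-prompt utility is taken as 1 minus the unsupported fraction for an answered response-based prompt, 0 for a refused response-based prompt, 1 for a refused refusal-based prompt, and 0 for an answered refusal-based prompt. The `Record` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    refusal_task: bool        # True if the model is expected to abstain
    responded: bool           # True if the model attempted an answer
    num_units: int = 0        # atomic units found in the response
    num_unsupported: int = 0  # units the verifier could not support


def response_ratio(records: list[Record]) -> float:
    """Fraction of prompts the model attempted to answer."""
    return sum(r.responded for r in records) / len(records)


def hallucination_score(records: list[Record]) -> float:
    """Unsupported units divided by all units, pooled over answered prompts."""
    answered = [r for r in records if r.responded]
    total = sum(r.num_units for r in answered)
    if total == 0:
        return 0.0
    return sum(r.num_unsupported for r in answered) / total


def utility(record: Record) -> float:
    """Assumed per-prompt utility (see the lead-in for the stated assumptions)."""
    if record.refusal_task:
        return 0.0 if record.responded else 1.0
    if not record.responded or record.num_units == 0:
        return 0.0
    return 1.0 - record.num_unsupported / record.num_units


def utility_score(records: list[Record]) -> float:
    return sum(utility(r) for r in records) / len(records)


records = [
    Record(refusal_task=False, responded=True, num_units=4, num_unsupported=1),
    Record(refusal_task=False, responded=False),
    Record(refusal_task=True, responded=False),
    Record(refusal_task=True, responded=True, num_units=2, num_unsupported=2),
]
print(response_ratio(records))       # 0.5
print(hallucination_score(records))  # 0.5
print(utility_score(records))        # 0.4375
```

In this toy run, half the prompts are answered, 3 of the 6 generated units are unsupported, and the utilities 0.75, 0, 1, and 0 average to 0.4375.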

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Performance on response-based tasks (lower Hallucination Score H is better).
HALOGEN (Code Packages)	Hallucination Score	0.02	0.06	+0.04
HALOGEN (Text Summarization)	Hallucination Score	0.03	0.03	0.00
HALOGEN (Biographies)	Hallucination Score	0.34	0.13	-0.21
Performance on refusal-based tasks (lower Response Ratio R is better, indicating the model correctly refused to answer a false premise).
HALOGEN (Historical Events)	Response Ratio	1.0	0.0	-1.0
HALOGEN (Scientific Attribution)	Response Ratio	0.67	0.48	-0.19
Causal analysis of hallucinations (Type B errors).
Code Generation	Training Data Coverage	38.36	72.41	+34.05

Experiment Figures

Counts of hallucination types (Type A/B/C) for historical event hallucinations across three models.

Analysis of Senator Search hallucinations (Type A errors) and Summarization error types.

Main Takeaways

No single domain predicts hallucination rates: models performing well on summarization may fail on code, highlighting the need for diverse benchmarks.
Large closed models (GPT-4) generally manage refusal better than open models, likely due to RLHF safety training.
Hallucination sources vary by domain: Code hallucinations are often Type B (deprecated/renamed packages found in training data), while historical hallucinations are often Type C (pure fabrication).
Content-grounded tasks (summarization) mostly suffer from intrinsic hallucinations (83%), where the model misinterprets the input, rather than extrinsic ones, where it introduces external falsehoods.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM hallucination types (intrinsic vs. extrinsic)
Familiarity with automated evaluation metrics (entailment, fact-checking)
Knowledge of pretraining data attribution methods

Key Terms

atomic unit: The smallest verifiable piece of information in a generation, such as a single sentence in a summary or a specific package import in code

Type A error: Hallucination where the correct fact was present in the pretraining data (failed recall)

Type B error: Hallucination where the incorrect fact (or fact taken out of context) was present in the pretraining data

Type C error: Hallucination where neither correct nor incorrect facts were in the pretraining data (pure fabrication)

Response-Based task: A task where the model is expected to provide a helpful answer (e.g., summarization, code generation)

Refusal-Based task: A task where the model is expected to abstain from answering because the premise is false or impossible (e.g., 'historically famous meeting between people who never met')

Utility Score: A metric that rewards factual answers for response-based tasks and rewards silence/refusal for refusal-based tasks

Response Ratio: The proportion of prompts for which the model attempts to generate an answer rather than refusing

WIMBD: What's In My Big Data, a tool/index for searching large pretraining corpora to attribute model generations to training data

intrinsic hallucination: Errors where the model misinterprets information provided in the input context

extrinsic hallucination: Errors where the model introduces new, external information not found in the input context