ANAH: Analytical Annotation of Hallucinations in Large Language Models

📝 Paper Summary

Hallucination Detection, Hallucination Correction, Benchmark Datasets

ANAH is a large-scale bilingual benchmark that provides sentence-by-sentence analytical annotations of LLM hallucinations—including reference retrieval, type judgment, and correction—to train superior generative annotators.

Core Problem

Existing benchmarks for LLM hallucinations are often coarse-grained (labeling entire responses without explanation) or outdated (pre-LLM era), making it difficult to trace, analyze, and mitigate specific errors.

Why it matters:

Hallucinations hinder real-world applications of LLMs, especially in knowledge-intensive tasks, by disseminating misleading information
Coarse labels (hallucination vs. non-hallucination) fail to identify exactly which sentence is wrong or why (contradictory vs. unverifiable), impeding targeted mitigation
Detecting hallucinations in fluent, plausible-sounding LLM responses is increasingly difficult for humans and automated systems alike

Concrete Example: In a generated biography, an LLM might correctly state a person's birth year in sentence 1 but invent a fictional award in sentence 2. Current coarse benchmarks label the whole response 'hallucinated' without distinguishing the correct sentence from the error, whereas ANAH annotates sentence 2 specifically: as a 'Contradictory Hallucination' if the award conflicts with the reference, or as an 'Unverifiable Hallucination' if the reference does not mention it, together with a corrected version.

Key Novelty

Sentence-Level Analytical Annotation Pipeline

Constructs a dataset where every sentence in an LLM's answer is annotated with retrieved reference fragments, a specific hallucination type, and a corrected version
Uses a human-in-the-loop pipeline (GPT-4 preliminary annotation + human verification) to scale high-quality fine-grained data across 700+ topics
Demonstrates the 'snowball effect' quantitatively: the probability of a hallucination increases significantly if previous sentences were also hallucinations

Architecture

The concept of Analytical Hallucination Annotation, comparing coarse-grained labeling (whole response) with fine-grained analytical annotation (sentence-level)

Evaluation Highlights

Generative annotator trained on ANAH achieves 81.01% accuracy, surpassing GPT-3.5 and all open-source models
Generative annotator performance is competitive with GPT-4 (86.97% accuracy) while being smaller and more cost-effective
Quantitatively confirms hallucination accumulation: Hallucination probability jumps from ~15% to ~55% if previous sentences contained hallucinations

Breakthrough Assessment

8/10

Significant contribution to fine-grained hallucination analysis. The dataset enables training smaller models to detect errors with accuracy approaching GPT-4 (81.01% vs. 86.97%), addressing a critical measurement gap.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-based Generative Question Answering hallucination annotation

Inputs: Topic documents, a question, and an LLM-generated answer consisting of multiple sentences

Outputs: For each sentence: Reference fragment, Hallucination Type (No/Contradictory/Unverifiable/No Fact), and Correction
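
The per-sentence output described above can be pictured as a small record type. The following is a minimal sketch, not the dataset's actual schema: the class and field names (SentenceAnnotation, AnnotatedAnswer, reference_fragment, and so on) and the example content are hypothetical, chosen only to mirror the three outputs listed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class HallucinationType(Enum):
    # The four labels named in the problem definition.
    NO_HALLUCINATION = "No"
    CONTRADICTORY = "Contradictory"
    UNVERIFIABLE = "Unverifiable"
    NO_FACT = "No Fact"


@dataclass
class SentenceAnnotation:
    # Hypothetical field names; the released dataset may use different keys.
    sentence: str
    reference_fragment: Optional[str]
    hallucination_type: HallucinationType
    correction: Optional[str] = None


@dataclass
class AnnotatedAnswer:
    topic: str
    question: str
    sentences: List[SentenceAnnotation]

    def hallucinated(self) -> List[SentenceAnnotation]:
        """Return only the sentences labeled Contradictory or Unverifiable."""
        bad = {HallucinationType.CONTRADICTORY, HallucinationType.UNVERIFIABLE}
        return [s for s in self.sentences if s.hallucination_type in bad]


# Invented example content, for illustration only.
example = AnnotatedAnswer(
    topic="(hypothetical person)",
    question="Who is this person?",
    sentences=[
        SentenceAnnotation(
            "She was born in 1950.",
            "She was born in 1950 in ...",
            HallucinationType.NO_HALLUCINATION,
        ),
        SentenceAnnotation(
            "She won award X.",
            "She received award Y ...",
            HallucinationType.CONTRADICTORY,
            correction="She won award Y.",
        ),
    ],
)
print(len(example.hallucinated()))
```

Keeping the label as an enum rather than a free string makes it harder for a misspelled type to slip into downstream statistics.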

Pipeline Flow

Topic Selection & Reference Retrieval
Question Generation & Selection
Answer Generation (High/Low Quality)
Fine-grained Hallucination Annotation

System Modules

Topic Selector (Data Construction)

Select diverse topics (celebrities, events, etc.) based on frequency in Google Ngram Viewer

Model or implementation: N/A (Heuristic/Statistical)

Reference Retriever (Data Construction)

Retrieve reliable documents for selected topics

Model or implementation: InternLM (for judging semantic similarity)

Question Generator (Data Construction)

Create questions fully answerable by the references

Model or implementation: GPT-4 (Selection), GPT-3.5 (Answerability check)

Annotator (Pipeline)

Retrieve fragments, judge hallucination type, and correct errors per sentence

Model or implementation: GPT-4 (Preliminary) + Human Verification

Novel Architectural Elements

Iterative annotation pipeline combining automated retrieval ensemble (BM25 + embeddings) with GPT-4 preliminary labeling and human refinement
Dual-annotator training strategy: training both Generative (full explanation) and Discriminative (classification only) models on the resulting dataset
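
To make the "BM25 + embeddings" retrieval ensemble more concrete, here is a minimal sketch under stated assumptions. The summary does not say how the two scores are fused, so the min-max normalization and the weight alpha are illustrative choices. The bow_cosine function is only a bag-of-words stand-in for a real sentence-embedding model, and the regex tokenizer would not segment Chinese text. The BM25 formula uses the common non-negative IDF variant.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence


def tokenize(text: str) -> List[str]:
    # Simple word tokenizer; an assumption suitable for English only.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document for the query (non-negative IDF variant)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def bow_cosine(a: str, b: str) -> float:
    """Stand-in for an embedding similarity: cosine over bag-of-words counts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def minmax(xs: Sequence[float]) -> List[float]:
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]


def retrieve(sentence: str, fragments: Sequence[str],
             similarity: Callable[[str, str], float] = bow_cosine,
             alpha: float = 0.5) -> int:
    """Return the index of the fragment with the best fused score."""
    lexical = minmax(bm25_scores(sentence, fragments))
    dense = minmax([similarity(sentence, f) for f in fragments])
    fused = [alpha * l + (1 - alpha) * d for l, d in zip(lexical, dense)]
    return max(range(len(fragments)), key=lambda i: fused[i])


fragments = [
    "Marie Curie was born in Warsaw in 1867.",
    "She won the Nobel Prize in Physics in 1903.",
    "The Eiffel Tower is located in Paris.",
]
best = retrieve("Marie Curie received the Nobel Prize in 1903.", fragments)
print(fragments[best])
```

In practice the similarity argument would be replaced by a call into an actual embedding model; the fusion step stays the same.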

Modeling

Base Model: InternLM-7B (specifically used for training the custom annotator)

Training Method: Supervised Fine-Tuning (SFT)

Training Data:

~12k sentence-level annotations
Split into training and test sets; the exact split ratio is not stated in the text (~4.3k answers in total)

Key Hyperparameters:

context_length_chinese: 540 tokens (reference fragment)
context_length_english: 400 tokens (reference fragment)

Compute: Not reported in the paper

Comparison to Prior Work

vs. HaluEval: HaluEval focuses on document-level judgment; ANAH provides sentence-level fine-grained annotation with retrieval and correction
vs. SAFE [not cited in paper]: SAFE uses LLMs to break down claims and verify atomic facts; ANAH integrates this into a training dataset for smaller models rather than just an evaluation method

Limitations

Reliance on proprietary models (GPT-4) for preliminary annotation may inherit biases
Discriminative annotators struggle with class imbalance (Unverifiable/No Fact are rarer)
Generalization to unseen topics is harder than generalization to unseen questions within known topics

Reproducibility

Code: https://github.com/open-compass/ANAH

The dataset, code, and model are publicly available (https://github.com/open-compass/ANAH). Prompts for generation and annotation are provided in the Appendix. Exact training hyperparameters (learning rate, batch size) for the annotator model are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Hallucination Annotation (judging the correctness of sentences)

Benchmarks:

ANAH Test Set (Hallucination Detection/Correction) [New]

Metrics:

Accuracy (Consistency with human annotation)
F1 Score
Statistical methodology: Not explicitly reported in the paper
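
As a reminder of what the two metrics measure on a four-way labeling task, the sketch below computes accuracy and per-class F1 from gold and predicted labels. The summary does not state which F1 averaging the paper uses, so the per-class breakdown and the macro average here are assumptions, and the label lists are invented for illustration.

```python
from typing import Dict, List, Sequence

LABELS: List[str] = ["No", "Contradictory", "Unverifiable", "No Fact"]


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of sentences whose predicted label matches the human label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str] = LABELS) -> Dict[str, float]:
    """One-vs-rest F1 for each label: 2*TP / (2*TP + FP + FN)."""
    result = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        result[label] = 2 * tp / denom if denom else 0.0
    return result


# Invented labels for six sentences.
gold = ["No", "Contradictory", "No", "Unverifiable", "No Fact", "Contradictory"]
pred = ["No", "Contradictory", "Contradictory", "Unverifiable", "No Fact", "No"]

f1 = per_class_f1(gold, pred)
macro_f1 = sum(f1.values()) / len(f1)
print(round(accuracy(gold, pred), 4), f1, macro_f1)
```

Per-class F1 is useful here because rare classes such as Unverifiable and No Fact contribute little to overall accuracy.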

Key Results

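Comparison of different models acting as generative annotators on the ANAH dataset, measuring consistency with human ground truth (accuracy, %). The 86.97 baseline is GPT-4; the other two baselines are not named in this summary.
Benchmark	Metric	Baseline	This Paper	Δ
ANAH Test Set	Accuracy	86.97	81.01	-5.96
ANAH Test Set	Accuracy	72.45	81.01	+8.56
ANAH Test Set	Accuracy	57.38	81.01	+23.63
Analysis of the 'Snowball Effect' of hallucinations (%), comparing the probability that a sentence is hallucinated when previous sentences did not contain hallucinations versus when they did.
Benchmark	Metric	No Prior Hallucination	Prior Hallucination	Δ
ANAH Dataset	Hallucination Probability (English)	14.61	58.51	+43.90
ANAH Dataset	Hallucination Probability (Chinese)	17.20	52.54	+35.34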
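
The snowball-effect rows above are conditional probabilities, and the following sketch shows one way such numbers could be computed from sentence-level labels. It assumes the condition is "any earlier sentence in the same answer was hallucinated" and skips the first sentence of each answer (which has no prefix); the paper's exact conditioning may differ. The function name and toy data are hypothetical.

```python
from typing import Dict, Sequence


def snowball_rates(answers: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Estimate P(sentence hallucinated | prefix clean) and
    P(sentence hallucinated | prefix contains a hallucination).

    Each answer is a sequence of flags, one per sentence, True if hallucinated.
    """
    # Each entry holds [hallucinated_count, total_count].
    counts = {"clean_prefix": [0, 0], "hallucinated_prefix": [0, 0]}
    for flags in answers:
        seen_hallucination = False
        for i, flag in enumerate(flags):
            if i > 0:  # the first sentence has no preceding context
                key = "hallucinated_prefix" if seen_hallucination else "clean_prefix"
                counts[key][0] += int(flag)
                counts[key][1] += 1
            seen_hallucination = seen_hallucination or flag
    return {k: (h / t if t else float("nan")) for k, (h, t) in counts.items()}


# Toy data: three answers with per-sentence hallucination flags.
toy_answers = [
    [False, False, True, True],
    [True, True, False],
    [False, False, False],
]
rates = snowball_rates(toy_answers)
print(rates)
```

A large gap between the two rates, as in the table, is what the summary calls hallucination accumulation.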

Experiment Figures

The four-stage construction pipeline of the ANAH dataset

Sunburst chart of the dataset's topic distribution

Main Takeaways

Generative annotators (predicting explanation + label) handle class imbalance better than discriminative annotators (predicting label only)
Training on ANAH allows a 7B model to surpass GPT-3.5 and approach GPT-4 (81.01% vs. 86.97%) in hallucination detection accuracy
Hallucinations exhibit a strong 'snowball effect' where initial errors propagate to subsequent sentences
Generalization is stronger across new questions for known topics than across entirely new topics, suggesting data scaling should prioritize topic breadth

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and Hallucinations
Familiarity with Retrieval-Augmented Generation (RAG) concepts
Knowledge of fine-tuning and evaluation metrics (Accuracy, F1)

Key Terms

Snowball effect: The phenomenon where hallucinations progressively accumulate; an error in one sentence increases the likelihood of errors in subsequent sentences

RAG: Retrieval-Augmented Generation—providing external documents to an LLM to ground its answers

Generative Annotator: An LLM trained to output text explaining the hallucination (type, reference, correction) rather than just a classification label

Discriminative Annotator: A model trained only to classify the type of hallucination without generating corrections or references
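
The difference between the two annotator types comes down to what the model is trained to emit. The sketch below builds a supervised fine-tuning target for each from the same annotated sentence. The prompt wording, field labels, and function names are hypothetical; the actual prompts are in the paper's Appendix and are not reproduced here.

```python
from typing import Optional


def build_prompt(question: str, sentence: str, reference_document: str) -> str:
    """Shared input for both annotators (hypothetical template)."""
    return (
        f"Question: {question}\n"
        f"Reference document: {reference_document}\n"
        f"Sentence to annotate: {sentence}\n"
    )


def generative_target(reference_fragment: str, hallucination_type: str,
                      correction: Optional[str]) -> str:
    """Generative annotator: emits the reference, the type, and a correction."""
    lines = [
        f"Reference fragment: {reference_fragment}",
        f"Hallucination type: {hallucination_type}",
    ]
    if correction:
        lines.append(f"Correction: {correction}")
    return "\n".join(lines)


def discriminative_target(hallucination_type: str) -> str:
    """Discriminative annotator: emits only the class label."""
    return hallucination_type


# Invented example content.
prompt = build_prompt(
    question="Who is this person?",
    sentence="She won award X.",
    reference_document="She received award Y ...",
)
gen = generative_target("She received award Y ...", "Contradictory", "She won award Y.")
disc = discriminative_target("Contradictory")
print(prompt + gen)
print(disc)
```

Both targets share the same input, so the same annotated sentence can yield one training example for each annotator type.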

CoSENT: A text embedding model used for sentence similarity and retrieval tasks

BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query

Contradictory Hallucination: Information that conflicts with the provided reference source

Unverifiable Hallucination: Information not found in the reference source, making its truth value unknown based on available context

No Fact: Sentences containing no factual claims (e.g., chit-chat or structural transitions)