ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models

📝 Paper Summary

Hallucination detection · Hallucination mitigation · Automated evaluation

ANAH-v2 is an iterative self-training framework that scales up hallucination annotation datasets and improves annotator accuracy by alternating between annotating new data and retraining the annotator model on the result.

Core Problem

Existing hallucination datasets are small and domain-limited due to high human annotation costs, while current automatic annotators (including GPT-4) lack the reliability needed for scalable oversight.

Why it matters:

Manual fine-grained annotation requires intensive labor to verify facts against long documents, making it prohibitively expensive to scale
Unreliable automatic annotators produce inaccurate results, hindering the ability to detect and mitigate hallucinations in real-world LLM applications
Limited dataset diversity prevents models from generalizing to hallucinations across different domains and response styles

Concrete Example: When asked about a specific event, an LLM might generate a plausible but incorrect detail. A standard annotator might miss the error or produce an unsupported judgment of its own. ANAH-v2 mitigates this by breaking the judgment into factual existence, reference extraction, and type determination, then self-training on these explicit steps.

Key Novelty

Iterative Self-Training via Expectation Maximization

Treats the dataset scaling process like an Expectation Maximization (EM) algorithm: the 'Expectation' step estimates ground-truth annotations on new data using the current best model, and the 'Maximization' step trains a better model on this larger dataset
Decomposes the annotation task into three distinct cognitive phases (Factuality Check → Reference Extraction → Hallucination Type) to mirror human verification processes
Uses a self-consistency strategy during the annotation phase to ensure robust labels for the next round of training, filtering out noise from the model's predictions
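The alternation described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `self_train`, `toy_train`, and `toy_annotate` are hypothetical names, and the toy "model" only records how many examples it was trained on.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (input text, annotation label)


def self_train(
    seed: List[Example],
    unlabeled_stages: Iterable[List[str]],
    train: Callable[[List[Example]], object],
    annotate: Callable[[object, str], str],
) -> Tuple[object, List[Example]]:
    """Alternate annotation (E-step) and retraining (M-step) over data stages."""
    labeled = list(seed)
    model = train(labeled)  # initial annotator trained on the seed set
    for stage in unlabeled_stages:
        # E-step: estimate labels for new inputs with the current annotator.
        labeled.extend((x, annotate(model, x)) for x in stage)
        # M-step: retrain the annotator on the enlarged dataset.
        model = train(labeled)
    return model, labeled


# Hypothetical stand-ins so the loop can run end to end.
def toy_train(data: List[Example]) -> dict:
    return {"n_train": len(data)}


def toy_annotate(model: object, x: str) -> str:
    return "No Hallucination"


seed = [("s0", "No Hallucination"), ("s1", "Contradictory")]
stages = [["r0", "r1", "r2"], ["t0"]]
final_model, dataset = self_train(seed, stages, toy_train, toy_annotate)
```

In a real setting `annotate` would wrap sampling plus self-consistency selection, and `train` would be a fine-tuning run; both are left abstract here.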

Architecture

The iterative self-training framework showing the EM cycle (E-step: Annotation Pipeline, M-step: Training) and the three stages of data scaling.

Evaluation Highlights

+8.20 percentage points in accuracy over GPT-4 on the HaluEval benchmark (81.54% vs. 73.34%) using the 7B-parameter ANAH-v2 model
+12 percentage points on the Natural Language Inference (NLI) metric (from 25% to 37%) on HaluEval when the annotator is used to rerank LLM generations
Achieves state-of-the-art zero-shot results on HalluQA (94.44%) and the in-domain ANAH benchmark (89.24%)

Breakthrough Assessment

8/10

Significantly outperforms GPT-4 with a much smaller 7B model for hallucination detection. The iterative self-training framework offers a scalable path for dataset creation without heavy human labeling.

⚙️ Technical Details

Problem Definition

Setting: Fine-grained hallucination annotation where a model identifies non-factual information in generated text relative to a reference document

Inputs: A tuple consisting of a question, a specific sentence to be annotated, and a reference document

Outputs: Annotation consisting of factual existence label, extracted reference points, and hallucination type (No Hallucination, Contradictory, or Unverifiable)
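The input tuple and output annotation described above could be represented as follows. This is a sketch under assumptions: the class and field names are illustrative, and leaving `hallucination_type` as `None` when a sentence contains no fact is a choice made here, not something the summary specifies.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class HallucinationType(Enum):
    NO_HALLUCINATION = "No Hallucination"
    CONTRADICTORY = "Contradictory"
    UNVERIFIABLE = "Unverifiable"


@dataclass
class AnnotationInput:
    question: str
    sentence: str            # the single sentence to be annotated
    reference_document: str


@dataclass
class Annotation:
    has_fact: bool                                   # factual existence label
    reference_points: List[str] = field(default_factory=list)
    # Assumption: no type is assigned when the sentence states no fact.
    hallucination_type: Optional[HallucinationType] = None


example = Annotation(
    has_fact=True,
    reference_points=["The bridge opened in 1932"],
    hallucination_type=HallucinationType.CONTRADICTORY,
)
```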

Pipeline Flow

Stage 1: Seed Training (Train initial annotator on human data)
Stage 2: Response Scaling (Generate annotations for diverse model responses)
Stage 3: Topic Scaling (Generate questions/annotations for new topics)

System Modules

Annotator, inference (Annotation Pipeline)

Generate candidate annotations for unlabeled data

Model or implementation: ANAH-v2 (InternLM2-7B base)

Voter/Selector (Annotation Pipeline)

Select the most consistent annotation from candidates

Model or implementation: Algorithmic (Majority Vote + Cosine Similarity)
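The summary names majority vote and cosine similarity but does not say how they are combined. One plausible reading, sketched below, is to take the majority hallucination type and then keep the candidate whose extracted reference is most similar on average to the other candidates of that type. The bag-of-words cosine and the names `cosine` and `select_annotation` are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple

Candidate = Tuple[str, str]  # (hallucination type, extracted reference text)


def cosine(a: str, b: str) -> float:
    """Cosine similarity over lowercase bag-of-words counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def select_annotation(candidates: List[Candidate]) -> Candidate:
    """Majority vote on type, then pick the most central reference."""
    if not candidates:
        raise ValueError("need at least one candidate annotation")
    majority_type, _ = Counter(t for t, _ in candidates).most_common(1)[0]
    pool = [c for c in candidates if c[0] == majority_type]
    if len(pool) == 1:
        return pool[0]

    def mean_similarity(i: int) -> float:
        others = [cosine(pool[i][1], pool[j][1]) for j in range(len(pool)) if j != i]
        return sum(others) / len(others)

    return pool[max(range(len(pool)), key=mean_similarity)]


samples = [
    ("Contradictory", "the bridge opened in 1932"),
    ("Contradictory", "bridge opened in 1932"),
    ("Contradictory", "the tower is tall"),
    ("Unverifiable", "no reference"),
]
chosen = select_annotation(samples)
```

A production version would more likely compare sentence embeddings than raw token counts; the selection logic would stay the same.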

Trainer

Fine-tune the model on the expanded dataset

Model or implementation: Standard SFT (Supervised Fine-Tuning)

Novel Architectural Elements

Three-phase annotation prompt structure: Factual Existence → Reference Extraction → Hallucination Type
Iterative EM-based loop where the output of the inference pipeline becomes the training data for the next version
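The three-phase structure can be illustrated as a sequential function with an early exit when the sentence states no fact. In ANAH-v2 the phases are carried out by the fine-tuned LLM via prompting; the rule-based `toy_*` helpers below are hypothetical stand-ins that only make the control flow runnable.

```python
from typing import Callable, Dict, List


def annotate_sentence(
    question: str,
    sentence: str,
    document: str,
    has_fact: Callable[[str], bool],
    extract_refs: Callable[[str, str, str], List[str]],
    classify: Callable[[str, List[str]], str],
) -> Dict[str, object]:
    # Phase 1: factual existence.
    if not has_fact(sentence):
        return {"has_fact": False, "references": [], "type": None}
    # Phase 2: reference extraction from the document.
    refs = extract_refs(question, sentence, document)
    # Phase 3: hallucination type, judged against the extracted references.
    return {"has_fact": True, "references": refs, "type": classify(sentence, refs)}


# Toy stand-ins (assumptions for illustration only).
def toy_has_fact(sentence: str) -> bool:
    return any(ch.isdigit() for ch in sentence)


def toy_extract_refs(question: str, sentence: str, document: str) -> List[str]:
    words = set(sentence.lower().rstrip(".").split())
    refs = []
    for ref in document.split(". "):
        if len(words & set(ref.lower().rstrip(".").split())) >= 3:
            refs.append(ref.rstrip("."))
    return refs


def toy_classify(sentence: str, refs: List[str]) -> str:
    if not refs:
        return "Unverifiable"
    return "No Hallucination" if sentence.rstrip(".") in refs else "Contradictory"


doc = "The bridge opened in 1932. It spans the harbour."


def run(sentence: str) -> Dict[str, object]:
    return annotate_sentence(
        "When did the bridge open?", sentence, doc,
        toy_has_fact, toy_extract_refs, toy_classify,
    )
```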

Modeling

Base Model: InternLM2-7B

Training Method: Supervised Fine-Tuning (SFT) within an iterative loop

Objective Functions:

Purpose: Maximize the likelihood of the selected high-quality annotations.

Formally: θ_{t+1} = argmax_θ Σ log P(y* | x; θ)
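Written out per round, the objective pairs with the annotation step roughly as follows. The symbols $K$ (number of sampled candidates), $\mathcal{D}_t$ (the dataset at round $t$), and $\operatorname{Select}$ (the self-consistency selector) are notation assumed here for illustration; the summary gives only the M-step formula.

```latex
\begin{align}
\text{E-step:}\quad & y_i^{*} = \operatorname{Select}\bigl(\{y_i^{(k)}\}_{k=1}^{K}\bigr),
  \qquad y_i^{(k)} \sim P(\cdot \mid x_i;\ \theta_t) \\
\text{M-step:}\quad & \theta_{t+1} = \arg\max_{\theta}
  \sum_{(x_i,\, y_i^{*}) \in \mathcal{D}_t} \log P(y_i^{*} \mid x_i;\ \theta)
\end{align}
```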

Adaptation: Full fine-tuning

Training Data:

Seed: ANAH-v1 (human annotations)
Stage 2: Added ~196k responses from 13 different LLMs (e.g., Llama2, Qwen, Baichuan)
Stage 3: Expanded to ~3k topics (Location, Person, Event, Thing) with auto-generated questions
Total: ~822k annotated sentences

Key Hyperparameters:

model_parameters: 7B

Compute: Not reported in the paper

Comparison to Prior Work

vs. GPT-4: ANAH-v2 is a specialized smaller model (7B) that outperforms the generalist GPT-4 on this specific task via iterative self-training
vs. Standard SFT: ANAH-v2 uses an iterative EM-like process to progressively clean and scale the data, rather than one-off training
vs. Binary Classifiers: Uses a fine-grained, sentence-level analytical approach (Fact → Reference → Type) rather than labeling a whole response as hallucinated or not

Limitations

Reliance on the quality of the initial seed data and the base model's capability to improve via self-consistency
Computational cost of the self-consistency step (multiple inferences per sentence) during data generation
Performance depends on retrieving relevant reference documents; poor retrieval degrades annotation quality

Reproducibility

Code: https://github.com/open-compass/ANAH

Dataset, code, and model are publicly released at https://github.com/open-compass/ANAH. The paper details the prompts used for annotation and data generation in the Appendices.

📊 Experiments & Results

Evaluation Setup

Zero-shot hallucination detection on multiple benchmarks

Benchmarks:

ANAH (in-domain): fine-grained hallucination annotation
HaluEval: hallucination detection in QA
HalluQA: Chinese hallucination benchmark

Metrics:

Accuracy
F1 Score
Natural Language Inference (NLI) metric (for mitigation)
Statistical methodology: Not explicitly reported in the paper

Key Results

Zero-shot performance comparisons on standard hallucination benchmarks show ANAH-v2 surpassing larger models.
Benchmark	Metric	Baseline	This Paper	Δ
HaluEval	Accuracy (%)	73.34	81.54	+8.20
HalluQA	Accuracy (%)	82.47	94.44	+11.97
ANAH (In-domain)	Accuracy (%)	73.40	89.24	+15.84

Hallucination mitigation results using the annotator as a re-ranker.
Benchmark	Metric	Baseline	This Paper	Δ
HaluEval	NLI (Natural Language Inference, %)	25	37	+12
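The Δ column is a difference in percentage points (this paper minus baseline). A quick check of the reported rows, using `Decimal` to avoid floating-point rounding, might look like this; the row values are copied from the table and the variable names are illustrative.

```python
from decimal import Decimal

# (benchmark, metric, baseline, this paper, reported delta)
rows = [
    ("HaluEval", "Accuracy", "73.34", "81.54", "8.20"),
    ("HalluQA", "Accuracy", "82.47", "94.44", "11.97"),
    ("ANAH (In-domain)", "Accuracy", "73.40", "89.24", "15.84"),
    ("HaluEval", "NLI", "25", "37", "12"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Difference in percentage points."""
    return Decimal(ours) - Decimal(baseline)


mismatches = [r for r in rows if delta(r[2], r[3]) != Decimal(r[4])]
```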

Experiment Figures

Detailed breakdown of the E-step (Annotation Pipeline) and M-step.

Sunburst chart showing the diversity of the final dataset topics.

Main Takeaways

Iterative self-training effectively scales dataset size and quality simultaneously without human intervention beyond the seed set
A specialized 7B model can outperform GPT-4 on specific fine-grained annotation tasks when trained on high-quality, self-generated data
Decomposing the annotation task into analytical steps (Fact → Reference → Type) improves accuracy compared to direct classification
The resulting annotator serves as an effective reward model for mitigation strategies like re-ranking
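Re-ranking with the annotator can be sketched as scoring each candidate response by the fraction of its sentences flagged as hallucinated and preferring the lowest rate. The scoring rule, the function names, and the `toy_is_hallucinated` check are assumptions for illustration; the summary does not spell out the exact ranking criterion.

```python
from typing import Callable, List


def hallucination_rate(
    sentences: List[str], is_hallucinated: Callable[[str], bool]
) -> float:
    """Fraction of sentences the annotator flags as hallucinated."""
    if not sentences:
        return 1.0  # assumption: an empty response should never be preferred
    return sum(1 for s in sentences if is_hallucinated(s)) / len(sentences)


def rerank(
    candidates: List[List[str]], is_hallucinated: Callable[[str], bool]
) -> List[List[str]]:
    """Order candidate responses from least to most hallucinated."""
    return sorted(candidates, key=lambda sents: hallucination_rate(sents, is_hallucinated))


# Hypothetical annotator: flags any sentence mentioning the wrong year.
def toy_is_hallucinated(sentence: str) -> bool:
    return "1950" in sentence


candidates = [
    ["The bridge opened in 1950.", "It spans the harbour."],
    ["The bridge opened in 1932.", "It spans the harbour."],
    ["The bridge opened in 1950."],
]
ranked = rerank(candidates, toy_is_hallucinated)
```

Because `sorted` is stable, candidates with equal rates keep their original generation order.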

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination phenomena
Familiarity with Instruction Tuning and Self-Training
Basic knowledge of the Expectation-Maximization (EM) algorithm

Key Terms

Expectation Maximization (EM): An iterative algorithm used here to alternate between estimating labels for unlabeled data (E-step) and updating the model parameters using those labels (M-step)

Self-Consistency: A decoding strategy where the model generates multiple reasoning paths and answers, selecting the most consistent one via majority voting to improve reliability

NLI metric: Natural Language Inference metric, used here to measure whether a generated response is entailed by or contradicts a reference

Hallucination Types: Classifications used in the paper: 'No Hallucination' (faithful), 'Contradictory Hallucination' (conflicts with reference), and 'Unverifiable Hallucination' (not in reference)

InternLM2-7B: The specific open-source foundation model used as the base for the ANAH annotator

Zero-shot inference: Evaluating a model on a task without providing any specific examples of that task in the immediate input prompt