Fine-grained Hallucination Detection and Editing for Language Models

📝 Paper Summary

Hallucination detection
Hallucination editing / correction

Fava improves LM factuality by detecting and editing six distinct types of hallucinations using a retrieval-augmented model trained on carefully curated synthetic data.

Core Problem

Existing hallucination detection systems rely on simplistic binary labels (factual vs. non-factual) or focus only on entities, ignoring diverse error types like unverifiable claims or subjective opinions.

Why it matters:

Different hallucination types require different fix strategies (e.g., editing a specific entity vs. removing an entire unverifiable sentence)
Over 60% of LM-generated hallucinations are non-entity errors (e.g., unverifiable sentences), which current entity-focused systems miss
Deploying LMs in information-seeking contexts is dangerous without precise identification of fabricated or subjective content

Concrete Example: A model might generate 'The Messi Diaries,' a non-existent book. A binary detector just says 'wrong,' but Fava identifies this as an 'Invented' error requiring removal, whereas a date error like 'born in 2000' is an 'Entity' error requiring a specific edit.

Key Novelty

Fava (FAct Verification with Augmentation)

Introduces a 6-category taxonomy for hallucinations (e.g., Entity, Relation, Invented, Subjective) rather than binary labels
Uses a 'generate-and-edit' pipeline where a small LM is trained to detect specific error spans and rewrite them using retrieved evidence
Creates synthetic training data by prompting GPT-4/ChatGPT to inject specific error types into clean text, simulating how LMs hallucinate
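
The taxonomy and the "different types need different fixes" idea can be sketched as a small lookup. This is a minimal illustration, not the authors' code. The summary names five of the six types (Entity, Relation, Invented, Subjective, Unverifiable); the sixth, Contradictory, comes from the paper itself. Only the Entity, Invented, and Unverifiable strategies are stated in the summary; the other three mappings are assumptions.

```python
from enum import Enum


class HallucinationType(Enum):
    ENTITY = "entity"
    RELATION = "relation"
    CONTRADICTORY = "contradictory"  # named in the paper, not in this summary
    INVENTED = "invented"
    SUBJECTIVE = "subjective"
    UNVERIFIABLE = "unverifiable"


# "edit" = replace the wrong span; "remove" = delete the span or sentence.
FIX_STRATEGY = {
    HallucinationType.ENTITY: "edit",          # stated in the summary
    HallucinationType.RELATION: "edit",        # assumption
    HallucinationType.CONTRADICTORY: "remove",  # assumption
    HallucinationType.INVENTED: "remove",      # stated in the summary
    HallucinationType.SUBJECTIVE: "remove",    # assumption
    HallucinationType.UNVERIFIABLE: "remove",  # stated in the summary
}


def fix_strategy(error_type: HallucinationType) -> str:
    """Return the illustrative fix strategy for a hallucination type."""
    return FIX_STRATEGY[error_type]
```

For the concrete example above, `fix_strategy(HallucinationType.INVENTED)` gives `"remove"` for 'The Messi Diaries', while the wrong birth year maps to `"edit"`.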

Architecture

The Fava inference pipeline showing retrieval and generation with tags.

Evaluation Highlights

Fava outperforms ChatGPT (with retrieval) by 23.7 Macro F1 points on fine-grained hallucination detection
Improves the factuality score (FActScore) of Alpaca 13B outputs by 9.3% through automated editing
On binary detection, Fava outperforms the widely used FActScore system and GPT-4 baselines

Breakthrough Assessment

8/10

Strong contribution with a necessary shift from binary to fine-grained detection. The taxonomy and synthetic data generation pipeline are highly practical for the field.

⚙️ Technical Details

Problem Definition

Setting: Open-ended text generation and correction given information-seeking queries

Inputs: Input query x and a corresponding potentially erroneous LM output y

Outputs: Edited output ŷ containing error tags (span-level) and corrections

Pipeline Flow

Retriever (fetches relevant documents)
Editor (detects errors and generates corrections)

System Modules

Retriever

Retrieve relevant documents from Wikipedia to serve as evidence

Model or implementation: Contriever-MSMARCO

Editor (M_edit)

Identify error spans, classify error types, and generate corrected text

Model or implementation: Llama2-Chat 7B (fine-tuned)
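
The two-stage flow (retrieve, then edit) can be sketched as follows. This is a toy under stated assumptions: `toy_retrieve` is a word-overlap ranker standing in for Contriever-MSMARCO, `editor` is any callable standing in for the fine-tuned Llama2-Chat 7B, and the prompt layout in `build_editor_input` is hypothetical rather than the paper's template. Retrieving on the query alone is also an assumption.

```python
from typing import Callable, List


def toy_retrieve(query: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by word overlap with the query (stand-in for a dense retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_editor_input(passages: List[str], lm_output: str) -> str:
    """Concatenate evidence and the text to verify (hypothetical prompt layout)."""
    refs = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"References:\n{refs}\n\nText to verify:\n{lm_output}"


def detect_and_edit(
    query: str,
    lm_output: str,
    corpus: List[str],
    editor: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Stage 1: fetch evidence. Stage 2: let the editor tag and correct the output."""
    passages = toy_retrieve(query, corpus, top_k)
    return editor(build_editor_input(passages, lm_output))
```

Passing an identity function as `editor` is a quick way to inspect what the editor model would receive before plugging in a real model.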

Novel Architectural Elements

Unified detection and editing via tag generation: The model outputs text interleaved with error type tags (e.g., <Entity>...) and corrections in a single pass
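
To make the single-pass tagged output concrete, here is a sketch of how such output could be post-processed into a list of detected errors plus the corrected text. The exact tag grammar is an assumption: each error span is wrapped in a lowercase type tag, with the hallucinated text in `<delete>` and any replacement in `<mark>`. The function name `parse_tagged` is hypothetical.

```python
import re
from typing import List, Tuple

TYPES = ("entity", "relation", "contradictory", "invented", "subjective", "unverifiable")
SPAN_RE = re.compile(
    r"<(?P<type>" + "|".join(TYPES) + r")>(?P<body>.*?)</(?P=type)>", re.DOTALL
)
DELETE_RE = re.compile(r"<delete>(.*?)</delete>", re.DOTALL)
MARK_RE = re.compile(r"<mark>(.*?)</mark>", re.DOTALL)


def parse_tagged(text: str) -> Tuple[List[Tuple[str, str, str]], str]:
    """Return ([(error_type, deleted_text, replacement)], corrected_text)."""
    errors: List[Tuple[str, str, str]] = []

    def _apply(match: "re.Match[str]") -> str:
        body = match.group("body")
        deleted = "".join(DELETE_RE.findall(body))
        replacement = "".join(MARK_RE.findall(body))
        errors.append((match.group("type"), deleted, replacement))
        return replacement  # empty string when the span is simply removed

    corrected = SPAN_RE.sub(_apply, text)
    # Collapse whitespace left behind by removed spans.
    corrected = re.sub(r"\s{2,}", " ", corrected).strip()
    return errors, corrected
```

Detection metrics would read the first return value; editing metrics such as FActScore would be computed on the second.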

Modeling

Base Model: Llama2-Chat 7B

Training Method: Supervised Fine-Tuning (SFT) on synthetic data

Training Data:

35,074 synthetic training instances
Seed passages from Wikipedia (30k) and Natural Questions (5k)
Errors injected by ChatGPT/GPT-4 based on the 6-type taxonomy
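
The shape of one synthetic training instance can be sketched as below. In the paper the errors are written by ChatGPT/GPT-4; here a hand-specified `Injection` list stands in for that step, and the tag format (`<type><delete>…</delete><mark>…</mark></type>`) is an assumption. The point is only the pairing: the corrupted passage becomes the model input and the tagged passage becomes the training target.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Injection:
    error_type: str  # e.g. "entity" or "invented"
    original: str    # span in the clean passage; "" for a purely inserted sentence
    corrupted: str   # erroneous text written by the (stand-in) error injector


def make_training_pair(clean: str, injections: List[Injection]) -> Tuple[str, str]:
    """Build (erroneous_input, tagged_target) from a clean passage.

    Assumes injected spans do not overlap and each original span occurs once.
    """
    noisy, target = clean, clean
    for inj in injections:
        t = inj.error_type
        if inj.original:
            if inj.original not in clean:
                raise ValueError(f"span not found in passage: {inj.original!r}")
            noisy = noisy.replace(inj.original, inj.corrupted, 1)
            tagged = f"<{t}><delete>{inj.corrupted}</delete><mark>{inj.original}</mark></{t}>"
            target = target.replace(inj.original, tagged, 1)
        else:
            # Inserted sentence: nothing to restore, so the target only deletes it.
            noisy = f"{noisy} {inj.corrupted}"
            target = f"{target} <{t}><delete>{inj.corrupted}</delete></{t}>"
    return noisy, target
```

Because the clean passage is known in advance, the gold tags and corrections come for free, which is what makes this kind of data cheap to label compared with human annotation.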

Compute: Not reported in the paper

Comparison to Prior Work

vs. FActScore: Fava provides fine-grained error types (6 categories) and corrections, not just binary verification
vs. ChatGPT/GPT-4 (Prompting): Fava is a fine-tuned local model (7B) that outperforms significantly larger proprietary models on this specific task
vs. RARR: Fava's detailed taxonomy explicitly handles errors that retrieval cannot resolve, such as subjective or invented content [not cited in paper]

Limitations

Relies on retrieval quality; if no evidence is found, it may struggle to distinguish 'unverifiable' from 'invented'
Synthetic training data generation costs money (API fees for GPT-4)
Evaluation relies partially on model-based metrics (FActScore), which have their own biases

Reproducibility

Code: https://fine-grained-hallucination.github.io/

📊 Experiments & Results

Evaluation Setup

Detection and editing of hallucinations in LM-generated text using Wikipedia as a knowledge source

Benchmarks:

FavaBench (Fine-grained hallucination detection and editing) [New]

Metrics:

Macro F1 (Fine-grained detection)
FActScore (Factuality of edited text)
Binary F1 (Binary detection)
Statistical methodology: Inter-annotator agreement calculated using Cohen's kappa scores
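
For readers unfamiliar with the metrics, the sketch below shows the standard definitions of Macro F1 and Cohen's kappa in plain Python. The evaluation unit is an assumption: each item (for example, a sentence) carries a set of error types, and F1 is computed per type and then averaged with equal weight. The paper's exact protocol may differ.

```python
from collections import Counter
from typing import List, Sequence, Set


def _f1(tp: int, fp: int, fn: int) -> float:
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], types: Sequence[str]) -> float:
    """Unweighted mean of per-type F1 over items labelled with sets of error types."""
    scores = []
    for t in types:
        tp = sum(t in g and t in p for g, p in zip(gold, pred))
        fp = sum(t not in g and t in p for g, p in zip(gold, pred))
        fn = sum(t in g and t not in p for g, p in zip(gold, pred))
        scores.append(_f1(tp, fp, fn))
    return sum(scores) / len(scores)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:  # both annotators always use the same single label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Macro averaging matters here because rare types count as much as frequent ones, so a detector that only finds entity errors scores poorly.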

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Fava significantly outperforms baselines on the fine-grained detection task.
FavaBench	Macro F1	30.8	54.5	+23.7
FavaBench	Macro F1	39.8	54.5	+14.7
In binary detection settings, Fava remains superior to specialized systems and strong LLMs.
FavaBench	Binary F1	51.1	62.4	+11.3
Editing capabilities show Fava improves the factuality of various model outputs.
FavaBench (Alpaca 13B outputs)	FActScore Improvement	0.0	9.3	+9.3
FavaBench (ChatGPT outputs)	FActScore Improvement	0.0	3.3	+3.3

Experiment Figures

Distribution of hallucination types across different LMs (ChatGPT, Llama2-7B, Llama2-70B) in FavaBench.

Main Takeaways

Fine-grained detection is necessary: over 60% of hallucinations are not simple entity errors (e.g., unverifiable or subjective statements).
Synthetic data is effective: Training on data where errors are artificially injected by strong models allows a 7B model to outperform GPT-4 on this task.
Retrieval is crucial: Adding retrieval context significantly aids in detection, but the fine-grained taxonomy allows the model to handle cases where retrieval fails (e.g., marking as 'Unverifiable' or 'Invented').

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and hallucination
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of synthetic data generation

Key Terms

Hallucination: A factually incorrect or unverifiable statement generated by an LM, judged against external world knowledge

FavaBench: A new benchmark dataset with ~1k fine-grained human annotations on LM outputs

Contriever: A dense retrieval model used to find relevant documents from Wikipedia

FActScore: A metric that breaks text into atomic facts and verifies them individually against a knowledge base
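
The core arithmetic behind a FActScore-style metric is the fraction of atomic facts that the knowledge source supports. The sketch below shows only that ratio. Both the fact splitter and the verifier are stand-ins: the caller supplies a pre-split list of facts and an `is_supported` callable (hypothetical names), whereas the real metric relies on models for both steps.

```python
from typing import Callable, Iterable


def fact_score(atomic_facts: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged supported; 0.0 for an empty list."""
    facts = list(atomic_facts)
    if not facts:
        return 0.0
    return sum(1 for fact in facts if is_supported(fact)) / len(facts)


# Toy knowledge base with exact-match verification (illustration only).
KNOWLEDGE_BASE = {"Messi is a footballer.", "Messi was born in 1987."}


def in_knowledge_base(fact: str) -> bool:
    return fact in KNOWLEDGE_BASE
```

An editing gain is then the difference between the score of the edited text and the score of the original text, for example scoring `["Messi is a footballer.", "Messi was born in 2000."]` before editing and the corrected facts after.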

Zero-shot: Asking a model to perform a task without providing any examples in the prompt