Comparing Hallucination Detection Metrics for Multilingual Generation

📝 Paper Summary

Hallucination Detection, Multilingual Evaluation

This paper evaluates existing English-centric hallucination detection metrics on 19 languages, finding that NLI-based methods outperform lexical metrics but struggle significantly in low-resource settings.

Core Problem

Most hallucination detection research focuses on English, leaving it unclear whether current methods work for multilingual LLMs, especially in low-resource languages where generation quality varies.

Why it matters:

Like English-only models, multilingual LLMs are prone to generating text that conflicts with world knowledge.
Low-resource languages often lack the labeled datasets required for supervised hallucination detection, necessitating unsupervised or transfer-based metrics.
It is unknown whether metrics such as ROUGE or NLI-based scores transfer effectively to languages with different linguistic features or less available training data.

Concrete Example: Biographies generated in Ukrainian average only 5.7 tokens each, and the model sometimes switches language (e.g., producing English text when prompted in Italian). Standard metrics may misread such brevity or language switching as evidence of either factual accuracy or hallucination.

Key Novelty

Benchmarking Automatic Hallucination Metrics Across 19 Languages

Creates a new parallel dataset of biographical generations across 19 diverse languages (high, mid, and low resource) to serve as a testbed.
Systematically compares lexical metrics (ROUGE, Entity Overlap) against NLI-based metrics (entailment, contradiction) in both reference-based and pairwise settings.
Evaluates the correlation of these automatic metrics with human judgments to determine their reliability in non-English contexts.

Evaluation Highlights

NLI-based metrics correlate with human judgments in high-resource languages but fail in low-resource settings (e.g., correlation drops to non-significant levels for Finnish/Persian).
Lexical overlap metrics (ROUGE, Named Entity Overlap) show poor agreement with NLI metrics and human judgments for detecting hallucinations.
NLI metrics outperform supervised models at detecting verifiable hallucinations but struggle with single-fact errors and unverifiable claims.

Breakthrough Assessment

7/10

Important benchmarking study revealing the fragility of current hallucination detection methods in multilingual contexts. While it doesn't propose a new architecture, it exposes critical gaps in low-resource evaluation.

⚙️ Technical Details

Problem Definition

Setting: Detecting factual inconsistencies in long-form text generation (biographies) across multiple languages.

Inputs: A generated biography text T_gen and a reference text T_ref (Wikipedia article) or a set of other generated samples.

Outputs: A scalar score indicating the degree of hallucination or factuality.

Pipeline Flow

Generation: Produce biographies using BLOOMZ-mt
Quality Control: Filter incorrect languages using langdetect
Metric Calculation: Apply ROUGE, NEO, and NLI metrics
Correlation Analysis: Compare metrics against each other and human labels

System Modules

Generator

Generate biographical text for 500 entities across 19 languages

Model or implementation: BLOOMZ-mt

Language Filter

Remove generations that are in the wrong language

Model or implementation: langdetect (Python library)
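
The language-filtering step can be sketched as follows. This is a minimal illustration, not the paper's code: the detector is passed in as a function so the example runs without extra dependencies, and `toy_detect` is a hypothetical stand-in. In practice one could presumably pass `langdetect.detect`, which returns a language code string for a given text.

```python
from typing import Callable, Iterable, List


def filter_by_language(
    generations: Iterable[str],
    target_lang: str,
    detect: Callable[[str], str],
) -> List[str]:
    """Keep only generations whose detected language matches target_lang."""
    kept = []
    for text in generations:
        if not text.strip():
            continue  # empty output carries no language signal
        try:
            lang = detect(text)
        except Exception:
            # Some detectors raise on text with no usable features; drop those.
            continue
        if lang == target_lang:
            kept.append(text)
    return kept


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector: calls text English if it contains 'the'."""
    return "en" if " the " in f" {text.lower()} " else "it"


samples = [
    "Dante è stato un poeta italiano.",
    "Dante was the author of the Comedy.",
    "  ",
]
italian_only = filter_by_language(samples, "it", toy_detect)
```

Injecting the detector keeps the filter independent of any one library, which also makes it easy to swap in a stronger language identifier for low-resource languages.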

NLI Scorer

Compute entailment/contradiction scores between generated sentences and reference sentences

Model or implementation: SummaC (Zero-shot NLI)

Novel Architectural Elements

Application of SummaC aggregation strategy (max score across reference sentences) specifically for multilingual biography hallucination detection
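
A rough sketch of the max-over-reference aggregation described above. It assumes that per-sentence maxima are then averaged over the generated sentences and that DIFF is computed from the aggregated ENT and CON values; both are assumptions rather than details confirmed by this summary. `toy_nli` is a hypothetical word-overlap stand-in for a real NLI model.

```python
from statistics import mean
from typing import Callable, Dict, List

NliProbs = Dict[str, float]  # keys: "entailment", "contradiction", "neutral"


def aggregate_max(
    generated: List[str],
    reference: List[str],
    nli: Callable[[str, str], NliProbs],
    label: str,
) -> float:
    """For each generated sentence, take the max `label` probability over all
    reference sentences (premises), then average over generated sentences."""
    per_sentence = [
        max(nli(premise, hypothesis)[label] for premise in reference)
        for hypothesis in generated
    ]
    return mean(per_sentence)


def toy_nli(premise: str, hypothesis: str) -> NliProbs:
    """Hypothetical stand-in for an NLI model: a word-overlap heuristic."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    overlap = len(p & h) / len(h)
    return {"entailment": overlap, "contradiction": 0.0, "neutral": 1.0 - overlap}


reference = ["marie curie was born in warsaw", "she won two nobel prizes"]
generated = ["marie curie was born in warsaw", "she was a painter"]

ent = aggregate_max(generated, reference, toy_nli, "entailment")
con = aggregate_max(generated, reference, toy_nli, "contradiction")
diff = ent - con  # DIFF = Entailment - Contradiction
```

Taking the maximum means a generated sentence only needs support from one reference sentence, which suits long Wikipedia references where most sentences are irrelevant to any given claim.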

Modeling

Base Model: BLOOMZ-mt (for text generation)

Training Method: The paper focuses on evaluating existing metrics, not training new models. The generator (BLOOMZ-mt) was pre-trained/fine-tuned by prior work.

Key Hyperparameters:

top_p: 0.9
num_generations_per_prompt: 5
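
To make the `top_p` setting concrete, here is a small sketch of nucleus (top-p) filtering over a toy next-token distribution. The probabilities are made up for illustration, and this is not the decoding code used in the paper.

```python
import random
from typing import Dict


def top_p_filter(probs: Dict[str, float], top_p: float = 0.9) -> Dict[str, float]:
    """Keep the smallest set of highest-probability tokens whose cumulative
    mass reaches top_p, then renormalise the kept probabilities to sum to 1."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Toy distribution (illustrative values only).
next_token_probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.1, "Rome": 0.05}
nucleus = top_p_filter(next_token_probs, top_p=0.9)

# Draw one token for each of 5 hypothetical generations from the nucleus.
rng = random.Random(0)
tokens, weights = zip(*nucleus.items())
draws = rng.choices(tokens, weights=weights, k=5)
```

With `top_p = 0.9` the lowest-probability tail ("Rome" here) is never sampled, while the remaining randomness is what makes the 5 generations per prompt differ from one another.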

Compute: Not reported in the paper

Comparison to Prior Work

vs. ROUGE/NEO: NLI metrics correlate better with human judgments of factuality than lexical overlap.
vs. Supervised Classifiers: NLI metrics (zero-shot) often outperform supervised hallucination detectors on verifiable facts.
vs. Monolingual Evaluation: Extends evaluation to 19 languages, revealing that English-centric metrics fail in low-resource settings.

Limitations

NLI metrics fail to detect single-fact hallucinations effectively.
Pairwise and reference-based metrics show significantly diminished correlation in low-resource languages.
Lexical overlap metrics are generally ineffective for hallucination detection.
NLI metrics perform poorly on unverifiable errors (hallucinations that cannot be checked against the reference).

📊 Experiments & Results

Evaluation Setup

Biography generation for 500 entities across 19 languages.

Benchmarks:

Multilingual Biography Dataset (Open-ended text generation) [New]

Metrics:

ROUGE-1 / ROUGE-L
Named Entity Overlap (Precision, Recall, F1)
NLI-based scores: ENT (Entailment), CON (Contradiction), DIFF (Entailment - Contradiction), UNV (Unverifiable)
Pearson Correlation (between metrics and human judgments)

Statistical methodology: Pearson Correlation Coefficients reported for metric agreement.
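
The two lexical metrics listed above can be sketched in a few lines. This is a simplified illustration assuming whitespace tokenisation and pre-extracted entity sets; real ROUGE implementations add stemming and other normalisation, and entity extraction would need a language-specific NER model.

```python
from collections import Counter
from typing import Dict, Iterable


def prf(overlap: float, n_generated: int, n_reference: int) -> Dict[str, float]:
    """Precision, recall, and F1 from an overlap count and the two sizes."""
    precision = overlap / n_generated if n_generated else 0.0
    recall = overlap / n_reference if n_reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_1(generated: str, reference: str) -> Dict[str, float]:
    """Unigram overlap with clipped counts (whitespace tokens, lowercased)."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # min count per shared token
    return prf(overlap, sum(gen.values()), sum(ref.values()))


def entity_overlap(gen_entities: Iterable[str], ref_entities: Iterable[str]) -> Dict[str, float]:
    """Named Entity Overlap over sets of already-extracted entity strings."""
    gen, ref = set(gen_entities), set(ref_entities)
    return prf(len(gen & ref), len(gen), len(ref))


r1 = rouge_1("curie was born in paris", "curie was born in warsaw in 1867")
neo = entity_overlap({"Curie", "Paris"}, {"Curie", "Warsaw", "1867"})
```

In this toy case the generated sentence gets the birthplace wrong yet still reaches a ROUGE-1 precision of 0.8, which illustrates why surface overlap can be a weak signal of factuality.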

Key Results

Correlation analysis shows NLI metrics align better with human judgment than lexical metrics, but effectiveness varies by resource level.

Benchmark	Metric	Baseline	This Paper	Δ
Multilingual Biography Dataset	Pearson Correlation (Pairwise vs Ref NLI)	Not statistically significant	0.35 to 0.56	Significant positive correlation
Multilingual Biography Dataset	UNV Score Range	0.15 - 0.25	> 0.25	Higher

Experiment Figures

Heatmap of Pearson correlations between different automatic metrics (ROUGE, NEO, NLI) for English.

Main Takeaways

Lexical metrics (ROUGE, Entity Overlap) do not correlate with NLI metrics or human judgments, making them poor proxies for factuality.
NLI-based metrics (especially DIFF) remain relatively stable across languages but perform best on high-resource languages.
Pairwise consistency (checking agreement between model generations) correlates well with reference-based scoring in high-resource languages but fails in low-resource settings.
Generation quality varies drastically: generations in high-resource languages are long and accurate, while those in low-resource languages (e.g., Ukrainian) are very short (avg 5.7 tokens) or in the wrong language.
NLI metrics are good at detecting sentence-level hallucinations but struggle with 'atomic' single-fact errors.
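
The correlation analysis behind these takeaways relies on the Pearson coefficient, which can be sketched in plain Python as below. The score lists are made-up illustrative numbers, not values from the paper. For significance testing one would typically also want a p-value, which `scipy.stats.pearsonr` provides.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers, not values from the paper.
diff_scores = [0.9, 0.7, 0.4, 0.1]   # automatic metric, one value per biography
human_scores = [1.0, 0.8, 0.5, 0.0]  # human factuality judgment per biography
r = pearson_r(diff_scores, human_scores)
```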

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Model (LLM) hallucination problems.
Understanding of automatic evaluation metrics (ROUGE, NLI).
Basic knowledge of zero-shot classification.

Key Terms

NLI: Natural Language Inference, a task that determines whether a hypothesis is entailed by, contradicts, or is neutral with respect to a premise.

ROUGE: Recall-Oriented Understudy for Gisting Evaluation, a family of metrics measuring n-gram (ROUGE-1) or longest-common-subsequence (ROUGE-L) overlap between generated text and a reference text.

SummaC: A zero-shot, NLI-based consistency scoring system originally designed for summarization, used here to detect hallucinations by checking sentence-level entailment.

Reference-based metrics: Evaluation methods that compare generated text against a gold-standard ground truth (e.g., Wikipedia article).

Pairwise metrics: Evaluation methods that compare a generated text against other samples generated by the same model to check for self-consistency.
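
A minimal sketch of the pairwise setting, assuming each sample is scored against every other sample for the same prompt and the results are averaged. The exact aggregation used in the paper is not stated in this summary, and `toy_score` is a hypothetical stand-in for an NLI-based scorer.

```python
from statistics import mean
from typing import Callable, List


def pairwise_consistency(
    samples: List[str],
    score: Callable[[str, str], float],
) -> List[float]:
    """For each sample, average its score against every other sample for the
    same prompt (the other samples act as pseudo-references)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    results = []
    for i, target in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        results.append(mean([score(other, target) for other in others]))
    return results


def toy_score(reference: str, generated: str) -> float:
    """Hypothetical stand-in scorer: share of generated words found in reference."""
    ref, gen = set(reference.lower().split()), set(generated.lower().split())
    return len(ref & gen) / len(gen)


samples = [
    "curie was born in warsaw",
    "curie was born in warsaw",
    "curie was born in paris",
]
scores = pairwise_consistency(samples, toy_score)
```

The sample that disagrees with the others receives the lowest score, which is the intuition behind using self-consistency as a hallucination signal when no reference text is available.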

Atomic facts: The smallest self-contained units of information in a sentence, each conveying a single claim (e.g., 'Obama was born in Hawaii in 1961' contains two atomic facts: the birthplace and the birth year).

Verifiable hallucination: Generated content that can be explicitly proven true or false based on the reference text.

Unverifiable hallucination: Generated content that is not present in the reference text, making it impossible to prove or disprove using that source alone.

High-resource languages: Languages with abundant training data available (e.g., English, Chinese, French).

Low-resource languages: Languages with limited training data available (e.g., Ukrainian, Persian).