
Comparing Hallucination Detection Metrics for Multilingual Generation

Haoqiang Kang, Terra Blevins, Luke Zettlemoyer
Paul G. Allen School of Computer Science & Engineering, University of Washington
arXiv (2024)
Factuality Benchmark

📝 Paper Summary

Hallucination Detection, Multilingual Evaluation
This paper evaluates existing English-centric hallucination detection metrics on 19 languages, finding that NLI-based methods outperform lexical metrics but struggle significantly in low-resource settings.
Core Problem
Most hallucination detection research focuses on English, leaving it unclear whether current methods work for multilingual LLMs, especially in low-resource languages where generation quality varies.
Why it matters:
  • Like English-only models, multilingual LLMs are prone to generating text that conflicts with world knowledge.
  • Low-resource languages often lack the labeled datasets required for supervised hallucination detection, necessitating unsupervised or transfer-based metrics.
  • It is unknown whether metrics such as ROUGE or NLI transfer effectively to languages with different linguistic features or less available training data.
Concrete Example: Biographies generated in Ukrainian average only 5.7 tokens per generation. Standard metrics may misread such brevity, or a model's language switching (e.g., producing English text when prompted in Italian), as evidence of either factual accuracy or hallucination.
Key Novelty
Benchmarking Automatic Hallucination Metrics Across 19 Languages
  • Creates a new parallel dataset of biographical generations across 19 diverse languages (high, mid, and low resource) to serve as a testbed.
  • Systematically compares lexical metrics (ROUGE, Entity Overlap) against NLI-based metrics (entailment, contradiction) in both reference-based and pairwise settings.
  • Evaluates the correlation of these automatic metrics with human judgments to determine their reliability in non-English contexts.
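To make the distinction between the reference-based and pairwise settings concrete, here is a minimal Python sketch using a ROUGE-1-style unigram F1. It is an illustration only, not the paper's implementation: the function names (`unigram_f1`, `reference_based_score`, `pairwise_score`) are hypothetical, whitespace tokenization is an assumption that would not suit languages written without spaces, and the pairwise setting is assumed to mean scoring several generations for the same prompt against each other.

```python
from collections import Counter
from itertools import combinations


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 from clipped unigram overlap (whitespace tokens, an assumption)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def reference_based_score(generation: str, reference: str) -> float:
    """Reference-based setting: compare one generation with a trusted reference text."""
    return unigram_f1(generation, reference)


def pairwise_score(generations: list[str]) -> float:
    """Pairwise setting (assumed): mean overlap across all pairs of generations."""
    pairs = list(combinations(generations, 2))
    if not pairs:
        return 0.0
    return sum(unigram_f1(a, b) for a, b in pairs) / len(pairs)
```

An NLI-based metric could in principle reuse the same two wrappers by swapping `unigram_f1` for an entailment or contradiction probability from a multilingual NLI model; that substitution is not shown here.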
Evaluation Highlights
  • NLI-based metrics correlate with human judgments in high-resource languages but fail in low-resource settings (e.g., correlation drops to non-significant levels for Finnish/Persian).
  • Lexical overlap metrics (ROUGE, Named Entity Overlap) show poor agreement with NLI metrics and human judgments for detecting hallucinations.
  • NLI metrics outperform supervised models at detecting verifiable hallucinations but struggle with single-fact errors and unverifiable claims.
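The agreement between an automatic metric and human judgments, as discussed in the highlights above, is typically measured with a correlation coefficient. The sketch below computes a Spearman rank correlation in plain Python. The summary does not say which coefficient the paper uses, so Spearman is an assumption here, the helper names are hypothetical, and the significance test needed to call a correlation "non-significant" is not included.

```python
from math import sqrt


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

In use, `metric_scores` would hold one automatic score per generation and `human_scores` the matching human ratings, computed separately for each language so that high-resource and low-resource results can be compared.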
Breakthrough Assessment
7/10
Important benchmarking study revealing the fragility of current hallucination detection methods in multilingual contexts. While it doesn't propose a new architecture, it exposes critical gaps in low-resource evaluation.