Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

📝 Paper Summary

Machine Translation, Hallucination, Multilingual Evaluation

HalloMTBench is a large-scale, human-verified benchmark for diagnosing LLM translation hallucinations, built on a new taxonomy separating instruction-following failures from source-content deviations.

Core Problem

Existing machine translation benchmarks are obsolete because modern LLMs achieve near-zero hallucination rates on them, masking nuanced failures like instruction disobedience or subtle source deviations.

Why it matters:

Current benchmarks rely on older NMT failure modes, failing to capture LLM-specific issues like 'Instruction Detachment' where models ignore task constraints
High-stakes deployment requires understanding reliability across diverse languages, but current tests often focus on limited high-resource pairs or fail to trigger failures in large proprietary models

Concrete Example: When translating English to Portuguese, an LLM might generate fluent text in Spanish instead (Incorrect Target Language) or hallucinate a continuation of the source text (Extraneous Addition), failures that traditional NMT metrics often miss or mischaracterize.

Key Novelty

HalloMTBench & Dual-Class Taxonomy

Introduces a taxonomy splitting hallucinations into 'Instruction Detachment' (ignoring the task, e.g., wrong language) and 'Source Detachment' (ignoring the content, e.g., fabrication), tailored for instruction-following models
Curates 5,435 human-verified hallucination instances across 11 languages using a 'Generate → LLM-Jury Filter → Expert Verify' pipeline to ensure high difficulty and quality

Architecture

Conceptual flow of the benchmark creation process (there is no model architecture diagram, since this is a benchmark paper)

Evaluation Highlights

93.68% to 100% agreement between the ensemble LLM-judge filtering method and human labels, ensuring high-quality data selection
Identified a 20-fold disparity in hallucination frequency across languages: Chinese had only 51 instances while Portuguese had 1,025, revealing severe language-specific imbalances
Qwen3-Max showed a 68.8% tendency towards 'Extraneous Addition', whereas GPT-4o-mini failed primarily via 'Incorrect Target Language' (69.2%), suggesting that models have distinct failure fingerprints

Breakthrough Assessment

8/10

Significant contribution to MT evaluation by addressing the obsolescence of NMT benchmarks. The taxonomy is practical for LLMs, and the dataset scale (5k+ human-verified errors) is substantial.

⚙️ Technical Details

Problem Definition

Setting: Machine Translation as an instruction-following task

Inputs: Source text in English and a translation instruction (e.g., 'Translate English to [Target Language]')

Outputs: Translated text in the target language

Pipeline Flow

Candidate Generation: 4 LLMs generate translations for 4M source sentences
Automated Filtering: Ensemble of 3 LLM Judges votes on hallucinations
Expert Validation: Linguists verify and categorize candidates
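The automated filtering step can be sketched as a simple vote over judge verdicts. This is a minimal illustration, not the authors' pipeline: the summary says the three judges vote but does not state the threshold, so the 2-of-3 majority rule is an assumption, and the three toy judges below are hypothetical heuristics standing in for real LLM calls.

```python
from typing import Callable, Dict, List

# A judge takes (source, translation, target_lang) and returns True when it
# believes the translation is a hallucination. In the real pipeline each judge
# would be an LLM call; the heuristics here are hypothetical stand-ins.
Judge = Callable[[str, str, str], bool]


def copies_source(source: str, translation: str, target_lang: str) -> bool:
    """Flags output that is identical to the source (left untranslated)."""
    return translation.strip() == source.strip()


def far_too_long(source: str, translation: str, target_lang: str) -> bool:
    """Flags output more than three times longer than the source."""
    return len(translation) > 3 * len(source)


def no_new_words(source: str, translation: str, target_lang: str) -> bool:
    """Flags output whose words all already appear in the source."""
    return set(translation.lower().split()) <= set(source.lower().split())


def jury_flags(source: str, translation: str, target_lang: str,
               judges: List[Judge], min_votes: int = 2) -> bool:
    """Return True when at least `min_votes` judges flag the candidate.

    min_votes=2 (majority of three) is an assumed threshold.
    """
    votes = sum(1 for judge in judges if judge(source, translation, target_lang))
    return votes >= min_votes


def filter_candidates(candidates: List[Dict[str, str]], judges: List[Judge],
                      min_votes: int = 2) -> List[Dict[str, str]]:
    """Keep only candidates the jury flags, i.e. those sent on to human review."""
    return [
        c for c in candidates
        if jury_flags(c["source"], c["translation"], c["target_lang"],
                      judges, min_votes)
    ]
```

Only the flagged candidates would reach the linguists, which keeps expert effort focused on likely hallucinations.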

System Modules

Generator Pool

Generate candidate translations from source text to expose potential failures

Model or implementation: GPT-4o-Mini, Gemini-2.0-Flash, Claude-3.5-Sonnet, Qwen3-Max

Ensemble Jury

Filter out valid translations so that only potential hallucinations are passed on for human review

Model or implementation: Ensemble of GPT-4o, Claude-3.7-Sonnet, Gemini-2.5-flash

Human Annotators

Verify hallucinations and assign fine-grained taxonomy labels

Model or implementation: 5 professional linguists

Novel Architectural Elements

Taxonomy-guided filtering pipeline: Specifically separates instruction failures (wrong language) from content failures (fabrication) during the curation process

Modeling

Base Model: Evaluated 17 LLMs (including GPT-4o, Claude-3.5, Llama, and Qwen) on the curated benchmark

Comparison to Prior Work

vs. HalOmi: Targets LLM-specific failures (instruction detachment) rather than just NMT failures; uses human verification rather than just model generation
vs. HalluciGen: Uses naturally occurring hallucinations from diverse LLMs rather than induced/seeded errors via instructions
vs. Guerreiro et al.: Expands taxonomy to include instruction-following constraints critical for LLMs (e.g., wrong language output)

Limitations

Evaluation covers only 11 language pairs (English-to-X), leaving many low-resource languages untested
Severe class imbalance: Chinese subset has only 51 examples vs. 1,025 for Portuguese
Taxonomy may not cover all nuanced 'reasoning' failures unique to newer reasoning models (e.g., o1)

Reproducibility

Data: https://huggingface.co/collections/AIDC-AI/marco-mt

The data is publicly available at the link above. The source corpus is derived from the open WMT24 and HPLT datasets. Code for the judge pipeline is not explicitly linked, but the methodology is described in detail.

📊 Experiments & Results

Evaluation Setup

Translation of English sentences into 11 target languages, evaluated for hallucination rate

Benchmarks:

HalloMTBench (Machine Translation Hallucination Detection) [New]

Metrics:

Hallucination Rate (Implicitly measured by count/distribution)
Taxonomy Distribution (Instruction vs. Source Detachment rates)
Statistical methodology: Cohen's kappa is reported for inter-annotator agreement (minimum 0.8)
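For reference, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators. The summary does not say how agreement was aggregated across the five linguists, so the pairwise setup and the label values are assumptions for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set")

    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: "H" = hallucination, "N" = not a hallucination.
annotator_1 = ["H", "H", "N", "N"]
annotator_2 = ["H", "H", "N", "H"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5 for this toy example
```

A value of 0.5 in this toy case would fall well short of the 0.8 minimum reported above.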

Key Results

Analysis of the benchmark composition reveals significant variance in difficulty across languages and model-specific failure modes.

Benchmark	Metric	Baseline	This Paper	Δ
HalloMTBench	Instance Count (Portuguese)	Not reported in the paper	1,025	N/A
HalloMTBench	Instance Count (Chinese)	Not reported in the paper	51	N/A
HalloMTBench	Extraneous Addition Rate (Qwen3-Max)	Not reported in the paper	68.8%	N/A
HalloMTBench	Incorrect Language Rate (GPT-4o-mini)	Not reported in the paper	69.2%	N/A
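The arithmetic behind the roughly 20-fold language disparity, and the kind of per-model label distribution that yields figures such as 68.8% or 69.2%, can be sketched as follows. The two instance counts come from the results above; the `label_distribution` helper and the label list passed to it are hypothetical and only show the calculation.

```python
from collections import Counter
from typing import Dict, List

# Instance counts as reported for the benchmark.
instance_counts = {"Portuguese": 1025, "Chinese": 51}

# Ratio between the largest and smallest language subsets (about 20.1).
disparity = instance_counts["Portuguese"] / instance_counts["Chinese"]


def label_distribution(labels: List[str]) -> Dict[str, float]:
    """Share of each hallucination label among one model's benchmark instances."""
    total = len(labels)
    if total == 0:
        return {}
    return {label: count / total for label, count in Counter(labels).items()}


# Hypothetical labels for a single model, for illustration only.
toy_labels = [
    "Extraneous Addition",
    "Extraneous Addition",
    "Incorrect Target Language",
    "Extraneous Addition",
]
toy_fingerprint = label_distribution(toy_labels)
```

Comparing such distributions across models is one way to read the "failure fingerprint" claim: the dominant label differs from model to model.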

Experiment Figures

Distribution of hallucination instances across the 11 target languages

Breakdown of hallucination types (Taxonomy) per model

Main Takeaways

Modern LLMs fail in distinct ways: in this benchmark, the smaller GPT-4o-mini mostly outputs the wrong language, while the larger Qwen3-Max mostly hallucinates extra content.
Hallucination rates follow a U-shaped curve with respect to source length: models fail more on very short (context-poor) and very long (context-heavy) inputs.
RL-tuned models show heightened 'confusion' (language mixing), suggesting that RL alignment might destabilize multilingual boundaries.
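One way to check a length effect like the U-shaped curve is to bucket inputs by source length and compute the hallucination rate per bucket. The sketch below assumes access to per-sentence outcomes over the full candidate pool (not only the curated hallucinations); the bucket edges, the length unit, and the toy records are all assumptions for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def hallucination_rate_by_length(
    records: Sequence[Tuple[int, bool]],
    edges: Sequence[int],
) -> List[Optional[float]]:
    """Hallucination rate per source-length bucket.

    records: (source_length, is_hallucination) pairs.
    edges:   sorted bucket boundaries; lengths below edges[0] fall in bucket 0,
             lengths >= edges[-1] fall in the last bucket.
    Returns one rate per bucket, or None for an empty bucket.
    """
    totals = [0] * (len(edges) + 1)
    hits = [0] * (len(edges) + 1)
    for length, is_hallucination in records:
        bucket = bisect_right(edges, length)
        totals[bucket] += 1
        hits[bucket] += int(is_hallucination)
    return [h / t if t else None for h, t in zip(hits, totals)]


# Toy records (length, hallucinated?) chosen to show a U-shaped pattern.
toy_records = [(3, True), (4, False), (20, False), (25, False), (200, True), (150, True)]
toy_rates = hallucination_rate_by_length(toy_records, edges=[10, 100])
```

In the toy data the short and long buckets have higher rates than the middle one, which is the pattern the takeaway describes.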

📚 Prerequisite Knowledge

Prerequisites

Understanding of Neural Machine Translation (NMT) vs. LLM-based translation
Familiarity with hallucination types (oscillatory, detached)
Basic knowledge of LLM-as-a-judge evaluation methods

Key Terms

Instruction Detachment: A hallucination category where the model fails to adhere to task constraints, such as translating into the wrong language or repeating the source untranslated

Source Detachment: A hallucination category where the model generates fluent text that deviates from the source content, including fabrications or repetitions

Oscillatory Hallucinations: Meaningless repetition of words or phrases, common in NMT and preserved in the new taxonomy under 'Repetition'
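The two-level taxonomy described in these terms can be represented as a small mapping from fine-grained types to their parent class. This is an illustrative sketch: the identifier names are hypothetical, and only subtypes explicitly named in the definitions above are included, so it should not be read as the paper's complete label set.

```python
from enum import Enum


class HallucinationClass(Enum):
    """Top-level classes of the dual-class taxonomy."""
    INSTRUCTION_DETACHMENT = "instruction_detachment"  # task constraints ignored
    SOURCE_DETACHMENT = "source_detachment"            # source content ignored


class HallucinationType(Enum):
    """Fine-grained types mentioned in the definitions above (not exhaustive)."""
    INCORRECT_TARGET_LANGUAGE = "incorrect_target_language"
    UNTRANSLATED_SOURCE = "untranslated_source"
    FABRICATION = "fabrication"
    REPETITION = "repetition"  # covers the older 'oscillatory' NMT failure mode


# Parent class of each fine-grained type, following the definitions above.
PARENT_CLASS = {
    HallucinationType.INCORRECT_TARGET_LANGUAGE: HallucinationClass.INSTRUCTION_DETACHMENT,
    HallucinationType.UNTRANSLATED_SOURCE: HallucinationClass.INSTRUCTION_DETACHMENT,
    HallucinationType.FABRICATION: HallucinationClass.SOURCE_DETACHMENT,
    HallucinationType.REPETITION: HallucinationClass.SOURCE_DETACHMENT,
}


def parent_class(hallucination_type: HallucinationType) -> HallucinationClass:
    """Look up the top-level class for a fine-grained hallucination type."""
    return PARENT_CLASS[hallucination_type]
```

Keeping the parent lookup separate from the fine-grained labels lets annotators assign a specific type while analyses aggregate at either level.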

LLM Judges: Using strong LLMs to evaluate the output of other models, here used as an automated filter before human verification

WMT: Conference on Machine Translation, a major academic venue providing standard datasets for translation tasks

HPLT: High Performance Language Technologies, a project providing large-scale multilingual datasets