XRAG: Cross-lingual Retrieval-Augmented Generation

📝 Paper Summary

Cross-lingual RAG Benchmark Construction

XRAG is a benchmark for evaluating LLMs in cross-lingual RAG settings, revealing significant failures in response language correctness and cross-lingual reasoning even for advanced models like GPT-4o.

Core Problem

Existing cross-lingual QA datasets contain simple questions often answerable without retrieval, failing to evaluate complex reasoning in realistic RAG scenarios where user and document languages differ.

Why it matters:

Real-world RAG systems must serve non-English users using English knowledge bases (monolingual retrieval) or a mix of local and English sources (multilingual retrieval)
Current benchmarks like XOR QA allow models to answer nearly 50% of questions using parametric knowledge alone, masking true retrieval-reasoning failures
Absence of challenging benchmarks prevents understanding of specific cross-lingual failure modes, such as responding in the wrong language or failing to integrate multi-language evidence

Concrete Example: A German user asks a question about the 2024 Olympics. The relevant information is split between a German article and an English article. Current LLMs might fail to combine these facts or erroneously answer in English instead of German.

Key Novelty

XRAG Benchmark Construction Pipeline

Generates questions from recent news (post-knowledge-cutoff) to ensure models cannot answer from parametric memory
Uses a multi-step LLM workflow to create 'cross-document' questions that require integrating information from two distinct articles (supporting evidence) while ignoring distractors
Covers two specific settings: Monolingual Retrieval (non-English question, English docs) and Multilingual Retrieval (non-English question, mixed-language docs)

Architecture

The data construction pipeline for generating cross-document QA pairs.

Evaluation Highlights

In Monolingual Retrieval, GPT-4o achieves only 55.5% accuracy, significantly lower than the 85% human upper bound
Models struggle with Response Language Correctness (RLC): Mistral-large answers in the wrong language (English instead of user language) in 61.1% of cases
In Multilingual Retrieval, translating supporting documents to English improves GPT-4o accuracy by ~9.5 percentage points, indicating the core bottleneck is cross-lingual reasoning, not generation

Breakthrough Assessment

8/10

Identifies a critical gap in cross-lingual RAG evaluation (parametric leakage in old benchmarks) and uncovers a major, previously under-reported failure mode (Response Language Correctness). High utility for future RAG research.

⚙️ Technical Details

Problem Definition

Setting: Cross-lingual Retrieval-Augmented Generation (RAG) with imperfect retrieval

Inputs: Question q in language L, set of documents D (containing supporting D+ and distracting D- articles)

Outputs: Answer a in language L

Pipeline Flow

Input: Question (Non-English) + Documents (English or Mixed)
LLM Generation (Answer in User Language)
Evaluation (Judge Panel + Language Detector)
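
The three-stage flow above can be sketched as a small evaluation harness. This is a minimal illustration, not the paper's code: the `Example` fields, the prompt wording, the majority-vote aggregation over the judge panel, and the `generate`, `judges`, and `detect_language` callables are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    question: str         # q, written in the user's language L
    language: str         # language code for L, e.g. "de" (hypothetical field)
    documents: List[str]  # D: supporting (D+) and distracting (D-) articles
    reference: str        # gold answer


def build_prompt(ex: Example) -> str:
    """Number the documents and append the question (assumed prompt wording)."""
    docs = "\n\n".join(f"[{i + 1}] {d}" for i, d in enumerate(ex.documents))
    return (
        f"Documents:\n{docs}\n\n"
        f"Question: {ex.question}\n"
        "Answer in the same language as the question."
    )


def evaluate(
    ex: Example,
    generate: Callable[[str], str],
    judges: List[Callable[[str, str, str], bool]],
    detect_language: Callable[[str], str],
) -> dict:
    """Run generation, then score correctness and response language."""
    answer = generate(build_prompt(ex))
    votes = [judge(ex.question, ex.reference, answer) for judge in judges]
    return {
        "answer": answer,
        # Assumption: the panel verdict is a strict majority of judge votes.
        "correct": sum(votes) > len(votes) / 2,
        "language_ok": detect_language(answer) == ex.language,
    }
```

Passing the model, judges, and language detector as callables keeps the sketch independent of any particular API; each can be replaced by a real client without changing `evaluate`.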

System Modules

Generator

Generate answer based on retrieved context

Model or implementation: Evaluated Models (GPT-4o, Claude 3.5 Sonnet, Mistral-large, Command-R+, Nova Pro)

Novel Architectural Elements

Evaluation-only benchmark paper; no new system architecture proposed. Novelty lies in the dataset construction pipeline (identifying related articles -> summarizing -> simple QA -> cross-document QA linkage).
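
The four construction steps named above (related articles, summaries, simple QA, cross-document linkage) could be chained roughly as follows. This is a sketch under stated assumptions: the `llm` callable, the prompt strings, and the output field names are hypothetical, and related-article identification is assumed to have happened upstream.

```python
from typing import Callable, Dict, List, Tuple

LLM = Callable[[str], str]  # hypothetical text-in, text-out model call


def build_cross_document_qa(
    article_pairs: List[Tuple[str, str]],  # step 1 output: pairs of related articles
    llm: LLM,
) -> List[Dict[str, str]]:
    """Turn each related article pair into one cross-document QA item."""
    items = []
    for art_a, art_b in article_pairs:
        # Step 2: summarize each article separately.
        sum_a = llm(f"Summarize this article:\n{art_a}")
        sum_b = llm(f"Summarize this article:\n{art_b}")
        # Step 3: derive one simple QA pair from each summary.
        qa_a = llm(f"Write one question and its answer from this summary:\n{sum_a}")
        qa_b = llm(f"Write one question and its answer from this summary:\n{sum_b}")
        # Step 4: link the two simple QA pairs into a single question
        # that cannot be answered from either article alone.
        cross = llm(
            "Combine these two QA pairs into one question that requires "
            f"both articles, and give its answer:\n{qa_a}\n{qa_b}"
        )
        items.append(
            {"supporting_a": art_a, "supporting_b": art_b, "cross_document_qa": cross}
        )
    return items
```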

Modeling

Base Model: GPT-4o, Claude 3.5 Sonnet, Mistral-large, Command-R+, Nova Pro

Comparison to Prior Work

vs. XOR QA: XRAG ensures questions are time-sensitive (post-cutoff) and require cross-document reasoning, preventing parametric answering (6.3% of XRAG questions vs ~47% of XOR QA questions are answerable without retrieval)
vs. MIRAGE-Bench: XRAG specifically targets cross-lingual mismatch (English docs for non-English questions) rather than just multilingual capabilities
vs. RGB [not cited in paper]: RGB evaluates noise robustness in English; XRAG extends noise robustness (distractors) to cross-lingual settings

Limitations

Benchmark construction relies on GPT-4o, potentially biasing evaluation in its favor
Evaluation is limited to five distinct languages (English, German, Spanish, Chinese, Arabic)
Requires commercial LLMs (GPT-4o) for the judge panel, which adds cost and makes results harder to reproduce

Reproducibility

The paper describes the data construction pipeline in detail, with prompts provided in figures. XRAG is presented as a benchmark, but no code URL is explicitly provided in the text. Evaluation relies on proprietary models (GPT-4o, Claude) and public models (Mistral, Command-R+).

📊 Experiments & Results

Evaluation Setup

Open-domain QA with provided context (2 gold + 6 distractors)
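
A minimal sketch of assembling the eight-document context for one question follows. The 2 + 6 split comes from the setup above; the function name, the seeded shuffle, and the decision to shuffle at all are assumptions made for illustration.

```python
import random
from typing import List


def assemble_context(gold: List[str], distractors: List[str], seed: int = 0) -> List[str]:
    """Mix 2 supporting and 6 distracting articles into one context list."""
    if len(gold) != 2 or len(distractors) != 6:
        raise ValueError("expected 2 supporting and 6 distracting articles")
    docs = gold + distractors
    # Assumption: shuffle so supporting articles do not sit at fixed positions.
    random.Random(seed).shuffle(docs)
    return docs
```

Using a seeded `random.Random` instance keeps the document order reproducible across runs without touching the global random state.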

Benchmarks:

XRAG-Monolingual: cross-lingual QA, non-English question -> English documents [New]
XRAG-Multilingual: cross-lingual QA, non-English question -> mixed-language documents [New]

Metrics:

Accuracy (judged by LLM panel)
Response Language Correctness (RLC)
Statistical methodology: Cohen's kappa of 0.71 is reported for agreement between the LLM judges and human judges. No significance tests are reported for performance differences between models.
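
For reference, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two raters' labels; the function name is illustrative, and how the paper paired LLM and human verdicts is not specified here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

As a toy example (not data from the paper), `cohens_kappa([1, 1, 0, 0], [1, 1, 0, 1])` has p_o = 0.75 and p_e = 0.5, giving kappa = 0.5.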

Key Results

Performance in the Monolingual Retrieval setting (Non-English Question, English Documents) shows significant degradation compared to English baselines.

Benchmark	Metric	Baseline	This Paper	Δ
XRAG-Monolingual (Average)	Accuracy (%)	75.40	55.50	-19.90
XRAG-Monolingual (Average)	Accuracy (%)	56.40	31.40	-25.00

Response Language Correctness (RLC) is a major failure mode in Monolingual Retrieval settings.

Benchmark	Metric	Baseline	This Paper	Δ
XRAG-Monolingual (Average)	Wrong Language %	0.00	1.30	+1.30
XRAG-Monolingual (Average)	Wrong Language %	0.00	61.10	+61.10

Controlled analysis in Multilingual Retrieval reveals that reasoning over cross-lingual documents is harder than generation.

Benchmark	Metric	Baseline	This Paper	Δ
XRAG-Multilingual (Analysis)	Accuracy (%)	57.58	67.08	+9.50
XRAG-Multilingual (Analysis)	Accuracy (%)	57.58	58.05	+0.47
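
In the Key Results rows, each Δ is the This Paper value minus the Baseline value, in percentage points. The short check below recomputes the column from the numbers listed; the `ROWS` structure and the `check_deltas` helper are introduced here only for illustration.

```python
from typing import List, Tuple

Row = Tuple[str, str, float, float, float]  # benchmark, metric, baseline, this paper, delta

ROWS: List[Row] = [
    ("XRAG-Monolingual (Average)", "Accuracy", 75.40, 55.50, -19.90),
    ("XRAG-Monolingual (Average)", "Accuracy", 56.40, 31.40, -25.00),
    ("XRAG-Monolingual (Average)", "Wrong Language %", 0.00, 1.30, 1.30),
    ("XRAG-Monolingual (Average)", "Wrong Language %", 0.00, 61.10, 61.10),
    ("XRAG-Multilingual (Analysis)", "Accuracy", 57.58, 67.08, 9.50),
    ("XRAG-Multilingual (Analysis)", "Accuracy", 57.58, 58.05, 0.47),
]


def check_deltas(rows: List[Row], tol: float = 1e-6) -> List[Row]:
    """Return the rows whose reported delta differs from (this paper - baseline)."""
    return [r for r in rows if abs((r[3] - r[2]) - r[4]) > tol]


mismatches = check_deltas(ROWS)
print("inconsistent rows:", mismatches)
```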

Experiment Figures

Bar chart of 'Wrong Language Rate' (RLC errors) for five models in the Monolingual Retrieval setting.
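
The wrong-language rate plotted here can be read as the complement of Response Language Correctness. A minimal sketch, assuming language codes have already been assigned to each question and answer by some detector (the detector itself, the function names, and the toy codes are not from the paper):

```python
from typing import Sequence


def response_language_correctness(
    question_langs: Sequence[str], answer_langs: Sequence[str]
) -> float:
    """Fraction of answers written in the same language as their question."""
    if len(question_langs) != len(answer_langs) or len(question_langs) == 0:
        raise ValueError("need two equally long, non-empty sequences")
    matches = sum(q == a for q, a in zip(question_langs, answer_langs))
    return matches / len(question_langs)


def wrong_language_rate(
    question_langs: Sequence[str], answer_langs: Sequence[str]
) -> float:
    """Complement of RLC: share of answers in a different language."""
    return 1.0 - response_language_correctness(question_langs, answer_langs)
```

For instance, with toy inputs `["de", "de", "es", "zh"]` for the questions and `["de", "en", "es", "en"]` for the answers, two of four answers match, so both RLC and the wrong-language rate are 0.5.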

Main Takeaways

Parametric knowledge is insufficient: Models achieve <16% accuracy without retrieval on XRAG, confirming the benchmark successfully targets retrieval-dependent reasoning.
Language Correctness is a bottleneck: In monolingual retrieval settings, many models (especially Mistral-large) revert to English responses instead of the user's language.
Reasoning > Generation: In multilingual retrieval, the primary difficulty is reasoning across documents in different languages, not generating text in the target language.
Human-LLM Gap: Even the best model (GPT-4o, 55.5% in Monolingual Retrieval) trails human performance (85%) by roughly 30 percentage points, highlighting the difficulty of cross-document reasoning.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with cross-lingual QA challenges
Knowledge of LLM-as-a-Judge evaluation

Key Terms

RAG: Retrieval-Augmented Generation. AI systems that answer questions by first retrieving relevant documents and then generating an answer conditioned on them

Monolingual Retrieval: A cross-lingual RAG setting where the user asks in a non-English language, but the retrieval system provides only English documents

Multilingual Retrieval: A cross-lingual RAG setting where the retrieval system provides documents in both English and the user's native language

Response Language Correctness: A metric checking if the LLM's generated answer matches the language of the user's question

Parametric Knowledge: Information stored in the model's weights during pre-training, as opposed to information retrieved from external documents

Cross-document reasoning: The ability to answer questions that require combining partial clues from multiple separate documents

LLM-as-a-Judge: Using a strong LLM (like GPT-4) to evaluate the correctness of another model's output