RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

📝 Paper Summary

Modularized RAG pipeline, Benchmark datasets, Metrics and evaluation

LFRQA provides high-quality, human-written long-form answers that aggregate multiple documents across domains, enabling RAG-QA Arena to evaluate RAG systems robustly with LLM judges instead of noisy retrieval-based metrics.

Core Problem

Existing RAG-QA benchmarks (like RobustQA) use short extractive answers that penalize modern LLMs' long-form responses via token overlap metrics, while naive concatenation of short answers yields incoherent references.

Why it matters:

Real-world RAG systems generate comprehensive narratives, not just short spans, making extractive metrics (EM, F1) unsuitable
Current benchmarks lack high-quality cross-domain long-form references, hindering the measurement of out-of-domain (OOD) robustness
Evaluation typically requires checking retrieved passages, which is noisy; evaluating directly against a high-quality 'gold' answer is cleaner

Concrete Example: In RobustQA, the answer to 'Are password manager apps safe?' consists of disjoint fragments like 'many ways to compromise' and 'mostly anecdotal evidence'. When an LLM generates a fluent paragraph explaining safety nuances, it gets a low F1 score against these fragments despite being correct.
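
The mismatch described in this example can be reproduced with a small token-overlap F1 sketch. The tokenizer below is a simplified normalizer, and the fluent answer is a hypothetical sentence written only for this illustration; only the reference fragments come from the example above.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (a simplified normalizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference string."""
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Fragments quoted in the RobustQA example above, concatenated.
reference = "many ways to compromise mostly anecdotal evidence"

# Hypothetical fluent answer (an assumption, not taken from the paper).
prediction = (
    "Password managers are generally safe when protected by a strong "
    "master password, although no tool is free of risk and attackers "
    "have several ways to compromise a device."
)

score = token_f1(prediction, reference)
print(f"token F1 = {score:.2f}")
```

The long answer shares only a few tokens with the fragments, so precision collapses and the F1 stays low even though the answer is reasonable.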

Key Novelty

Long-form RobustQA (LFRQA) & RAG-QA Arena

Constructs a dataset where human annotators integrate multiple short extractive spans into single, coherent long-form narratives across 7 domains
Establishes an evaluation framework using LLMs as judges to compare model outputs directly against these comprehensive ground-truth answers (pairwise preference)
Demonstrates that evaluating against LFRQA answers correlates highly with human judgment, eliminating the need to reference potentially noisy retrieved documents during evaluation

Architecture

The RAG-QA Arena evaluation framework workflow.

Evaluation Highlights

LFRQA answers are preferred over RobustQA's concatenated extractive answers with a 93.3% win rate by human judges
Only 41.3% of GPT-4o's answers (using top-10 retrieval) are preferred over LFRQA's human-written ground truth, showing the benchmark remains challenging
GPT-4-0125-preview as a judge achieves 0.52 Pearson correlation with human annotators, validating the model-based evaluation framework

Breakthrough Assessment

8/10

Significantly improves RAG evaluation by moving from extractive metrics to semantic long-form comparison. The dataset addresses a critical gap (coherent multi-doc aggregation) and the arena framework is practical for modern LLMs.

⚙️ Technical Details

Problem Definition

Setting: Generative Question Answering where a model P generates answer text given a query q and retrieved passages C_q

Inputs: Query q, Top-K retrieved passages C_q

Outputs: Long-form generated answer w_1...w_T
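
One common way to write this setting down is the standard autoregressive factorization below. This form is an assumption for illustration; the summary itself only names the model P, the query q, the passages C_q, and the output tokens w_1...w_T.

```latex
P(w_1, \dots, w_T \mid q, C_q) \;=\; \prod_{t=1}^{T} P\left(w_t \mid w_{<t},\, q,\, C_q\right)
```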

Pipeline Flow

Retriever (ColBERTv2 fetches top-K passages)
Generator (LLM generates answer from passages)
Evaluator (LLM compares generated answer vs. LFRQA ground truth)
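
The three stages above can be sketched as plain callables wired together. Everything here is a hypothetical stand-in: the word-overlap retriever takes the place of ColBERTv2, and the generator and judge are stubs where an LLM call would go. The one detail taken from the summary is that the judge sees the query, the model answer, and the reference answer, but not the retrieved passages.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces for the three stages.
Retriever = Callable[[str, int], list[str]]
Generator = Callable[[str, list[str]], str]
Judge = Callable[[str, str, str], str]  # returns "model", "reference", or "tie"


@dataclass
class ArenaRecord:
    query: str
    passages: list[str]
    model_answer: str
    verdict: str


def run_pipeline(query: str, reference_answer: str, retrieve: Retriever,
                 generate: Generator, judge: Judge, k: int = 5) -> ArenaRecord:
    passages = retrieve(query, k)                           # Retriever stage
    model_answer = generate(query, passages)                # Generator stage
    verdict = judge(query, model_answer, reference_answer)  # Evaluator stage
    return ArenaRecord(query, passages, model_answer, verdict)


# Toy stand-ins so the sketch runs without any model or index.
CORPUS = [
    "Password managers encrypt the vault with a master password.",
    "A compromised device can leak stored credentials.",
    "Sourdough bread needs a long fermentation.",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    """Rank passages by shared lowercase words (not ColBERTv2)."""
    q_words = set(query.lower().split())
    ranked = sorted(CORPUS,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]


def toy_generate(query: str, passages: list[str]) -> str:
    """Stub generator: concatenates passages instead of calling an LLM."""
    return " ".join(passages)


def toy_judge(query: str, model_answer: str, reference_answer: str) -> str:
    """Stub judge: a real judge would be an LLM pairwise comparison."""
    return "tie" if model_answer == reference_answer else "reference"


record = run_pipeline("Are password managers safe?",
                      "They are generally safe, with caveats.",
                      toy_retrieve, toy_generate, toy_judge, k=2)
print(record.verdict, len(record.passages))
```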

System Modules

Retriever

Select K most relevant passages from the corpus

Model or implementation: ColBERTv2

Generator

Generate a coherent answer based on retrieved context

Model or implementation: Various LLMs (e.g., GPT-4o, Llama-3-70b-Instruct)

Evaluator

Judge preference between Model Answer and LFRQA Answer

Model or implementation: GPT-4-0125-preview

Novel Architectural Elements

Evaluation framework using coherent human-written long-form answers (LFRQA) as the sole reference 'gold standard' for pairwise LLM judging, replacing reliance on retrieved passages or extractive span overlap
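
A pairwise judging step of this kind might look like the sketch below. The prompt wording, the reply format, and the random ordering of the two answers are all assumptions made for illustration; they are not the paper's actual prompt or protocol. Shuffling the order is a common precaution against position bias and is included only as a design suggestion.

```python
import random

# Hypothetical prompt template (not the paper's prompt).
PROMPT_TEMPLATE = (
    "Question: {query}\n\n"
    "Answer 1: {a1}\n\n"
    "Answer 2: {a2}\n\n"
    "Which answer is better? Reply with exactly one of: 1, 2, tie."
)


def build_pairwise_prompt(query: str, model_answer: str,
                          reference_answer: str,
                          rng: random.Random) -> tuple[str, bool]:
    """Return the prompt and whether the model answer was shown first."""
    model_first = rng.random() < 0.5
    if model_first:
        a1, a2 = model_answer, reference_answer
    else:
        a1, a2 = reference_answer, model_answer
    return PROMPT_TEMPLATE.format(query=query, a1=a1, a2=a2), model_first


def parse_verdict(raw_reply: str, model_first: bool) -> str:
    """Map the judge's reply back to 'model', 'reference', or 'tie'."""
    reply = raw_reply.strip().lower()
    if reply == "tie":
        return "tie"
    if reply not in {"1", "2"}:
        raise ValueError(f"unexpected judge reply: {raw_reply!r}")
    picked_first = reply == "1"
    return "model" if picked_first == model_first else "reference"


rng = random.Random(0)
prompt, model_first = build_pairwise_prompt(
    "Are password manager apps safe?",
    "Model-generated answer text.",
    "Human-written reference answer text.",
    rng,
)
# The prompt would be sent to the judge LLM; here a reply is simulated.
print(parse_verdict("2", model_first))
```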

Modeling

Base Model: Various (GPT-4o, GPT-4-turbo, Mixtral-8x22B, Llama-3-70B, Qwen1.5-110B, Command R+)

Comparison to Prior Work

vs. RobustQA: LFRQA integrates spans into coherent narratives rather than disjoint lists
vs. ASQA: LFRQA is multi-domain (7 domains), whereas ASQA focuses on ambiguous factoid questions
vs. ELI5: LFRQA provides verified citations to source docs for quality control
vs. RGB [not cited in paper]: RGB focuses on noise robustness and counterfactuals, while LFRQA focuses on multi-hop aggregation and cross-domain generalization

Limitations

A win rate of only 41.3% for the best model suggests either that the dataset is very hard or that the human annotators had an information advantage (although they were restricted to the provided documents)
Evaluation relies on proprietary LLMs (GPT-4) as judges, which incurs cost and potential bias
Retrieval component is fixed (ColBERTv2); impact of different retrievers not fully explored in main leaderboard
BioASQ answers are notably shorter than other domains due to factoid nature

Reproducibility

Code: https://github.com/awslabs/rag-qa-arena

📊 Experiments & Results

Evaluation Setup

RAG-QA across 7 domains (Finance, Lifestyle, Recreation, Technology, Science, Writing, BioASQ)

Benchmarks:

LFRQA (Long-form RobustQA) [New]

Metrics:

Win Rate (against LFRQA ground truth)
Win+Tie Rate
Elo Rating
Pearson Correlation (Human vs. Model Judge)
Statistical methodology: 95% Confidence Intervals calculated for Elo ratings.
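
The win-based metrics listed above can be computed from a list of pairwise verdicts, as in the sketch below. The verdict labels are hypothetical names, and the Elo update uses the conventional chess constants (K = 32, scale 400) as an assumption; the paper's exact Elo procedure and its confidence-interval computation are not described in this summary.

```python
from collections import Counter


def win_and_tie_rates(verdicts: list[str]) -> tuple[float, float]:
    """Return (win rate, win+tie rate) in percent for the model side."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    n = len(verdicts)
    win = 100 * counts["model"] / n
    win_tie = 100 * (counts["model"] + counts["tie"]) / n
    return win, win_tie


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update; score_a is 1.0 (A wins), 0.5 (tie), or 0.0 (A loses)."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Hypothetical verdicts for ten comparisons against the reference answers.
verdicts = ["model"] * 4 + ["tie"] * 2 + ["reference"] * 4
win, win_tie = win_and_tie_rates(verdicts)
print(win, win_tie)

# Two equally rated systems; A wins one comparison.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)
```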

Key Results

Validation of LFRQA superiority over RobustQA and GPT-4 baselines via pairwise human evaluation.
Benchmark	Metric	Baseline	This Paper	Δ
LFRQA Subsample	Win Rate (Human)	1.7	93.3	+91.6
LFRQA Subsample	Win Rate (Human)	24.3	57.3	+33.0

RAG-QA Arena Leaderboard results showing model performance against LFRQA ground truth.
Benchmark	Metric	Baseline	This Paper	Δ
RAG-QA Arena	Win Rate vs LFRQA	30.4	37.1	+6.7
RAG-QA Arena	Win+Tie Rate vs LFRQA	59.2	62.8	+3.6
LFRQA Subsample	Pearson Correlation	0.53	0.60	+0.07

Experiment Figures

Distribution of document usage in LFRQA answers.

Main Takeaways

LFRQA answers are significantly preferred (93.3% win rate) over RobustQA's concatenated extractive spans, validating the need for long-form ground truth.
Increasing retrieval context from Top-5 to Top-10 improves GPT-4o's win rate against LFRQA from 37.1% to 41.3%, showing model sensitivity to context amount.
GPT-4o leads the leaderboard but struggled with CoT prompting in 'no answer' scenarios: it produced an answer in its thought trace but 'no answer' in its final output, which required a prompt adjustment.
Model-based evaluation (GPT-4-0125-preview) shows high correlation with human judgment (>0.5 Pearson), enabling scalable benchmarking.
LFRQA is challenging: even the best model (GPT-4o with Top-10 context) loses to the human reference answers 47.1% of the time.
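
The judge-versus-human agreement mentioned in these takeaways is a Pearson correlation over paired preference scores. The sketch below shows the computation on made-up data; the score encoding (-1 reference preferred, 0 tie, +1 model preferred) and the sample values are assumptions for illustration, not the paper's data or encoding.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical preference scores for eight comparisons.
human_scores = [1, 0, -1, -1, 1, 0, -1, 1]
judge_scores = [1, -1, -1, 0, 1, 0, -1, 0]

r = pearson(human_scores, judge_scores)
print(f"Pearson r = {r:.2f}")
```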

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG)
LLM-as-a-Judge evaluation methods
Extractive vs. Generative QA metrics

Key Terms

RAG-QA: Retrieval-Augmented Generative Question Answering—systems that retrieve documents and generate a free-form answer

LFRQA: Long-form RobustQA, the new dataset proposed in this paper with coherent long-form answers

RobustQA: A prior dataset containing short, extractive answer spans for RAG tasks

Elo rating: A ranking system originally for chess, used here to rank LLM performance based on pairwise win rates

CoT: Chain-of-Thought—a prompting technique encouraging models to 'think' step-by-step before answering

BioASQ: A biomedical semantic indexing and question answering challenge/dataset

ColBERTv2: A specific retrieval model architecture that uses late interaction of token embeddings

Pearson Correlation: A statistical measure of linear correlation between two sets of data (here, human vs. model scores)

Cohen's Kappa: A statistic that measures inter-annotator agreement for categorical items

F1 score: A metric balancing precision and recall, traditionally used for token overlap in extractive QA