
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering

R Han, Y Zhang, P Qi, Y Xu, J Wang, L Liu, WY Wang…
AWS AI Labs, Google, Samaya.ai, Orby.ai, University of California, Santa Barbara
arXiv, July 2024
RAG Benchmark QA

📝 Paper Summary

Modularized RAG pipeline, Benchmark datasets, Metrics and evaluation
LFRQA provides high-quality, human-written long-form answers that aggregate information from multiple documents across domains, enabling RAG-QA Arena to evaluate RAG systems robustly with LLM judges rather than noisy retrieval-based metrics.
Core Problem
Existing RAG-QA benchmarks (like RobustQA) provide only short extractive answers, so token-overlap metrics penalize modern LLMs' long-form responses, while naively concatenating the short answers yields incoherent references.
Why it matters:
  • Real-world RAG systems generate comprehensive narratives, not just short spans, making extractive metrics (EM, F1) unsuitable
  • Current benchmarks lack high-quality cross-domain long-form references, hindering the measurement of out-of-domain (OOD) robustness
  • Evaluation typically requires checking retrieved passages, which is noisy; evaluating directly against a high-quality 'gold' answer is cleaner
Concrete Example: In RobustQA, the answer to 'Are password manager apps safe?' consists of disjoint fragments like 'many ways to compromise' and 'mostly anecdotal evidence'. When an LLM generates a fluent paragraph explaining safety nuances, it gets a low F1 score against these fragments despite being correct.
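The mismatch can be illustrated with a minimal sketch of token-level F1. This assumes simple lowercase whitespace tokenization, which is not necessarily the normalization used in the paper; the function name `token_f1` and the fluent model answer are hypothetical, and only the two reference fragments come from the example above.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (whitespace tokens, lowercased)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Reference built by concatenating the extractive fragments quoted above.
reference = "many ways to compromise mostly anecdotal evidence"
# Hypothetical fluent answer, invented for illustration only.
prediction = (
    "Password managers are mostly safe, but there are many ways to "
    "compromise a device, so no tool is risk free."
)
print(round(token_f1(prediction, reference), 2))  # 0.37
```

Only 5 of the 20 predicted tokens match the fragments, so precision drags the score down even though the answer is fluent and on topic.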
Key Novelty
Long-form RobustQA (LFRQA) & RAG-QA Arena
  • Constructs a dataset where human annotators integrate multiple short extractive spans into single, coherent long-form narratives across 7 domains
  • Establishes an evaluation framework using LLMs as judges to compare model outputs directly against these comprehensive ground-truth answers (pairwise preference)
  • Demonstrates that evaluating against LFRQA answers correlates highly with human judgment, eliminating the need to reference potentially noisy retrieved documents during evaluation
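The pairwise-preference setup described above can be sketched as a small scoring loop. This is an illustration under stated assumptions, not the paper's implementation: the `Judge` interface, the verdict labels, the choice to count ties as non-wins, and `toy_judge` (a placeholder that stands in for an actual LLM call) are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

# (question, model_answer, reference_answer); this layout is an assumption.
Example = Tuple[str, str, str]
# A judge returns "model", "reference", or "tie". In practice it would wrap
# an LLM call; the interface here is hypothetical.
Judge = Callable[[str, str, str], str]

VALID_VERDICTS = {"model", "reference", "tie"}


def win_rate(examples: Iterable[Example], judge: Judge) -> float:
    """Fraction of questions where the judge prefers the model answer.

    Ties count as non-wins, which is an assumption of this sketch.
    """
    wins = 0
    total = 0
    for question, model_answer, reference_answer in examples:
        verdict = judge(question, model_answer, reference_answer)
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        total += 1
        wins += verdict == "model"
    if total == 0:
        raise ValueError("no examples to score")
    return wins / total


def toy_judge(question: str, model_answer: str, reference_answer: str) -> str:
    """Placeholder for an LLM judge: simply prefers the longer answer."""
    if len(model_answer) == len(reference_answer):
        return "tie"
    return "model" if len(model_answer) > len(reference_answer) else "reference"
```

Note that the judge sees only the question and the two answers, never the retrieved passages, which mirrors the idea of comparing directly against the reference answer.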
Architecture
Figure 4: The RAG-QA Arena evaluation framework workflow.
Evaluation Highlights
  • Human judges prefer LFRQA answers over RobustQA's concatenated extractive answers, with a 93.3% win rate
  • Only 41.3% of GPT-4o's answers (using top-10 retrieval) are preferred over LFRQA's human-written ground truth, showing the benchmark remains challenging
  • GPT-4-0125-preview as a judge achieves a Pearson correlation of 0.52 with human annotators, validating the model-based evaluation framework
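For readers unfamiliar with the agreement statistic, here is a minimal sketch of Pearson correlation between judge and human preferences. The numeric encoding (+1 model preferred, 0 tie, -1 reference preferred) and the sample labels are assumptions made for illustration; they are not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical preference labels: +1 model, 0 tie, -1 reference.
human_labels = [1, -1, 0, 1, -1]
judge_labels = [1, 0, 0, 1, -1]
print(pearson(human_labels, judge_labels))
```

A value near 1 would mean the judge ranks answer pairs almost exactly as humans do, while a value near 0 would mean no linear relationship.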
Breakthrough Assessment
8/10
Significantly improves RAG evaluation by moving from extractive metrics to semantic long-form comparison. The dataset addresses a critical gap (coherent multi-doc aggregation) and the arena framework is practical for modern LLMs.