
Controlled Retrieval-augmented Context Evaluation for Long-form RAG

Unknown authors
University of Amsterdam, Leiden University, Johns Hopkins University, Human Language Technology Center of Excellence
arXiv
RAG Benchmark QA

📝 Paper Summary

Modularized RAG pipeline Long-form generation
CRUX is an evaluation framework for long-form RAG that uses human-written summaries to construct controlled oracle retrieval contexts, enabling direct measurement of retrieval completeness and redundancy independently of the final generation.
Core Problem
Standard retrieval metrics (e.g., Recall, MRR) focus on relevance ranking. They do not measure whether the retrieved context contains all the necessary information (completeness) or how much of it is repeated (redundancy), both of which matter for long-form generation tasks.
Why it matters:
  • Suboptimal retrieval contexts lead to incomplete or misleading long-form reports, even with powerful generators
  • Current evaluation practices designed for short-answer QA or web search do not capture the multi-aspect coverage required for comprehensive long-form responses
  • Redundant retrieval restricts knowledge diversity, undermining the utility of augmented context within token limits
Concrete Example: For a query about 'US employment report', a standard retriever might return multiple similar passages about 'unemployment rate dropping' (high relevance, high redundancy) while missing crucial details about 'wage growth' or 'sector analysis', leading to an incomplete final report.
Key Novelty
Controlled Retrieval-augmented Context Evaluation (CRUX)
  • Uses human-written multi-document summaries as 'oracle' answers to reverse-engineer the perfect retrieval context, establishing an explicit upper bound for evaluation
  • Evaluates retrieval quality using 'coverage' (how many necessary sub-questions are answered by the retrieved text) rather than just keyword matching or ranking position
  • Introduces a 'density' metric to penalize retrieval contexts that are answer-rich but inefficiently long compared to the oracle context
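To make the coverage and density ideas concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `coverage` and `relative_density`, the token-count normalization, and the assumption that the oracle context answers every sub-question are all illustrative choices. It also assumes an upstream judge (for example an LLM) has already decided which sub-questions each passage answers. The toy data reuses the 'US employment report' example above.

```python
from typing import List, Set


def coverage(answered_per_passage: List[Set[str]], sub_questions: Set[str]) -> float:
    """Fraction of oracle sub-questions answered by at least one retrieved passage."""
    if not sub_questions:
        return 0.0
    covered = set().union(*answered_per_passage) & sub_questions
    return len(covered) / len(sub_questions)


def relative_density(
    answered_per_passage: List[Set[str]],
    token_counts: List[int],
    sub_questions: Set[str],
    oracle_tokens: int,
) -> float:
    """Answered sub-questions per token, relative to the oracle context.

    Assumption (hypothetical, for illustration): the oracle context answers
    every sub-question using `oracle_tokens` tokens.
    """
    tokens = sum(token_counts)
    if tokens == 0 or oracle_tokens == 0 or not sub_questions:
        return 0.0
    covered = set().union(*answered_per_passage) & sub_questions
    retrieved_rate = len(covered) / tokens
    oracle_rate = len(sub_questions) / oracle_tokens
    return retrieved_rate / oracle_rate


# Toy data: sub-questions derived from the oracle summary (labels are made up).
sub_questions = {"unemployment_rate", "wage_growth", "sector_analysis"}

# Three relevant but redundant passages, 100 tokens each.
redundant = [{"unemployment_rate"}, {"unemployment_rate"}, {"unemployment_rate"}]
# Three passages that each answer a different sub-question.
diverse = [{"unemployment_rate"}, {"wage_growth"}, {"sector_analysis"}]
token_counts = [100, 100, 100]

redundant_coverage = coverage(redundant, sub_questions)
diverse_coverage = coverage(diverse, sub_questions)
redundant_density = relative_density(redundant, token_counts, sub_questions, 300)
diverse_density = relative_density(diverse, token_counts, sub_questions, 300)
```

Under these assumptions, the redundant context covers one of three sub-questions at the same token cost as the diverse one, so both its coverage and its relative density are one third of the diverse context's, even though every passage in it is relevant.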
Evaluation Highlights
  • Proposed coverage metrics show strong ranking correlation (Kendall's τ ≈ 0.7-0.8) with the quality of the final generated text, significantly outperforming standard ranking metrics like nDCG (τ < 0.6)
  • High alignment with human judgment: Spearman correlation ρ ≥ 0.8 between automated LLM-based coverage scores and human annotations
  • Standard retrieval methods (e.g., BM25, Dense Retrieval) achieve poor coverage compared to the oracle upper bound (e.g., 34.2 vs 64.6 on DUC), revealing significant room for improvement
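The ranking correlations above compare how a retrieval metric orders systems with how the quality of their generated text orders them. The sketch below shows one simple way to compute such a correlation, using Kendall's τ in its tau-a form (tied pairs count as neither concordant nor discordant). The function name `kendall_tau_a` and all scores are hypothetical toy values, not results from the paper, and the paper may use a different τ variant.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items to rank")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair is ordered the same way.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical scores for four retrieval systems (made-up numbers).
coverage_scores = [0.2, 0.4, 0.6, 0.8]      # retrieval-side metric
generation_quality = [0.3, 0.5, 0.4, 0.9]   # quality of the final generated text

tau = kendall_tau_a(coverage_scores, generation_quality)
```

In this toy example five of the six system pairs are ordered the same way by both scores and one is reversed, giving τ = 4/6. A value near 1 means the retrieval metric ranks systems almost the same way the final generation quality does.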
Breakthrough Assessment
7/10
Offers a necessary shift from relevance-based to coverage-based evaluation for long-form RAG. While the methodology is sound and diagnostic, it relies on specific summarization datasets for the 'controlled' aspect, potentially limiting immediate application to arbitrary custom corpora.