CRUX: Controlled Retrieval-augmented Context Evaluation, the proposed framework that uses summarization datasets to create oracle retrieval contexts for evaluating retrieval-augmented generation (RAG)
retrieval context: The set of text chunks retrieved from a knowledge source and passed to the LLM to help generate an answer
coverage: A metric measuring the proportion of essential sub-questions (derived from an oracle summary) that are answerable given the retrieved context
density: A metric measuring the information efficiency of the retrieval context—how much coverage is achieved per token compared to an oracle context
sub-question answerability: A binary or graded judgment of whether a specific text passage contains the answer to a specific sub-question
oracle retrieval context: A theoretically ideal set of passages derived from human summaries, containing exactly the information needed to answer the query without redundancy
MMR: Maximal Marginal Relevance—a re-ranking algorithm that balances relevance to the query with diversity among the selected results to reduce redundancy
nDCG: Normalized Discounted Cumulative Gain—a standard information retrieval measure of ranking quality that weights highly relevant documents more when they appear earlier in the list
BM25: Best Matching 25, a probabilistic ranking function that scores documents using term frequency, inverse document frequency, and document-length normalization
SPLADE: Sparse Lexical and Expansion Model, a learned sparse retrieval method that expands queries and documents with weighted related terms to improve lexical matching
Contriever: A dense retrieval model trained with unsupervised contrastive learning to embed queries and documents in a shared vector space
LLM-as-a-judge: Using a Large Language Model to evaluate text quality or answerability instead of human annotators
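
The metric and re-ranking definitions above can be made concrete with a minimal Python sketch. It is illustrative only, not the framework's implementation: the function names, the `lam` trade-off parameter, and the toy similarity inputs are hypothetical. The `density` function in particular is an assumed reading of "coverage per token compared to an oracle context" as a plain ratio of two rates, and the framework's exact normalization may differ. `ndcg` and `mmr` follow the standard textbook formulations.

```python
import math
from typing import List, Sequence


def coverage(answerable: Sequence[bool]) -> float:
    """Fraction of sub-questions judged answerable from the retrieved context."""
    if not answerable:
        return 0.0
    return sum(answerable) / len(answerable)


def density(cov: float, n_tokens: int, oracle_cov: float, oracle_tokens: int) -> float:
    """Coverage per token, relative to the oracle context's coverage per token.

    ASSUMPTION: a plain ratio of two rates; the actual normalization used by
    the framework may differ.
    """
    if n_tokens <= 0 or oracle_tokens <= 0 or oracle_cov <= 0:
        return 0.0
    return (cov / n_tokens) / (oracle_cov / oracle_tokens)


def ndcg(gains: Sequence[float], k: int) -> float:
    """nDCG@k with linear gains and a log2 rank discount.

    Simplification: the ideal ranking is built from the gains of the ranked
    list itself, not from every relevant document in the collection.
    """
    def dcg(gs: Sequence[float]) -> float:
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gs[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def mmr(query_sim: Sequence[float],
        pair_sim: Sequence[Sequence[float]],
        k: int,
        lam: float = 0.5) -> List[int]:
    """Greedy MMR selection; returns indices of the chosen items in pick order.

    Each step picks the item maximizing
    lam * relevance - (1 - lam) * (max similarity to anything already picked).
    """
    selected: List[int] = []
    remaining = list(range(len(query_sim)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, `coverage([True, False, True, True])` gives 0.75. With query similarities `[0.9, 0.85, 0.3]`, where the first two passages are near-duplicates (pairwise similarity 0.95) and the third is dissimilar to both (0.1), `mmr(..., k=2, lam=0.5)` selects indices `[0, 2]`: the redundant second passage is skipped in favor of the more diverse one. Setting `lam=1.0` disables the diversity term and selects `[0, 1]`.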