Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics

Y Hua, G Castellucci, P Schulam, H Elfardy, K Small
Department of Computer Science and Cornell Tech, Cornell University, Amazon
arXiv, 1/2026 (2026)
RAG RL QA

📝 Paper Summary

Modularized RAG pipeline Evaluation methodology
GroGU is a reference-free metric that estimates the utility of retrieved documents for RAG by measuring the change in an LLM's generation confidence (specifically key-token entropy) when conditioned on those documents.
Core Problem
Existing metrics for tuning RAG components rely on costly annotated references or noisy, LLM-agnostic retriever scores that fail to capture how useful a specific document is for a specific generator model.
Why it matters:
  • Reference-based metrics require expensive human annotation for every new domain and fail where 'correct' answers are hard to define
  • Retriever relevance scores are noisy (precision < recall) and LLM-agnostic, ignoring that different models derive different utility from the same document
  • Irrelevant documents can sometimes improve generation for specific models (noise robustness), a nuance standard relevance scores miss
Concrete Example: Two LLMs (Qwen-2-1.5b and Phi-4) both fail to answer 'Who lives in the blue house in Balamory?' without grounding. When given the *same* document, Phi-4 answers correctly while Qwen still fails, showing that utility depends on the specific model, not just the document's general relevance.
Key Novelty
Grounding Generation Utility (GroGU)
  • Defines utility as the reduction in an LLM's uncertainty (entropy) when generating an answer with grounding documents versus without them
  • Introduces 'KeyEntropy' to focus measurement only on tokens that change significantly when grounded, filtering out scaffolding phrases (e.g., 'The answer is...') that skew confidence scores
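The two ideas above can be sketched as a small calculation. This is an illustration under stated assumptions, not the paper's implementation: it assumes the same answer tokens are scored with and without the grounding documents so that positions line up, it reads "change significantly" as an absolute entropy change at or above a hypothetical `threshold`, and the function names are invented for this example.

```python
import math
from typing import Sequence

Distribution = Sequence[float]  # one next-token probability distribution


def token_entropy(probs: Distribution) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_reduction(ungrounded: Sequence[Distribution],
                      grounded: Sequence[Distribution]) -> float:
    """Mean per-token entropy without documents minus with documents.

    Positive values mean the model is more confident when grounded.
    Assumes both sequences cover the same answer positions.
    """
    deltas = [token_entropy(u) - token_entropy(g)
              for u, g in zip(ungrounded, grounded)]
    return sum(deltas) / len(deltas)


def key_entropy_reduction(ungrounded: Sequence[Distribution],
                          grounded: Sequence[Distribution],
                          threshold: float = 0.1) -> float:
    """Same as entropy_reduction, but averaged only over 'key' tokens.

    A token counts as key when its entropy changes by at least
    `threshold` (a hypothetical cutoff chosen for illustration).
    Scaffolding tokens that barely change are ignored.
    """
    deltas = [token_entropy(u) - token_entropy(g)
              for u, g in zip(ungrounded, grounded)]
    key = [d for d in deltas if abs(d) >= threshold]
    return sum(key) / len(key) if key else 0.0


# Toy example: position 0 is a scaffolding token (unchanged),
# position 1 is an answer token that becomes certain once grounded.
ungrounded = [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
grounded = [[0.5, 0.5], [1.0, 0.0, 0.0, 0.0]]

plain = entropy_reduction(ungrounded, grounded)
keyed = key_entropy_reduction(ungrounded, grounded)
```

In the toy example the unchanged scaffolding token halves the plain average, while the key-token variant reports the full drop on the answer token. That dilution is the effect the filtering is meant to remove.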
Evaluation Highlights
  • +18.2 points in Mean Reciprocal Rank (MRR) for retrieval when training a query-rewriter using GroGU signals instead of relevance scores
  • +9.4 percentage points in answer accuracy for the downstream generator using the GroGU-optimized rewriter
  • The KeyEntropy metric achieves a Kendall's tau correlation of 0.377 with actual generation correctness, significantly outperforming relevance scores, which correlate negatively
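For readers unfamiliar with the two statistics reported above, here is a minimal sketch of how MRR and Kendall's tau are computed. The inputs are toy values, not data from the paper, and the tau shown is the simple tau-a variant without tie correction; this summary does not say which variant the authors used.

```python
from itertools import combinations
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """MRR over queries.

    Each entry is the 1-based rank of the first relevant document for a
    query, or None when no relevant document was retrieved (scores 0).
    """
    total = sum(1.0 / r if r else 0.0 for r in first_relevant_ranks)
    return total / len(first_relevant_ranks)


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy inputs: ranks for three queries, and a metric scored against
# binary correctness labels for four generations.
mrr = mean_reciprocal_rank([1, 2, None])
tau = kendall_tau_a([0.9, 0.2, 0.5, 0.1], [1, 0, 1, 0])
```

Because correctness labels are often binary, many pairs are tied, so a tie-corrected variant such as tau-b would give a different value on the same data.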
Breakthrough Assessment
7/10
Strong practical contribution for automating RAG tuning without labels. The KeyEntropy formulation addresses a specific failure mode of perplexity. The gains are significant, though so far they are demonstrated only on query rewriting.