
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models

R Pradeep, N Thakur, S Upadhyay, D Campos…
University of Waterloo, Microsoft, National Institute of Standards and Technology
arXiv, 4/2025
RAG Benchmark QA

📝 Paper Summary

Modularized RAG pipeline, Metrics and evaluation
AutoNuggetizer modernizes the TREC nugget evaluation methodology by using LLMs to automatically create atomic facts (nuggets) and assign them to RAG answers, achieving high correlation with human assessors.
Core Problem
Evaluating long-form RAG responses is difficult because manual assessment is labor-intensive and non-scalable, while existing automatic metrics often lack correlation with human judgment on complex information needs.
Why it matters:
  • The lack of standardized, scalable evaluations hinders progress in RAG systems, as manual evaluation is too slow and expensive for rapid iteration
  • Current automatic metrics often fail to capture whether a system synthesized the specific atomic facts (nuggets) required to answer a complex query
Concrete Example: For the query 'how did african rulers contribute to the triangle trade', a system might generate fluent text that misses key facts (e.g., 'captured people during wars'). Manual assessors catch this by checking for specific 'nuggets', but standard overlap metrics might miss the semantic omission.
Key Novelty
AutoNuggetizer Framework
  • Refactors the 2003 TREC QA nugget methodology for the LLM era: uses GPT-4o to extract atomic facts (nuggets) from relevant documents instead of manual curation
  • Automates the 'grading' phase: uses an LLM to determine whether a system's answer contains each nugget, with nuggets labeled by importance (Vital vs. Okay), rather than having human assessors read every answer
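The two phases above can be sketched as a small scoring routine. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Nugget` class, `keyword_judge`, and `nugget_recall` are hypothetical names, the keyword matcher is a toy stand-in for the LLM judge, the recall-style score is a simplification rather than the paper's exact metric, and the second nugget is invented for the example (the first comes from the query example above).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Nugget:
    text: str
    importance: str  # "vital" or "okay" (labels taken from the summary above)


def keyword_judge(nugget: Nugget, answer: str) -> bool:
    """Toy stand-in for the LLM judge (assumption): the nugget counts as
    present only if every one of its words appears in the answer."""
    answer_words = set(answer.lower().split())
    return all(word in answer_words for word in nugget.text.lower().split())


def nugget_recall(
    nuggets: List[Nugget],
    answer: str,
    judge: Callable[[Nugget, str], bool] = keyword_judge,
    vital_only: bool = False,
) -> float:
    """Fraction of (optionally vital-only) nuggets the judge finds in the answer."""
    pool = [n for n in nuggets if n.importance == "vital"] if vital_only else list(nuggets)
    if not pool:
        return 0.0
    return sum(judge(n, answer) for n in pool) / len(pool)


# Hypothetical nugget list for the triangle-trade query.
nuggets = [
    Nugget("captured people during wars", "vital"),
    Nugget("traded with european merchants", "okay"),  # invented for illustration
]
answer = "African rulers captured people during wars and sold them."

all_score = nugget_recall(nuggets, answer)
vital_score = nugget_recall(nuggets, answer, vital_only=True)
print(all_score, vital_score)
```

In a real pipeline the `judge` argument would wrap an LLM call; keeping it as a pluggable function makes it easy to swap a human label lookup for the automatic judge when comparing the two.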
Evaluation Highlights
  • Fully automatic nugget evaluation shows strong run-level correlation with human-based variants (Kendall's tau > 0.8 in many settings)
  • Automating only the nugget assignment (grading) step yields stronger agreement with manual ground truth than fully automating both creation and assignment
  • LLM assessors tend to be stricter than NIST human assessors when assigning nuggets to answers
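Run-level correlation compares how two evaluation variants rank the same set of systems. The sketch below computes Kendall's tau (the tau-a variant, which has no tie correction) from scratch; the function name `kendall_tau_a` and all scores are made up for illustration and are not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-run scores for five systems (not from the paper).
manual_scores = [0.62, 0.55, 0.48, 0.40, 0.31]
auto_scores = [0.58, 0.50, 0.52, 0.35, 0.30]

tau = kendall_tau_a(manual_scores, auto_scores)
print(round(tau, 3))
```

Here the two variants disagree on the order of only one pair of systems (the second and third), so 9 of 10 pairs are concordant and tau is 0.8.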
Breakthrough Assessment
7/10
Provides a rigorous validation of LLM-based evaluation against high-quality NIST human judgments. While the methodology is a 'refactoring' of old techniques, the validation at TREC scale makes it a significant practical contribution.