Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

N Thakur, R Pradeep, S Upadhyay, D Campos…
University of Waterloo, Microsoft
arXiv, 4/2025
RAG Factuality Benchmark

📝 Paper Summary

Modularized RAG pipeline · Evaluation methodology
A large-scale study of TREC 2024 RAG Track submissions reveals that GPT-4o correlates highly with human judges for evaluating whether RAG answers are supported by citations, potentially exceeding average human reliability.
Core Problem
Evaluating RAG systems requires assessing 'support' (whether citations actually back up claims), but scaling human annotation is expensive and slow, while the reliability of LLM judges for this specific task remains unproven at large scale.
Why it matters:
  • RAG systems are deployed to reduce hallucinations, but without reliable support evaluation, developers cannot verify if citations are accurate or merely decorative
  • Current evaluation often relies on unvalidated 'automatic judges', but it is unknown if these proxies can replace humans for nuanced fact-checking tasks
  • Human annotation is costly and prone to inter-annotator disagreement, creating a bottleneck for iterative system improvement
Concrete Example: A RAG system might generate a fluent answer that cites a document about 'apple pie' to support a claim about 'apple juice'. A human judge would mark this 'No Support'. The paper investigates whether GPT-4o catches such mismatches as reliably as human annotators do across thousands of examples.
Key Novelty
Large-Scale Human-LLM Comparative Study for RAG Support
  • Contrasts two human annotation workflows (manual from scratch vs. post-editing LLM predictions) against fully automated GPT-4o judgments on TREC 2024 RAG Track data
  • Conducts a disagreement analysis in which an independent expert judge re-assesses the cases where the human and GPT-4o labels differ
Evaluation Highlights
  • 72% perfect agreement between human judges and GPT-4o when humans post-edit (seeing the LLM predictions first), compared to 56% when humans annotate manually from scratch
  • High correlation (Kendall's tau > 0.79) between GPT-4o and human judges for ranking RAG systems by weighted precision and recall
  • On the disagreement cases, the independent expert judge agreed more with GPT-4o (Cohen's kappa 0.27) than with the original human annotators (kappa 0.07), suggesting GPT-4o may be more reliable than the original annotators on these cases
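
The three statistics quoted in the highlights above (perfect agreement, Cohen's kappa, Kendall's tau) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the label names, the toy label lists, and the per-system scores are invented and are not the paper's data, and the tau shown is the simple tie-free variant (tau-a), which may differ from the variant the authors used.

```python
from collections import Counter
from itertools import combinations


def percent_agreement(a, b):
    """Fraction of items on which two judges assign the identical label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both judges pick the same label independently.
    p_expected = sum(
        counts_a[label] * counts_b[label] for label in set(a) | set(b)
    ) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def kendall_tau(x, y):
    """Kendall's tau-a over paired scores (assumes no tied scores)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical support labels for eight answer sentences (not from the paper).
human = ["full", "partial", "none", "full", "none", "partial", "full", "none"]
llm = ["full", "none", "none", "full", "partial", "partial", "full", "full"]

# Hypothetical per-system scores under each judge (not from the paper).
human_scores = [0.81, 0.74, 0.62, 0.55, 0.40]
llm_scores = [0.85, 0.70, 0.72, 0.50, 0.45]

print(percent_agreement(human, llm))
print(cohens_kappa(human, llm))
print(kendall_tau(human_scores, llm_scores))
```

On this toy data the functions return 0.625, about 0.43, and 0.8. Percent agreement and kappa compare the two judges label by label, while Kendall's tau only asks whether the two judges rank the systems in the same order, which is why a high tau can coexist with modest label-level agreement.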
Breakthrough Assessment
7/10
Strong empirical evidence validating LLMs as reliable judges for RAG support, challenging the assumption that human annotation is always the gold standard. Valuable for the evaluation community.