Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

📝 Paper Summary

Modularized RAG pipeline, Evaluation methodology

A large-scale study of TREC 2024 RAG Track submissions reveals that GPT-4o correlates highly with human judges for evaluating whether RAG answers are supported by citations, potentially exceeding average human reliability.

Core Problem

Evaluating RAG systems requires assessing 'support' (whether citations actually back up claims), but scaling human annotation is expensive and slow, while the reliability of LLM judges for this specific task remains unproven at large scale.

Why it matters:

RAG systems are deployed to reduce hallucinations, but without reliable support evaluation, developers cannot verify if citations are accurate or merely decorative
Current evaluation often relies on unvalidated 'automatic judges', but it is unknown if these proxies can replace humans for nuanced fact-checking tasks
Human annotation is costly and prone to inter-annotator disagreement, creating a bottleneck for iterative system improvement

Concrete Example: A RAG system might generate a fluent answer citing a document about 'apple pie' to support a claim about 'apple juice'. A human judge would mark this 'No Support'. The paper investigates if GPT-4o can reliably catch this mismatch across thousands of examples compared to human annotators.

Key Novelty

Large-Scale Human-LLM Comparative Study for RAG Support

Contrasts two human annotation workflows (manual from scratch vs. post-editing LLM predictions) against fully automated GPT-4o judgments on TREC 2024 RAG Track data
Conducts an unbiased disagreement analysis using an expert independent judge to determine ground truth when humans and LLMs differ

Evaluation Highlights

72% exact agreement between human judges and GPT-4o when humans post-edit (seeing the LLM predictions first), compared to 56% when humans annotate manually from scratch
High correlation (Kendall's tau > 0.79) between GPT-4o and human judges for ranking RAG systems by weighted precision and recall
Independent expert judge agreed more with GPT-4o (Cohen's kappa 0.27) than with the original human annotators (kappa 0.07) on disagreement cases, suggesting GPT-4o may be more reliable than the original annotators on these cases, although both kappa values are low in absolute terms

Breakthrough Assessment

7/10

Strong empirical evidence that GPT-4o can serve as a reliable judge for RAG support, challenging the assumption that human annotation is always the gold standard. Valuable for the evaluation community.

⚙️ Technical Details

Problem Definition

Setting: Evaluation of RAG system outputs for citation support

Inputs: Answer sentence a_i and cited passage d_j

Outputs: Support label s_{i,j} (Full Support, Partial Support, No Support)
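The input/output definition above can be made concrete with a small data model. This is only an illustrative sketch: the class names, field names, and the `parse_label` helper are hypothetical and do not come from the paper; only the three label values are taken from the definition above.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    """Three-level support label s_{i,j} for a sentence-citation pair."""
    FULL = "Full Support"
    PARTIAL = "Partial Support"
    NONE = "No Support"


@dataclass(frozen=True)
class SupportJudgment:
    """One judgment; all field names are hypothetical."""
    sentence_index: int  # i, the position of answer sentence a_i
    passage_id: str      # identifier of the cited passage d_j
    label: Support       # the assigned support label s_{i,j}
    judge: str           # who produced the label, e.g. "human" or "gpt-4o"


def parse_label(text: str) -> Support:
    """Map free text to a Support member, ignoring case and extra spaces."""
    normalized = " ".join(text.strip().lower().split())
    for member in Support:
        if member.value.lower() == normalized:
            return member
    raise ValueError(f"unrecognized support label: {text!r}")


judgment = SupportJudgment(
    sentence_index=0,
    passage_id="example-passage",  # made-up identifier
    label=parse_label("  partial   support "),
    judge="gpt-4o",
)
```

Keeping the label as an enum rather than a raw string makes it harder for a misspelled label from either a human or an LLM judge to slip silently into later metric calculations.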

Pipeline Flow

Participant RAG Systems (generate answers)
Assessment Workflow (Human vs. LLM judges)
Metric Calculation (Weighted Precision/Recall)

System Modules

Automatic Judge (Evaluation)

Predict support labels for sentence-citation pairs

Model or implementation: GPT-4o

Human Judge (Condition 1) (Evaluation)

Assess support from scratch

Model or implementation: Human Annotators (NIST-trained)

Human Judge (Condition 2) (Evaluation)

Review and correct LLM predictions

Model or implementation: Human Annotators (NIST-trained)

Modeling

Base Model: GPT-4o

Compute: Inference via Microsoft Azure API; computational cost for training not applicable (evaluation paper)

Comparison to Prior Work

vs. Liu et al.: Conducts large-scale comparison of Human vs. LLM (GPT-4o) specifically for RAG support using NIST assessors
vs. Es et al. / Chen et al.: Validates the automatic judge against high-quality human judgments and an independent expert review, rather than just proposing the metric
vs. G-Eval [not cited in paper]: Focuses specifically on citation support verification rather than general coherence/fluency evaluation

Limitations

Sparse annotation: Only the first cited passage per sentence was evaluated due to budget constraints
Limited topic set: Evaluation covered only 36 topics
Limited model scope: Only GPT-4o was evaluated as an automatic judge; other LLMs were not compared

Reproducibility

Prompt template provided in Figure 1. Data comes from TREC 2024 RAG Track (MS MARCO V2.1 segment collection). 45 participant systems on 36 topics evaluated. Code availability not explicitly mentioned.

📊 Experiments & Results

Evaluation Setup

TREC 2024 RAG Track submissions (45 systems, 36 topics)

Benchmarks:

TREC 2024 RAG Track (Retrieval-Augmented Generation)

Metrics:

Weighted Precision
Weighted Recall
Kendall's tau (correlation)
Cohen's kappa (agreement)
Statistical methodology: Correlation analysis (Kendall's tau) and inter-annotator agreement (Cohen's kappa)

Key Results

Benchmark	Metric	Comparison	Value
Agreement analysis showing how often GPT-4o matches human judgments under different annotation conditions.
TREC 2024 RAG Track	Exact Match %	GPT-4o vs. human judges (manual from scratch)	56
TREC 2024 RAG Track	Exact Match %	GPT-4o vs. human judges (post-editing)	72
Correlation analysis shows GPT-4o ranks systems similarly to human judges.
TREC 2024 RAG Track	Kendall's tau	GPT-4o vs. human system rankings	0.79
Disagreement study with an independent expert judge reveals GPT-4o may be more reliable than the original human annotators.
Disagreement Sample (537 pairs)	Cohen's kappa	Independent expert vs. original human annotators	0.07
Disagreement Sample (537 pairs)	Cohen's kappa	Independent expert vs. GPT-4o	0.27

Main Takeaways

GPT-4o provides a scalable and reliable alternative to human judges for RAG support evaluation, showing high correlation in system rankings.
Human annotators assisted by LLM predictions (post-editing) have higher agreement with the LLM, suggesting a useful human-in-the-loop workflow.
In cases of disagreement, an independent expert often sided with the LLM over the original human judge, highlighting potential noise or errors in human annotation.
Evaluation was constrained to the first citation per sentence; future work should address multi-citation support.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with evaluation metrics like Precision and Recall
Knowledge of LLM-as-a-judge paradigms

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a system first retrieves relevant documents and then generates an answer grounded in them

Support: A metric evaluating whether the information in a generated sentence is factually backed up by its cited document

LLM-as-a-judge: Using a Large Language Model (like GPT-4) to evaluate the quality of outputs from other models

Kendall's tau: A statistic used to measure the ordinal association between two measured quantities (e.g., how similarly two judges rank a list of systems)
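To illustrate how Kendall's tau compares two system rankings, here is a minimal pure-Python sketch of the tau-a variant, which counts concordant and discordant pairs and treats tied pairs as neither. Which variant the paper uses is not stated in this summary, and the scores below are invented for illustration only.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both judges order items i and j the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 means a tie in at least one list; it counts as neither
    return (concordant - discordant) / (n * (n - 1) / 2)


# Invented per-system scores under a human judge and an LLM judge.
human_scores = [0.80, 0.65, 0.50, 0.30]
llm_scores = [0.78, 0.70, 0.45, 0.50]
tau = kendall_tau_a(human_scores, llm_scores)  # 5 concordant, 1 discordant
```

With four systems there are six pairs; five are ordered the same way by both judges and one is flipped, so tau is (5 - 1) / 6, roughly 0.67.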

Cohen's kappa: A statistic that measures inter-annotator agreement for qualitative items, accounting for the possibility of the agreement occurring by chance
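As a worked illustration of the chance correction, the sketch below computes raw exact agreement and Cohen's kappa for two annotators using only the standard library. The label sequences are invented, and the short codes "F", "P", and "N" are a hypothetical shorthand for Full, Partial, and No Support.

```python
from collections import Counter


def exact_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(labels_a)
    p_o = exact_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Invented labels for six sentence-citation pairs.
annotator_a = ["F", "F", "P", "N", "N", "F"]
annotator_b = ["F", "P", "P", "N", "F", "F"]
p_o = exact_agreement(annotator_a, annotator_b)  # 4 of 6 labels match
kappa = cohens_kappa(annotator_a, annotator_b)   # 11/23, about 0.48
```

Here raw agreement is about 0.67, but chance agreement is 13/36, so kappa drops to about 0.48. This is why exact-match percentages and kappa values for the same comparison can look very different.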

weighted precision: A metric in this paper measuring the proportion of citations that support the answer, weighted by support level (1.0 for Full, 0.5 for Partial)

weighted recall: A metric in this paper measuring the proportion of answer sentences supported by citations, weighted by support level
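The two weighted metrics can be sketched as follows. The 1.0 and 0.5 weights come from the definition above; the 0.0 weight for No Support, the choice of denominators (assessed sentence-citation pairs for precision, all answer sentences for recall), and the function names are assumptions made for this illustration, with one assessed citation per sentence.

```python
# Weights for Full and Partial come from the text; 0.0 for No Support is assumed.
WEIGHTS = {"Full Support": 1.0, "Partial Support": 0.5, "No Support": 0.0}


def weighted_precision(cited_labels):
    """Weighted support summed over assessed pairs, divided by their count."""
    if not cited_labels:
        return 0.0  # no citations assessed (assumed convention)
    return sum(WEIGHTS[label] for label in cited_labels) / len(cited_labels)


def weighted_recall(cited_labels, num_answer_sentences):
    """Weighted support divided by the number of sentences in the answer."""
    if num_answer_sentences <= 0:
        raise ValueError("an answer must contain at least one sentence")
    return sum(WEIGHTS[label] for label in cited_labels) / num_answer_sentences


# Invented example: a 5-sentence answer in which 4 sentences carry a citation.
labels = ["Full Support", "Partial Support", "No Support", "Full Support"]
precision = weighted_precision(labels)  # 2.5 / 4 = 0.625
recall = weighted_recall(labels, 5)     # 2.5 / 5 = 0.5
```

Under these assumptions an uncited sentence lowers recall but not precision, while a cited but unsupported sentence lowers both.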

TREC: Text REtrieval Conference, a series of evaluation workshops run by NIST, organized into tracks that each focus on a different information retrieval research area

post-editing: An annotation workflow where humans review and correct pre-generated labels rather than creating them from scratch