
KDD Cup CRAG Competition: Systems, Findings and Learnings

Xiao Yang, Yifan Ethan Xu, Kai Sun, Jiaqi Wang, Lingkun Kong, Wen-tau Scott Yih, Xin Luna Dong
Meta Reality Labs, Fundamental AI Research, Meta
IEEE Data Engineering Bulletin (2024)
RAG Benchmark Factuality KG QA

📝 Paper Summary

RAG Benchmarking Retrieval-Augmented Generation Competitions
The CRAG benchmark and KDD Cup competition evaluate RAG systems across diverse domains and dynamism levels, showing that RAG outperforms LLM-only baselines but still leaves significant gaps on dynamic and complex queries.
Core Problem
Existing RAG benchmarks are limited in scope, diversity, and realism: they rarely include mock APIs, dynamic (time-sensitive) questions, or long-tail facts, which makes their measurements of hallucination and accuracy unreliable.
Why it matters:
  • LLMs struggle with hallucinations and lack up-to-date knowledge, necessitating robust RAG systems.
  • Current benchmarks (e.g., NQ, MS MARCO) lack structured retrieval (Knowledge Graphs) and dynamic questions (real-time data).
  • Industry RAG systems need to balance retrieval precision, answer faithfulness, and latency, but standard evaluations don't capture these trade-offs effectively.
Concrete Example: A user asks 'What is the opening stock price of landp last Friday?'. A standard LLM might hallucinate a number. A simple RAG might retrieve an old price. A robust system must parse 'last Friday' relative to the query time (02/28/2024) and use a finance API to get the specific real-time data.
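The date arithmetic in this example can be sketched in a few lines. This is a minimal illustration, not the method of any CRAG participant: the helper name `last_friday` is hypothetical, and it assumes that 'last Friday' means the most recent Friday strictly before the query date (other readings are possible). The finance API call itself is left out.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def last_friday(query_date: date) -> date:
    """Return the most recent Friday strictly before query_date.

    Assumption: a query issued on a Friday refers to the Friday one week earlier.
    """
    days_back = (query_date.weekday() - FRIDAY) % 7
    if days_back == 0:
        days_back = 7
    return query_date - timedelta(days=days_back)


if __name__ == "__main__":
    query_time = date(2024, 2, 28)  # query time from the example above
    print(last_friday(query_time).isoformat())
```

For the query time 02/28/2024 (a Wednesday), this resolves 'last Friday' to 2024-02-23, which is the date a system would then pass to the finance API.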
Key Novelty
Comprehensive RAG (CRAG) Benchmark & Competition Framework
  • Introduces a benchmark with 4.4K QA pairs covering 5 domains (finance, sports, music, movie, open) and 4 dynamism levels (real-time to static).
  • Provides mock APIs for Knowledge Graph access alongside web search results, testing structured vs. unstructured retrieval integration.
  • Implements a hybrid evaluation with a 'Truthfulness' score that heavily penalizes hallucinations (-1) and treats admitting ignorance as neutral (0 for 'I don't know'), so abstaining scores higher than answering incorrectly.
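The effect of this scoring rule can be sketched as follows. This is a simplified illustration under stated assumptions: the summary gives only the -1 and 0 scores, so the +1 for a correct answer, the label names, and the averaging over questions are assumptions made here, not the benchmark's exact rubric.

```python
# Per-answer scores. Only -1 (hallucination) and 0 ("I don't know") come from
# the summary above; +1 for a correct answer is an assumption for illustration.
SCORES = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}


def truthfulness(labels):
    """Average per-answer score over a list of labels (assumed aggregation)."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(SCORES[label] for label in labels) / len(labels)


if __name__ == "__main__":
    # Hypothetical system A answers every question and is wrong on 40 of 100.
    always_answers = ["correct"] * 60 + ["hallucinated"] * 40
    # Hypothetical system B abstains when unsure: fewer correct, far fewer wrong.
    abstains = ["correct"] * 50 + ["missing"] * 40 + ["hallucinated"] * 10
    print(f"always answers: {truthfulness(always_answers):.1%}")
    print(f"abstains:       {truthfulness(abstains):.1%}")
```

Under these assumptions the abstaining system scores 40% against 20% despite giving fewer correct answers, which is the 'hallucination vs. missing' trade-off the metric is designed to expose.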
Evaluation Highlights
  • The first-place team (db3) achieved 36.2% Truthfulness, far above the Llama 3 baseline (9.1%) and the LLM-only baseline (3.4%).
  • Industry SOTA systems (e.g., Perplexity.ai, Copilot) reached up to 50.6% Truthfulness, showing a large gap between systems built under the competition's constraints and unconstrained commercial systems.
  • Simple RAG baselines improved Truthfulness by only ~6-9% over LLM-only approaches, because careless summarization of noisy retrieval results introduced new hallucinations.
Breakthrough Assessment
8/10
CRAG creates a necessary, realistic standard for RAG by including mock APIs and dynamic questions. The findings on 'hallucination vs. missing' trade-offs are highly valuable for practical system design.