KDD Cup CRAG Competition: Systems, Findings, and Learnings

📝 Paper Summary

RAG Benchmarking Retrieval-Augmented Generation Competitions

The CRAG benchmark and KDD Cup competition evaluate RAG systems across diverse domains and dynamism levels, showing that RAG outperforms LLM-only baselines but that significant gaps remain on dynamic and complex queries.

Core Problem

Existing RAG benchmarks are limited in scope, diversity, and realism: they rarely cover API-based retrieval, dynamic temporal questions, or long-tail facts, which makes their measurements of hallucination and accuracy unreliable.

Why it matters:

LLMs struggle with hallucinations and lack up-to-date knowledge, necessitating robust RAG systems.
Current benchmarks (e.g., NQ, MS MARCO) lack structured retrieval (Knowledge Graphs) and dynamic questions (real-time data).
Industry RAG systems need to balance retrieval precision, answer faithfulness, and latency, but standard evaluations don't capture these trade-offs effectively.

Concrete Example: A user asks 'What is the opening stock price of landp last Friday?'. A standard LLM might hallucinate a number. A simple RAG might retrieve an old price. A robust system must parse 'last Friday' relative to the query time (02/28/2024) and use a finance API to get the specific real-time data.
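The temporal step in this example can be sketched in a few lines. This is a minimal illustration, not the benchmark's or any team's code: the helper name `last_weekday` is hypothetical, and it assumes "last Friday" means the most recent Friday strictly before the query date.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4


def last_weekday(query_date: date, target_weekday: int) -> date:
    """Return the most recent target weekday strictly before query_date."""
    days_back = (query_date.weekday() - target_weekday) % 7
    if days_back == 0:
        # The query date itself is the target weekday: go back a full week.
        days_back = 7
    return query_date - timedelta(days=days_back)


# Query time from the example above (02/28/2024, a Wednesday).
query_time = date(2024, 2, 28)
print(last_weekday(query_time, FRIDAY).isoformat())  # 2024-02-23
```

The resolved date, rather than the phrase "last Friday", is what a system would then pass to the finance API.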

Key Novelty

Comprehensive RAG (CRAG) Benchmark & Competition Framework

Introduces a benchmark with 4.4K QA pairs covering 5 domains (finance, sports, music, movie, open) and 4 dynamism levels (real-time to static).
Provides mock APIs for Knowledge Graph access alongside web search results, testing structured vs. unstructured retrieval integration.
Implements a hybrid evaluation whose 'Truthfulness' score heavily penalizes hallucinations (-1) and assigns 0 to 'I don't know', so admitting ignorance scores better than a wrong guess.

Evaluation Highlights

Top-1 winner (team db3) achieved 36.2% Truthfulness, significantly outperforming the Llama 3 baseline (9.1%) and LLM-only baseline (3.4%).
Industry SOTA systems (e.g., Perplexity.ai, Copilot) reached up to 50.6% Truthfulness, showing a large gap between competition constraints and unconstrained commercial systems.
Simple RAG baselines only improved Truthfulness by ~6-9% over LLM-only approaches because careless summarization introduced new hallucinations from noisy retrieval.

Breakthrough Assessment

8/10

CRAG creates a necessary, realistic standard for RAG by including mock APIs and dynamic questions. The findings on 'hallucination vs. missing' trade-offs are highly valuable for practical system design.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-Augmented Question Answering with access to web pages and mock Knowledge Graph APIs.

Inputs: Natural language question Q

Outputs: Natural language answer A

Pipeline Flow

Query Analysis/Routing (classify domain/intent)
Retrieval (Web Search + Knowledge Graph API calls)
Post-processing (Chunking, Filtering, Re-ranking)
Augmented Generation (LLM synthesis with anti-hallucination prompts)
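The four stages above can be sketched as a toy skeleton. Everything here is an assumption made for illustration: the function names, the keyword router, the word-overlap re-ranker, and the prompt wording are hypothetical stand-ins, not the components any competition team used. The final step only builds the prompt; calling an actual LLM is left out.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # "web" or "kg"
    text: str


def route(question: str) -> str:
    # Toy keyword router standing in for a learned domain/intent classifier.
    keywords = {"finance": ("stock", "price"), "movie": ("film", "director")}
    lowered = question.lower()
    for domain, words in keywords.items():
        if any(word in lowered for word in words):
            return domain
    return "open"


def retrieve(pages: list[str], kg_facts: list[str], size: int = 30) -> list[Chunk]:
    # Keep KG facts whole; split web pages into fixed-size word windows.
    chunks = [Chunk("kg", fact) for fact in kg_facts]
    for page in pages:
        words = page.split()
        for start in range(0, len(words), size):
            chunks.append(Chunk("web", " ".join(words[start:start + size])))
    return chunks


def rerank(question: str, chunks: list[Chunk], top_k: int = 3) -> list[Chunk]:
    # Drop chunks with no word overlap, then keep the top_k by overlap count.
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(c.text.lower().split())), c) for c in chunks]
    kept = [(score, c) for score, c in scored if score > 0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:top_k]]


def build_prompt(question: str, domain: str, context: list[Chunk]) -> str:
    # Anti-hallucination instruction: abstain when the context is insufficient.
    lines = [f"[{c.source}] {c.text}" for c in context]
    return (
        f"Domain: {domain}\n"
        "Answer using only the context. If the context is insufficient, "
        "reply exactly: I don't know.\n"
        "Context:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"
    )


def run_pipeline(question: str, pages: list[str], kg_facts: list[str]) -> str:
    domain = route(question)
    context = rerank(question, retrieve(pages, kg_facts))
    return build_prompt(question, domain, context)


# Made-up inputs purely for demonstration.
print(run_pipeline(
    "Who is the director of ExampleFilm?",
    ["The director of ExampleFilm is Jane Doe.", "Bananas are yellow fruit."],
    ["ExampleFilm director is Jane Doe"],
))
```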

System Modules

Knowledge Retrieval (Web) (Retrieval)

Fetch and process web pages

Model or implementation: Various (e.g., BeautifulSoup for parsing, BGE/BM25 for ranking)
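To make the ranking step concrete, here is a minimal sketch of Okapi BM25 scoring over already-parsed page text. It is an illustration under assumptions, not a team's implementation: tokenization is plain whitespace splitting, `k1` and `b` are commonly used default values, and the IDF uses the smoothed `log(... + 1)` variant.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up chunks purely for demonstration.
chunks = [
    "stock price opened higher on friday",
    "the movie director won an award",
    "friday night football scores",
]
scores = bm25_scores("stock price friday", chunks)
best = max(range(len(chunks)), key=scores.__getitem__)
print(chunks[best])  # stock price opened higher on friday
```

A dense model such as BGE would replace the lexical score with embedding similarity, while the surrounding sort-and-truncate logic stays the same.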

Knowledge Retrieval (KG) (Retrieval)

Query structured data via mock APIs

Model or implementation: LLM-based API caller (Llama-3-8B fine-tuned)

Augmented Generator

Synthesize answer from retrieved context

Model or implementation: Llama-3-8B-Instruct (fine-tuned or prompted)

Novel Architectural Elements

Regularized APIs: A wrapper layer (proposed by winner db3) that adds filtering/aggregation logic (e.g., MAX, AVG) on top of basic lookup APIs to handle complex queries.
Self-Correction Loops: Winning solutions used explicit confidence estimation or self-consistency checks to default to 'I don't know' if confidence was low.
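The regularized-API idea can be illustrated with a small wrapper. This is a hedged sketch, not db3's code: `get_price_history` is a hypothetical stand-in for a basic mock lookup API, the records are made up, and `regularized_query` is an invented name for the wrapper layer that adds filtering and aggregation.

```python
from statistics import mean
from typing import Callable, Optional

# Hypothetical data behind a basic lookup API (values are made up).
_FAKE_KG = {
    "EXAMPLE": [
        {"date": "2024-02-21", "open": 10.0},
        {"date": "2024-02-22", "open": 12.0},
        {"date": "2024-02-23", "open": 11.0},
    ]
}


def get_price_history(ticker: str) -> list[dict]:
    """Basic lookup: returns raw records, with no filtering or aggregation."""
    return _FAKE_KG.get(ticker, [])


AGGREGATES: dict[str, Callable[[list[float]], float]] = {
    "MAX": max,
    "MIN": min,
    "AVG": mean,
}


def regularized_query(
    ticker: str,
    field: str,
    op: str,
    where: Optional[Callable[[dict], bool]] = None,
) -> Optional[float]:
    """Wrapper that filters records and aggregates one field."""
    rows = get_price_history(ticker)
    if where is not None:
        rows = [r for r in rows if where(r)]
    values = [r[field] for r in rows]
    if not values:
        return None  # No data: the caller can map this to "I don't know".
    return AGGREGATES[op](values)


print(regularized_query("EXAMPLE", "open", "MAX"))  # 12.0
print(regularized_query("EXAMPLE", "open", "MIN",
                        where=lambda r: r["date"] >= "2024-02-22"))  # 11.0
```

With such a wrapper, the LLM only has to emit a small structured call (field, operator, filter) instead of computing the aggregate itself from raw records.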

Modeling

Base Model: Llama 3 (8B) or Llama 2 (mandatory constraint for competition)

Training Method: Supervised Fine-Tuning (SFT) for hallucination reduction

Adaptation: Full fine-tuning or LoRA (implied by hardware constraints)

Trainable Parameters: Not reported in the paper

Training Data:

Validation set (30% of CRAG)
Public test set (30% of CRAG)
Synthetic labels generated by LLMs to mark unanswerable queries as 'I don't know'

Compute: AWS G4dn.12xlarge (4x NVIDIA T4 16GB GPUs). Inference time limit: 30 seconds per example.

Comparison to Prior Work

vs. RGB/FreshLLMs: CRAG includes mock APIs and Knowledge Graph retrieval, not just web/text retrieval.
vs. RAGAS/ARES: CRAG contributes a comprehensive dataset, whereas RAGAS and ARES focus on evaluation-metric methodology.
vs. MS MARCO: CRAG is smaller (4.4K pairs) but covers structured data (KG), APIs, and dynamic temporal questions.

Limitations

Auto-evaluation using ChatGPT (GPT-3.5) had 94.5% accuracy compared to human evaluation, introducing some noise.
Hardware constraints (T4 GPUs) limited the use of larger open models (e.g., Llama-3-70B) for the competition submissions.
Fixed retrieval content (cached web pages) ensures fairness but doesn't test the 'Search' component of live RAG systems.

Reproducibility

Benchmark data and leaderboard are public. Winning teams released technical reports. Baseline models (Llama 2/3) are open weights. Competition prohibited proprietary models for submission generation.

📊 Experiments & Results

Evaluation Setup

Three tasks: (1) Retrieval Summarization (given 5 pages), (2) KG & Web RAG (given 5 pages + APIs), (3) End-to-end RAG (given 50 pages + APIs).

Benchmarks:

CRAG (Comprehensive RAG Benchmark) (Open-domain QA with Web/KG retrieval) [New]

Metrics:

Truthfulness (Perfect + 0.5*Acceptable - Hallucination)
Accuracy (Perfect + Acceptable)
Hallucination Rate
Missing Rate ('I don't know')
Statistical methodology: Not explicitly reported in the paper
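The metric definitions above reduce to simple arithmetic over per-answer grades. The sketch below assumes each answer has already been graded with one of four labels (the label strings and the function name `crag_metrics` are illustrative choices, not the benchmark's official evaluation code).

```python
from collections import Counter

# Per-answer scores implied by Truthfulness = Perfect + 0.5*Acceptable - Hallucination.
SCORES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "hallucination": -1.0}


def crag_metrics(labels: list[str]) -> dict[str, float]:
    """Aggregate per-answer grades into the four reported metrics."""
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    n = len(labels)
    rate = {name: counts[name] / n for name in SCORES}
    return {
        "truthfulness": sum(SCORES[name] * rate[name] for name in SCORES),
        "accuracy": rate["perfect"] + rate["acceptable"],
        "hallucination_rate": rate["hallucination"],
        "missing_rate": rate["missing"],
    }


# Made-up grades: 4 perfect, 2 acceptable, 2 missing, 2 hallucinated.
grades = ["perfect"] * 4 + ["acceptable"] * 2 + ["missing"] * 2 + ["hallucination"] * 2
metrics = crag_metrics(grades)
print(round(metrics["truthfulness"], 3))  # 0.3
```

Note that replacing the two hallucinated answers with "I don't know" would raise Truthfulness from 0.3 to 0.5 without changing Accuracy.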

Key Results

Truthfulness on the CRAG benchmark: baselines, the winning competition solution, and industry SOTA. Note that industry SOTA systems had access to live tools and stronger models, while competition teams were constrained to Llama-3-8B/Llama-2.

Benchmark	Metric	Baseline	This Paper	Δ
CRAG	Truthfulness	3.4% (LLM-only)	36.2%	+32.8 pts
CRAG	Truthfulness	9.1% (simple RAG baseline, Llama 3)	36.2%	+27.1 pts
CRAG	Truthfulness	50.6% (industry SOTA)	36.2%	-14.4 pts

Hallucination rate on the CRAG benchmark. Lower is better.

Benchmark	Metric	Baseline	This Paper	Δ
CRAG	Hallucination Rate	28.9% (LLM-only)	17.1%	-11.8 pts
CRAG	Hallucination Rate	31.6% (simple RAG baseline)	17.1%	-14.5 pts

Main Takeaways

Naive RAG implementations bring only a small Truthfulness gain over LLM-only baselines (9.1% vs. 3.4%) and can even increase the hallucination rate (31.6% vs. 28.9%), because the generator is sensitive to noise in the retrieved content.
The 'Truthfulness' metric design encourages systems to say 'I don't know' (score 0) rather than guess (score -1), driving winners to focus heavily on confidence estimation.
Finance domain questions were most challenging due to real-time/dynamic requirements (e.g., stock prices relative to query time).
Regularized APIs (adding logic like filtering/aggregation to basic lookups) significantly help with Knowledge Graph retrieval.
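The incentive created by the Truthfulness metric can be made concrete. If a fully correct answer scores +1, a hallucination scores -1, and 'I don't know' scores 0, then answering with probability p of being correct has expected score p - (1 - p) = 2p - 1, which beats abstaining only when p > 0.5 (ignoring partially correct 'acceptable' answers). The sketch below gates answers on a self-consistency check. It is an assumption-laden illustration: `sample_answer` is a hypothetical callable standing in for a sampled LLM generation, and agreement among samples is used as a rough proxy for confidence, not a calibrated probability.

```python
from collections import Counter
from typing import Callable


def answer_or_abstain(
    sample_answer: Callable[[], str],
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> str:
    """Sample several answers and abstain unless enough of them agree."""
    votes = Counter(sample_answer().strip().lower() for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    if count / n_samples < min_agreement:
        return "I don't know"
    return answer


# Made-up sampled generations purely for demonstration.
consistent = iter(["11.0", "11.0", " 11.0", "12.5", "11.0"])
scattered = iter(["11.0", "12.5", "9.8", "11.0", "10.1"])
print(answer_or_abstain(lambda: next(consistent)))  # 11.0
print(answer_or_abstain(lambda: next(scattered)))   # I don't know
```

The `min_agreement` threshold is a tunable assumption; setting it above 0.5 mirrors the break-even point derived above.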

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Knowledge Graphs (KG) and APIs
Large Language Models (LLMs)
Evaluation metrics for QA (Accuracy, Hallucination rate)

Key Terms

RAG: Retrieval-Augmented Generation. AI systems that answer questions by first retrieving relevant documents or structured data and then generating an answer grounded in what was retrieved

Truthfulness: A composite metric defined as Perfect_rate + 0.5*Acceptable_rate - Hallucination_rate, penalizing wrong answers more than 'I don't know'

Mock APIs: Simulated interfaces provided in the benchmark that mimic accessing structured Knowledge Graphs (e.g., getting a movie director)

Dynamism: Categorization of questions based on how frequently the answer changes (Real-time, Fast-changing, Slow-changing, Static)

Chain-of-thought: A prompting technique where the model generates intermediate reasoning steps before the final answer

Auto-eval: Using an LLM (e.g., ChatGPT) as a judge to grade answers as accurate, missing, or hallucinated

Knowledge Graph (KG): A structured representation of facts (entities and relationships) used for precise data retrieval

BGE: BAAI General Embedding, a popular family of dense text-embedding models used for retrieval and ranking