KILT: a Benchmark for Knowledge Intensive Language Tasks

📝 Paper Summary

Benchmark datasets, Metrics and evaluation, Modularized RAG pipeline

KILT unifies eleven knowledge-intensive datasets (QA, fact checking, entity linking, slot filling, dialogue) onto a single Wikipedia snapshot to enable comparing task-agnostic retrieval and generation models.

Core Problem

Knowledge-intensive NLP tasks (like QA and fact checking) typically use different, incompatible knowledge sources and pre-processing formats, making it impossible to evaluate general-purpose retrievers or representations across tasks.

Why it matters:

Researchers cannot assess if a single knowledge representation (e.g., dense index) works across diverse tasks if every dataset uses a different Wikipedia version
Comparing architectures is computationally expensive because each task currently requires indexing different large-scale corpora
Task-specific engineering prevents the emergence of general-purpose memory architectures

Concrete Example: A model pre-trained on the 2018 Wikipedia dump for Open-Domain QA cannot be fairly evaluated on FEVER (fact checking) if FEVER relies on a 2017 dump, as the underlying evidence pages may have changed, moved, or been deleted.

Key Novelty

Unified In-KB Benchmark (KILT)

Maps 11 distinct datasets (spanning 5 tasks) to a single, shared snapshot of Wikipedia (5.9M articles), ensuring all ground truth evidence is available in one consistent corpus
Introduces provenance-aware metrics that only award accuracy points if the model also retrieves the correct supporting evidence (text span or page)
Provides a common interface (JSON lines) where every instance includes input, output, and a provenance span ID from the shared knowledge source
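To make the common interface concrete, here is a minimal sketch of reading one JSON-lines instance. The field names (`id`, `input`, `output`, `answer`, `provenance`, `wikipedia_id`, `start_paragraph_id`) and the sample values are illustrative assumptions based on the description above, not a verified copy of the released schema.

```python
import json

# Hypothetical KILT-style record. Field names and values are assumptions that
# follow the input / output / provenance layout described in the text.
SAMPLE_LINE = json.dumps({
    "id": "example-0",
    "input": "Where was Albert Einstein educated?",
    "output": [
        {
            "answer": "ETH Zurich",
            "provenance": [
                {"wikipedia_id": "12345", "title": "Albert Einstein",
                 "start_paragraph_id": 2, "end_paragraph_id": 2}
            ],
        }
    ],
})

def parse_instance(line):
    """Return (input text, gold answers, gold provenance page ids) for one line."""
    record = json.loads(line)
    answers = [o["answer"] for o in record["output"] if "answer" in o]
    pages = {
        p["wikipedia_id"]
        for o in record["output"]
        for p in o.get("provenance", [])
    }
    return record["input"], answers, pages

query, answers, pages = parse_instance(SAMPLE_LINE)
print(query, answers, sorted(pages))
```

Keeping provenance next to each answer lets one loader serve every task: only the meaning of `input` and `answer` changes between, say, fact checking and dialogue.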

Architecture

The common KILT interface applied to five different tasks (Fact Checking, Entity Linking, Slot Filling, QA, Dialogue).

Evaluation Highlights

RAG (Retrieval-Augmented Generation) achieves state-of-the-art results on Open-Domain QA and Fact Checking, significantly outperforming task-specific baselines like NSMN on FEVER (about +20 accuracy points)
Jointly training a dense retriever (DPR) on all KILT tasks (Multi-task DPR) improves retrieval R-Precision over single-task DPR by up to +56 points (T-REx: 13.26 to 69.46)
Generative models (BART) perform surprisingly well on Entity Linking (77.55% accuracy on AIDA CoNLL-YAGO) without explicit retrieval, solely by generating the correct entity title

Breakthrough Assessment

9/10

Foundational benchmark that standardized evaluation for retrieval-augmented generation. It provides a unified testbed for comparing general-purpose retrieval-based models such as DPR and RAG across tasks.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-Intensive Language Tasks where an input query requires access to external knowledge to produce an output

Inputs: Natural language query x (e.g., question, claim, conversation history, text chunk)

Outputs: Output y (e.g., answer, label, entity, response) AND provenance set P (list of Wikipedia pages/spans)

Pipeline Flow

Input Processing (Query formulation)
Retrieval (Dense or Sparse search over 5.9M Wikipedia pages)
Reader/Generator (Prediction based on retrieved context)
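The three stages above can be sketched as a toy retrieve-then-read loop. This is a minimal illustration under stated assumptions: the three-page corpus is made up, the sparse scorer is a simplified TF-IDF, and the "reader" is a placeholder that returns the top page's text instead of running BART or BERT.

```python
import math
from collections import Counter

# Toy stand-in for the shared snapshot: page title -> text (made-up content).
PAGES = {
    "Paris": "paris is the capital city of france",
    "Berlin": "berlin is the capital city of germany",
    "Mount Everest": "mount everest is the highest mountain on earth",
}

def tokenize(text):
    return text.lower().split()

def build_idf(pages):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(pages)
    df = Counter(tok for text in pages.values() for tok in set(tokenize(text)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}

def retrieve(query, pages, idf, k=1):
    """Rank pages by the summed TF-IDF weight of the query tokens they contain."""
    scores = {}
    for title, text in pages.items():
        tf = Counter(tokenize(text))
        scores[title] = sum(tf[tok] * idf.get(tok, 0.0) for tok in tokenize(query))
    return sorted(scores, key=scores.get, reverse=True)[:k]

def read(query, retrieved, pages):
    """Placeholder reader: returns the top page's text together with its provenance."""
    top = retrieved[0]
    return {"output": pages[top], "provenance": [top]}

def answer(query):
    q = query.strip().rstrip("?")          # 1. input processing
    idf = build_idf(PAGES)
    hits = retrieve(q, PAGES, idf, k=1)    # 2. retrieval
    return read(q, hits, PAGES)            # 3. reader / generator

print(answer("What is the capital of France?"))
```

Returning the provenance alongside the output mirrors the benchmark's requirement that a system report where its answer came from.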

System Modules

Retriever

Identify relevant Wikipedia pages/passages from the shared snapshot

Model or implementation: DPR (Dense Passage Retrieval) or TF-IDF
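For the dense option, retrieval reduces to ranking passages by the similarity between a query vector and pre-computed passage vectors. The sketch below assumes inner-product similarity and uses hard-coded toy vectors with hypothetical passage ids; a real DPR setup would obtain these vectors from trained query and passage encoders.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hard-coded toy embeddings standing in for passage-encoder outputs (assumption).
PASSAGE_VECTORS = {
    "passage_about_physics": [0.9, 0.1, 0.0],
    "passage_about_cooking": [0.0, 0.2, 0.9],
    "passage_about_history": [0.1, 0.9, 0.1],
}

def search(query_vector, passage_vectors, k=2):
    """Return the ids of the k passages with the highest inner product."""
    ranked = sorted(
        passage_vectors,
        key=lambda pid: dot(query_vector, passage_vectors[pid]),
        reverse=True,
    )
    return ranked[:k]

# A query vector that points mostly along the first ("physics") dimension.
print(search([1.0, 0.2, 0.0], PASSAGE_VECTORS, k=2))
```

At the scale of millions of passages, the exhaustive sort would normally be replaced by an approximate nearest-neighbour index; the scoring rule stays the same.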

Generator / Reader

Produce the final output (answer/label) conditioning on retrieved text

Model or implementation: BART-Large or BERT-Base

Novel Architectural Elements

Unified mapping strategy: Algorithms to align disparate dataset ground-truths (spans, entities) to a single 2019/08/01 Wikipedia snapshot using BLEU-based span matching and redirection handling
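A minimal sketch of this mapping idea follows, under several assumptions: `simple_bleu` is a simplified BLEU (clipped unigram and bigram precision with a brevity penalty) standing in for whatever BLEU implementation the authors used, and `map_provenance`, the redirect table, and the toy snapshot are hypothetical names and data introduced for illustration. The 0.5 threshold is the one quoted in this summary.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, candidate, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions with a
    brevity penalty. A stand-in for a full BLEU implementation."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(sum(log_precisions) / max_n)

def map_provenance(title, span, snapshot, redirects, threshold=0.5):
    """Resolve redirects, then pick the snapshot paragraph that best matches
    the original evidence span. Returns None when nothing is close enough."""
    title = redirects.get(title, title)
    paragraphs = snapshot.get(title, [])
    if not paragraphs:
        return None
    best_id, best_score = max(
        ((i, simple_bleu(span, p)) for i, p in enumerate(paragraphs)),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None
    return {"title": title, "paragraph_id": best_id, "score": best_score}

# Toy data: an old page title that now redirects, and a two-paragraph page.
SNAPSHOT = {"Albert Einstein": ["Einstein was born in Ulm .",
                                "He studied at ETH Zurich in Switzerland ."]}
REDIRECTS = {"Einstein": "Albert Einstein"}
print(map_provenance("Einstein", "He studied at ETH Zurich in Switzerland .",
                     SNAPSHOT, REDIRECTS))
```

Instances for which the function returns `None` correspond to the cases the benchmark cannot ground in the shared snapshot.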

Modeling

Base Model: BART (large), BERT (base, large), T5 (base), DPR

Training Method: Supervised training on KILT task data; RAG trains retriever and generator end-to-end

Objective Functions:

Purpose: Maximize log-likelihood of the correct output given the input (and retrieved documents).

Formally: Standard seq2seq cross-entropy loss.
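Written out, the objective takes the usual token-level form. The notation is an assumption introduced here: x is the input, z the retrieved text (dropped for models without retrieval), y = (y_1, ..., y_T) the target sequence, and θ the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_{<t},\, x,\, z\right)
```

Minimizing this loss is equivalent to maximizing the log-likelihood of the correct output stated above.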

Adaptation: Fine-tuning

Trainable Parameters: Varies by baseline (e.g., RAG: 626M parameters + 15B index)

Training Data:

Merged training sets from 11 datasets
Provenance mapping: if the original evidence cannot be found in the KILT snapshot (BLEU < 0.5), the instance is removed from the dev/test sets (approximately 18% removed)

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. NSMN: RAG combines a generative approach with dense retrieval, whereas NSMN relies on discriminative matching
vs. BLINK: KILT's seq2seq baselines treat entity linking as a generation task, whereas BLINK treats it as classification/ranking
vs. REALM [not cited in paper]: REALM pre-trains the retriever, whereas KILT focuses on benchmarking downstream performance across diverse tasks

Limitations

In-KB assumption: KILT removes unanswerable instances, which are crucial for real-world robustness
Provenance mapping loss: Approx 18% of original dev/test data discarded because evidence couldn't be mapped to the specific snapshot
Strict provenance metrics: KILT scores may penalize correct answers if the model retrieves a valid duplicate page not listed in gold provenance (mitigated partially by annotation campaign)

Reproducibility

Code: publicly available at https://github.com/facebookresearch/KILT. The repository provides scripts for mapping, data loaders, evaluation, and baseline implementations (DPR, RAG, BART). Pre-processed data is available on HuggingFace.

📊 Experiments & Results

Evaluation Setup

Knowledge-intensive tasks using a fixed Wikipedia snapshot (2019/08/01)

Benchmarks:

FEVER (Fact Checking)
AIDA CoNLL-YAGO (AY2) (Entity Linking)
WNED-WIKI (WnWi) (Entity Linking)
WNED-CWEB (WnCw) (Entity Linking)
T-REx (Slot Filling)
Zero Shot RE (zsRE) (Slot Filling)
Natural Questions (NQ) (Open Domain QA)
HotpotQA (HoPo) (Open Domain QA)
TriviaQA (TQA) (Open Domain QA)
ELI5 (Open Domain QA)
Wizard of Wikipedia (WoW) (Dialogue)

Metrics:

Accuracy
Exact Match (EM)
ROUGE-L
F1-score
R-precision (Page-level retrieval)
KILT-AC / KILT-EM / KILT-RL (Provenance-aware metrics)
Statistical methodology: Not explicitly reported in the paper
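The relationship between the plain and provenance-aware metrics listed above can be sketched as follows. This is a simplified illustration, not the official evaluation script: it assumes a single gold provenance set per instance, a light-weight answer normalization, and that "retrieving the correct evidence" means page-level R-precision equal to 1. The instance field names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace (a simplified normalization)."""
    return " ".join(text.lower().split())

def exact_match(prediction, gold_answers):
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def r_precision(retrieved_pages, gold_pages):
    """Fraction of the R gold pages found in the top-R retrieved pages."""
    gold = set(gold_pages)
    r = len(gold)
    if r == 0:
        return 0.0
    return len(set(retrieved_pages[:r]) & gold) / r

def kilt_em(prediction, gold_answers, retrieved_pages, gold_pages):
    """EM gated by provenance: counts only when R-precision is 1 (assumption)."""
    if r_precision(retrieved_pages, gold_pages) < 1.0:
        return 0.0
    return exact_match(prediction, gold_answers)

def evaluate(instances):
    n = len(instances)
    if n == 0:
        return {"em": 0.0, "r_precision": 0.0, "kilt_em": 0.0}
    em = sum(exact_match(i["prediction"], i["gold_answers"]) for i in instances) / n
    rp = sum(r_precision(i["retrieved"], i["gold_pages"]) for i in instances) / n
    kem = sum(kilt_em(i["prediction"], i["gold_answers"],
                      i["retrieved"], i["gold_pages"]) for i in instances) / n
    return {"em": em, "r_precision": rp, "kilt_em": kem}

# Both answers are right, but only the first ranks the gold page at the top.
INSTANCES = [
    {"prediction": "Paris", "gold_answers": ["paris"],
     "retrieved": ["Paris", "France"], "gold_pages": ["Paris"]},
    {"prediction": "Berlin", "gold_answers": ["Berlin"],
     "retrieved": ["Germany", "Berlin"], "gold_pages": ["Berlin"]},
]
print(evaluate(INSTANCES))
```

In this toy case the second instance has a correct answer with the gold page ranked second, so it counts toward EM but not toward the provenance-gated score.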

Key Results

Downstream performance comparison showing generative models with retrieval (RAG) generally outperforming or matching task-specific and non-retrieval baselines.

Benchmark	Metric	Baseline	This Paper	Δ
FEVER	Accuracy	66.1	86.31	+20.21
Natural Questions (NQ)	Exact Match	21.75	44.39	+22.64
T-REx	Accuracy	45.06	59.2	+14.14
AIDA CoNLL-YAGO (AY2)	Accuracy	81.54	72.62	-8.92

Retrieval performance (R-Precision) demonstrates the benefit of Multi-task training for dense retrievers.

Benchmark	Metric	Baseline	This Paper	Δ
FEVER	R-Precision	55.33	74.48	+19.15
T-REx	R-Precision	13.26	69.46	+56.20

KILT Score (provenance-aware) results reveal the gap between answering correctly and retrieving correct evidence.

Benchmark	Metric	Baseline	This Paper	Δ
FEVER	KILT-Accuracy	41.88	53.45	+11.57
Natural Questions (NQ)	KILT-EM	32.69	32.69	0.00

Main Takeaways

Explicit retrieval (RAG/DPR) is essential: Models with explicit access to the knowledge source outperform implicit knowledge models (BART/T5) on almost all tasks, especially Open-Domain QA and Fact Checking.
Multi-task retrieval helps: Training a single DPR model on all KILT datasets drastically improves retrieval performance compared to single-task training, suggesting strong synergies between tasks.
Generative EL is viable: Seq2seq models perform competitively on Entity Linking by generating titles, a novel finding compared to traditional classification approaches.
Provenance gap: There is a significant drop between standard metrics and KILT scores, indicating that models often guess correctly without retrieving the correct evidence.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Open-Domain Question Answering
Familiarity with Dense Passage Retrieval (DPR) and RAG
Knowledge of Wikipedia-based datasets (FEVER, Natural Questions, etc.)

Key Terms

Provenance: The specific evidence (text span or document ID) in the knowledge source that justifies a model's prediction

KILT score: A strict metric (e.g., KILT-EM) that counts a prediction as correct ONLY if the model also retrieves the correct provenance evidence

In-KB: A setting where the evidence required to answer an instance is guaranteed to be present in the provided knowledge source

R-precision: A retrieval metric measuring the proportion of relevant documents in the top-R retrieved results, where R is the number of relevant documents

Slot Filling: The task of extracting attributes or relations for entities from text (e.g., Subject: Einstein, Relation: educated_at, Object: ETH Zurich)

Entity Linking: The task of assigning a unique Wikipedia page to entity mentions in text

Fact Checking: Verifying a claim against evidence, usually resulting in a Supported/Refuted label

DPR: Dense Passage Retrieval, a method that uses a dual-encoder (bi-encoder) architecture to retrieve relevant passages based on embedding similarity

RAG: Retrieval-Augmented Generation—a model that combines a neural retriever with a sequence-to-sequence generator

BART: A denoising autoencoder for pre-training sequence-to-sequence models