← Back to Paper List

KILT: a Benchmark for Knowledge Intensive Language Tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, Vassilis Plachouras, Tim Rocktäschel, Sebastian Riedel
Facebook AI Research, University College London, LORIA, University of Cambridge, HuggingFace, University of Amsterdam
NAACL (2021)
RAG Benchmark Factuality QA KG

📝 Paper Summary

Benchmark datasets Metrics and evaluation Modularized RAG pipeline
KILT unifies eleven knowledge-intensive datasets (QA, fact checking, entity linking, slot filling, dialogue) onto a single Wikipedia snapshot to enable comparing task-agnostic retrieval and generation models.
Core Problem
Knowledge-intensive NLP tasks (like QA and fact checking) typically use different, incompatible knowledge sources and pre-processing formats, making it impossible to evaluate general-purpose retrievers or representations across tasks.
Why it matters:
  • Researchers cannot assess if a single knowledge representation (e.g., dense index) works across diverse tasks if every dataset uses a different Wikipedia version
  • Comparing architectures is computationally expensive because each task currently requires indexing different large-scale corpora
  • Task-specific engineering prevents the emergence of general-purpose memory architectures
Concrete Example: A model pre-trained on the 2018 Wikipedia dump for Open-Domain QA cannot be fairly evaluated on FEVER (fact checking) if FEVER relies on a 2017 dump, as the underlying evidence pages may have changed, moved, or been deleted.
Key Novelty
Unified In-KB Benchmark (KILT)
  • Maps 11 distinct datasets (spanning 5 tasks) to a single, shared snapshot of Wikipedia (5.9M articles), ensuring all ground truth evidence is available in one consistent corpus
  • Introduces provenance-aware metrics that only award accuracy points if the model also retrieves the correct supporting evidence (text span or page)
  • Provides a common interface (JSON lines) where every instance includes input, output, and a provenance span ID from the shared knowledge source
Architecture
Architecture Figure Figure 1
The common KILT interface applied to five different tasks (Fact Checking, Entity Linking, Slot Filling, QA, Dialogue).
Evaluation Highlights
  • RAG (Retrieval-Augmented Generation) achieves state-of-the-art results on Open-Domain QA and Fact Checking, significantly outperforming task-specific baselines like NSMN on FEVER (+20% accuracy)
  • Jointly training a dense retriever (DPR) on all KILT tasks (Multi-task DPR) improves retrieval R-Precision by up to +44 points compared to single-task DPR
  • Generative models (BART) perform surprisingly well on Entity Linking (77.55% accuracy on AIDA CoNLL-YAGO) without explicit retrieval, solely by generating the correct entity title
Breakthrough Assessment
9/10
Foundational benchmark that standardized evaluation for retrieval-augmented generation. It enabled the development of general-purpose retrievers like DPR and RAG by providing a unified testbed.
×