LLMs struggle to learn long-tail knowledge

📝 Paper Summary

Internalization through pre-training and mid-training
Factuality analysis via training data statistics

LLM accuracy on a factual question depends strongly, and causally, on how many pre-training documents mention the relevant fact, which makes rare knowledge inherently difficult to learn through scaling alone.

Core Problem

It is unclear why LLMs succeed on some factual questions but fail on others, and whether they can effectively learn 'long-tail' knowledge that appears rarely in pre-training data.

Why it matters:

Understanding the source of LLM capabilities is crucial for predicting performance and limitations on downstream tasks
Blindly scaling model size or data size may be inefficient if the relationship between data frequency and accuracy is log-linear, because each fixed gain in accuracy then requires a multiplicative increase in data or parameters
Identifying failure modes for rare facts motivates architectural changes like retrieval augmentation over simple scaling

Concrete Example: For the question 'In what city was the poet Dante born?', an LLM might answer correctly if 'Dante' and 'Florence' co-occur frequently in its training data (e.g., >100 times), but fail if they co-occur rarely (<10 times), despite knowing who Dante is.

Key Novelty

Entity-Linked Document Counting Analysis

Systematically counts 'relevant documents' in massive pre-training corpora (e.g., The Pile, C4) by finding co-occurrences of question and answer entities (e.g., 'Dante' + 'Florence')
Establishes a causal link (not just correlation) between these counts and QA accuracy by re-training a model on a dataset where specific relevant documents were deleted

Architecture

The pipeline for identifying relevant documents using entity linking.

Evaluation Highlights

BLOOM-176B accuracy on TriviaQA jumps from ~25% to >55% as the number of relevant pre-training documents increases from 10 to 10,000
Scaling laws indicate a model would need 10^18 (one quintillion) parameters to reach competitive accuracy on rare facts (<100 documents)
Retrieval augmentation (BM25) significantly boosts accuracy on rare facts, breaking the dependence on pre-training frequency

Breakthrough Assessment

9/10

A foundational study that quantitatively explains 'why' LLMs know what they know. The finding that scaling is a log-linear dead end for rare facts is a crucial insight for the field.

⚙️ Technical Details

Problem Definition

Setting: Few-shot closed-book question answering (CBQA) grounded in pre-training data statistics

Inputs: Natural language question q

Outputs: Predicted answer a

Pipeline Flow

Pre-training Data Entity Linking
QA Dataset Entity Linking
Document Counting
Correlation/Causal Analysis

System Modules

Pre-training Entity Linker (Data Processing)

Identify entities in massive pre-training corpora

Model or implementation: DBpedia Spotlight Entity Linker

QA Entity Extractor (Data Processing)

Identify salient entities in questions and answers

Model or implementation: DBpedia Spotlight Entity Linker

Co-occurrence Counter

Count documents containing both Q and A entities

Model or implementation: Deterministic counting algorithm
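
The counting step described by these modules can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes entity linking has already produced a set of entity IDs per document, and the function names, entity IDs, and toy corpus are all hypothetical.

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each entity ID to the set of document IDs that mention it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def count_relevant_docs(index, question_entity, answer_entity):
    """Count documents in which the question and answer entities co-occur."""
    question_docs = index.get(question_entity, set())
    answer_docs = index.get(answer_entity, set())
    return len(question_docs & answer_docs)


# Toy corpus; the entity IDs are made up for illustration.
doc_entities = {
    "doc1": {"Dante_Alighieri", "Florence"},
    "doc2": {"Dante_Alighieri", "Divine_Comedy"},
    "doc3": {"Florence", "Italy"},
    "doc4": {"Dante_Alighieri", "Florence", "Italy"},
}
index = build_entity_index(doc_entities)
print(count_relevant_docs(index, "Dante_Alighieri", "Florence"))  # 2
```

Building the inverted index once and intersecting posting sets keeps each lookup cheap, which matters when the same corpus is queried for every QA pair in a benchmark.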

Novel Architectural Elements

Large-scale entity-linking pipeline applied to full pre-training corpora (2.1TB of data) to ground downstream performance in training data statistics

Modeling

Base Model: Analyzes multiple families: BLOOM (560M-176B), GPT-Neo (125M-20B), GPT-3 (Ada-Davinci)

Training Method: Standard autoregressive language modeling (pre-training)

Training Data:

C4 (305GB)
The Pile (825GB)
ROOTS (490GB English subset)
OpenWebText (39GB)
Wikipedia (Dec 2018 dump)

Key Hyperparameters:

counterfactual_model_size: 4.8B parameters
training_epochs: 1

Compute: Entity linking 2.1TB of data took 3 weeks on a 128-CPU-core machine. Model training (for causal experiment) involved training a 4.8B parameter model for 1 epoch.
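
The causal experiment needs a training set from which documents relevant to chosen questions have been removed. A rough sketch of that filtering step is shown below; the data layout (dicts keyed by document ID) and the function name are assumptions made for illustration, and the paper's actual selection procedure may differ.

```python
def remove_relevant_docs(corpus, doc_entities, qa_pairs):
    """Return the corpus without documents relevant to any listed QA pair.

    corpus: dict mapping doc_id -> text
    doc_entities: dict mapping doc_id -> set of linked entity IDs
    qa_pairs: iterable of (question_entity, answer_entity) tuples
    """
    qa_pairs = list(qa_pairs)
    kept = {}
    for doc_id, text in corpus.items():
        entities = doc_entities.get(doc_id, set())
        # A document is relevant if it contains both entities of some pair.
        is_relevant = any(q in entities and a in entities for q, a in qa_pairs)
        if not is_relevant:
            kept[doc_id] = text
    return kept


# Hypothetical toy data.
corpus = {
    "doc1": "Dante was born in Florence.",
    "doc2": "Dante wrote the Divine Comedy.",
    "doc3": "Florence is a city in Italy.",
}
doc_entities = {
    "doc1": {"Dante_Alighieri", "Florence"},
    "doc2": {"Dante_Alighieri", "Divine_Comedy"},
    "doc3": {"Florence", "Italy"},
}
filtered = remove_relevant_docs(corpus, doc_entities, [("Dante_Alighieri", "Florence")])
print(sorted(filtered))  # ['doc2', 'doc3']
```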

Comparison to Prior Work

vs. Elazar et al.: Focuses specifically on the prevalence of knowledge (relevant document counts) rather than structural heuristics; includes re-training experiments
vs. Mallen et al.: Uses direct entity linking on pre-training data instead of proxies like Wikipedia popularity; demonstrates causal link via data deletion

Limitations

Entity linking pipeline is imperfect (approx. 60% precision on TriviaQA samples)
Proxy estimation used for GPT-3 training data (OpenWebText scaled up) introduces uncertainty
Analysis restricted to English subsets of multilingual corpora (ROOTS)
Focuses only on factoid QA; the findings may not generalize to other tasks such as reasoning

Reproducibility

Code: https://github.com/nkandpa2/long_tail_knowledge

📊 Experiments & Results

Evaluation Setup

Few-shot (4-shot) closed-book QA, measuring Exact Match (EM) accuracy against relevant document counts

Benchmarks:

TriviaQA (Open-domain Factoid QA)
Natural Questions (Open-domain Factoid QA)

Metrics:

Exact Match (EM) Accuracy
Relevant Document Count (Independent Variable)
Statistical methodology: Not explicitly reported in the paper
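
For readers unfamiliar with the metric, here is a minimal sketch of Exact Match scoring. The normalization shown (lowercasing, stripping punctuation and English articles) follows the widely used SQuAD-style convention and is an assumption here; this summary does not state which normalization the paper applies.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_accuracy(predictions, references):
    """Fraction of predictions that exactly match one of their references."""
    matches = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return sum(matches) / len(matches) if matches else 0.0


print(em_accuracy(["The Florence.", "Rome"], [["Florence"], ["Ravenna"]]))  # 0.5
```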

Key Results

Correlation analysis shows a strong log-linear relationship between the number of relevant documents in pre-training data and QA accuracy across multiple model families.

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	EM Accuracy	25.0	55.0	+30.0
Natural Questions	Model Size (parameters)	10^11	10^18	≈ +10^18
TriviaQA	Accuracy Drop	0.14	0.02	-0.12
Natural Questions	EM Accuracy	0.05	0.28	+0.23

Experiment Figures

Plot of QA Accuracy vs. Number of Relevant Pre-training Documents for BLOOM models on TriviaQA.
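
A plot of this kind is typically produced by grouping questions into logarithmic bins of relevant document count and computing accuracy per bin. The sketch below shows one way to do that; the bin scheme (powers of ten) and the sample records are assumptions for illustration, not the paper's data.

```python
from collections import defaultdict


def accuracy_by_count_bin(records):
    """Group (relevant_doc_count, is_correct) records into power-of-ten bins.

    Bin k covers counts in [10^k, 10^(k+1)); a count of 0 goes to bin -1.
    Returns a dict mapping bin -> accuracy within that bin.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for count, is_correct in records:
        # Digit count gives floor(log10(count)) exactly for positive integers.
        k = -1 if count == 0 else len(str(int(count))) - 1
        totals[k] += 1
        correct[k] += int(is_correct)
    return {k: correct[k] / totals[k] for k in sorted(totals)}


# Hypothetical per-question results.
records = [(5, False), (8, True), (50, False), (70, True), (90, True), (4000, True)]
for k, acc in accuracy_by_count_bin(records).items():
    print(f"10^{k} to 10^{k + 1}: accuracy {acc:.2f}")
```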

Scaling trend line for rare fact learning on Natural Questions.
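
An extrapolation like this trend line can be reproduced in spirit by fitting accuracy against the logarithm of parameter count and solving for the size that reaches a target accuracy. The sketch below uses synthetic numbers chosen only to make the arithmetic easy to follow; they are not measurements from the paper, and the function names are hypothetical.

```python
import math


def fit_log_linear(xs, ys):
    """Least-squares fit of y = intercept + slope * log10(x)."""
    logs = [math.log10(x) for x in xs]
    n = len(logs)
    mean_x = sum(logs) / n
    mean_y = sum(ys) / n
    sxx = sum((lx - mean_x) ** 2 for lx in logs)
    sxy = sum((lx - mean_x) * (y - mean_y) for lx, y in zip(logs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def size_needed(intercept, slope, target_accuracy):
    """Invert the fit: the x at which the fitted line reaches the target."""
    return 10 ** ((target_accuracy - intercept) / slope)


# Synthetic (model size, accuracy on rare-fact questions) points.
model_sizes = [1e9, 1e10, 1e11]
accuracies = [0.02, 0.05, 0.08]

intercept, slope = fit_log_linear(model_sizes, accuracies)
needed = size_needed(intercept, slope, target_accuracy=0.20)
print(f"slope per 10x parameters: {slope:.3f}")
print(f"parameters needed for 20% accuracy: {needed:.2e}")
```

With a slope this shallow, every additional few points of accuracy costs another order of magnitude in parameters, which is why the extrapolated sizes become impractical so quickly.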

Main Takeaways

Strong log-linear relationship: QA accuracy rises roughly linearly with the logarithm of the number of relevant documents in the pre-training data.
Causal link confirmed: Removing relevant documents during training directly degrades performance on associated questions.
Scaling is inefficient for the long tail: To learn rare facts via scaling alone requires prohibitively large models (e.g., 10^18 parameters).
Retrieval is the solution: Retrieval-augmented models largely mitigate the dependence on pre-training frequency, maintaining high accuracy even for rare facts.
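
The log-linear relationship in these takeaways can be written compactly. The symbols below are introduced here for illustration and do not come from the paper: N is the number of relevant documents, and α and β are fitted constants.

```latex
\mathrm{Acc}(N) \approx \alpha + \beta \log_{10} N
\quad\Longrightarrow\quad
\mathrm{Acc}(cN) - \mathrm{Acc}(N) \approx \beta \log_{10} c
```

Under this form, multiplying the number of relevant documents by a factor c adds only a constant β log₁₀ c to accuracy, so a fact seen ten times as often gains just β points.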

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model pre-training
Knowledge of entity linking and co-occurrence statistics
Familiarity with open-domain Question Answering (QA) benchmarks

Key Terms

Long-tail knowledge: Facts or information that appear very rarely in the pre-training dataset

Relevant document: A document in the pre-training corpus that contains both the salient entity from the question and the salient entity from the answer

Entity linking: The process of identifying entities (people, places, things) in text and linking them to a unique identifier in a knowledge base (e.g., DBpedia)

BM25: Best Matching 25, a probabilistic information retrieval function used to rank documents based on query terms
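
To make the ranking function concrete, here is a small self-contained sketch of Okapi BM25 scoring. The parameter defaults (k1=1.5, b=0.75), the smoothed IDF variant, and whitespace tokenization are common choices assumed for illustration; the retrieval setup used in the paper may differ.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "dante was born in florence".split(),
    "florence is a city in italy".split(),
    "the pile is a text corpus".split(),
]
query = "where was dante born".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best)  # 0
```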

Exact Match (EM): A metric that measures whether the predicted answer string exactly matches one of the ground-truth answers

Counterfactual re-training: An experimental method where a model is trained on a modified dataset (e.g., with specific documents removed) to test causal hypotheses