When not to trust LMs: Investigating effectiveness of parametric and non-parametric memories

📝 Paper Summary

Modularized RAG pipeline RAG triggering

Large language models struggle to memorize long-tail factual knowledge, but a simple adaptive method can predict when to retrieve external information based on entity popularity, improving efficiency and accuracy.

Core Problem

LMs fail to encode long-tail factual knowledge in their parameters, and scaling model size does not significantly improve performance on less popular entities.

Why it matters:

Relying solely on parametric memory requires prohibitively large models, yet the stored knowledge can still become obsolete, and models still hallucinate about rare entities
Always retrieving external knowledge is computationally expensive and can hurt performance on popular questions where the model already knows the answer (due to misleading retrieval contexts)

Concrete Example: For the 4,000 least popular questions in PopQA, scaling from GPT-Neo 6B to GPT-3 davinci-003 only improves accuracy from 16% to 19%, whereas retrieval augmentation can boost it significantly.

Key Novelty

Adaptive Retrieval based on Entity Popularity

Identifies a strong correlation between subject entity popularity (measured by Wikipedia page views) and LM memorization accuracy
Proposes a threshold-based system: if the entity in the question is popular, use the LM's internal memory; if it is rare (long-tail), trigger retrieval
Introduces PopQA, a new dataset specifically designed to probe knowledge across a wide spectrum of entity popularities

Architecture

Conceptual flowchart of Adaptive Retrieval (text-based reconstruction)

Evaluation Highlights

Retrieval-augmented GPT-Neo 2.7B outperforms GPT-3 davinci-003 on the 4,000 least popular PopQA questions
Adaptive Retrieval improves PopQA accuracy by up to 10 percentage points compared to non-retrieval baselines, while costing less at inference time than always retrieving
Scaling model size (from 1.3B to 175B) yields negligible improvement on long-tail questions (staying below 20% accuracy for the least popular bin)

Breakthrough Assessment

8/10

Provides strong empirical evidence that scaling model size alone does little to improve memorization of long-tail facts, and offers a practical, efficient solution (Adaptive Retrieval) that balances cost and accuracy.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (QA) probing factual knowledge

Inputs: Natural language question q containing a subject entity s and relationship r

Outputs: Answer string a
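
To make the setting concrete, the sketch below represents one entity-centric example as a subject, a relationship, and a set of acceptable answers, with the question produced from a template. It is illustrative only: the `FactoidExample` class, the `TEMPLATES` wording, and the page-view number are assumptions, not the dataset's actual schema or values.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical question templates keyed by relationship type (wording is assumed).
TEMPLATES = {
    "occupation": "What is {subj}'s occupation?",
    "capital": "What is the capital of {subj}?",
}


@dataclass(frozen=True)
class FactoidExample:
    subject: str               # subject entity s
    relationship: str          # relationship r
    answers: Tuple[str, ...]   # acceptable answer strings a
    popularity: int            # page-view count (made-up number below)

    @property
    def question(self) -> str:
        # Natural language question q built from (s, r).
        return TEMPLATES[self.relationship].format(subj=self.subject)


example = FactoidExample("France", "capital", ("Paris",), 1_000_000)
print(example.question)
```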

Pipeline Flow

Entity Linker (identifies subject entity)
Popularity Check (retrieves page views)
Decision Gate (Retrieval vs. Parametric)
Branch A: Parametric Generation (Standard LM)
Branch B: Retrieval-Augmented Generation (Retriever + Reader)
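
The flow above can be sketched as a single routing function. This is a minimal sketch under stated assumptions, not the authors' implementation: every callable passed in (`link_entity`, `get_popularity`, `answer_parametric`, `answer_with_retrieval`) is a hypothetical stand-in, and the page-view numbers and threshold in the toy usage are made up.

```python
from typing import Callable, Tuple


def adaptive_retrieval_answer(
    question: str,
    link_entity: Callable[[str], str],
    get_popularity: Callable[[str], float],
    answer_parametric: Callable[[str], str],
    answer_with_retrieval: Callable[[str], str],
    threshold: float,
) -> Tuple[str, str]:
    """Route a question to Branch A or Branch B; returns (branch, answer)."""
    entity = link_entity(question)        # Entity Linker
    popularity = get_popularity(entity)   # Popularity Check
    if popularity >= threshold:           # Decision Gate
        return "parametric", answer_parametric(question)    # Branch A
    return "retrieval", answer_with_retrieval(question)     # Branch B


# Toy stand-ins for illustration only; the page-view counts are invented.
PAGE_VIEWS = {"France": 500_000, "George Rankin": 120}


def toy_linker(question: str) -> str:
    return next(name for name in PAGE_VIEWS if name in question)


def toy_route(question: str) -> Tuple[str, str]:
    return adaptive_retrieval_answer(
        question,
        toy_linker,
        PAGE_VIEWS.__getitem__,
        lambda q: "LM-only answer",
        lambda q: "retrieval-augmented answer",
        threshold=1_000,
    )


print(toy_route("What is the capital of France?"))
print(toy_route("What is George Rankin's occupation?"))
```

Keeping the two branches as injected callables leaves the gate independent of any particular retriever or LM.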

System Modules

Entity Linker

Identify the subject entity in the question to determine its popularity

Model or implementation: Off-the-shelf entity linker (implementation detail not specified, likely Wikidata API lookup)

Decision Gate (Adaptive Retrieval)

Decide whether to retrieve external context based on entity popularity threshold

Model or implementation: Threshold-based heuristic
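
One plausible way to set such a threshold is to sweep candidate values on a development set and keep the one with the highest adaptive accuracy. The sketch below assumes each development question has already been scored twice (once with the LM alone, once with retrieval); the record layout and function names are hypothetical, and the procedure is an illustration rather than a confirmed description of the paper's tuning.

```python
from typing import List, Tuple

# (popularity, parametric answer correct?, retrieval-augmented answer correct?)
DevRecord = Tuple[float, bool, bool]


def adaptive_accuracy(records: List[DevRecord], threshold: float) -> float:
    """Accuracy when entities at or above the threshold skip retrieval."""
    hits = sum(
        param_ok if pop >= threshold else retr_ok
        for pop, param_ok, retr_ok in records
    )
    return hits / len(records)


def best_threshold(records: List[DevRecord]) -> float:
    """Pick the candidate threshold with the highest adaptive accuracy."""
    # The smallest popularity means "never retrieve"; infinity means "always retrieve".
    candidates = sorted({pop for pop, _, _ in records}) + [float("inf")]
    return max(candidates, key=lambda t: adaptive_accuracy(records, t))


dev = [
    (10, False, True),
    (100, False, True),
    (1_000, True, False),
    (10_000, True, True),
]
tuned = best_threshold(dev)
print(tuned, adaptive_accuracy(dev, tuned))
```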

Retriever

Fetch relevant passages if triggered

Model or implementation: Contriever (dense) or BM25 (sparse)
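
For intuition about the sparse option, the following is a minimal, self-contained BM25 scorer over pre-tokenized passages. It is a sketch, not the retrieval stack used in the paper: the `k1` and `b` defaults are common conventions chosen here as assumptions, the IDF uses the non-negative `log(1 + ...)` variant, and the toy passages are invented. Contriever, being a dense neural retriever, is not covered by this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


passages = [
    "paris is the capital of france".split(),
    "berlin is the capital of germany".split(),
    "george rankin was a politician".split(),
]
scores = bm25_scores("capital france".split(), passages)
print(scores)
```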

Generator

Generate answer using either prompt alone or prompt + retrieved context

Model or implementation: GPT-3 (davinci-002/003), OPT, or GPT-Neo
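
The two generation modes differ only in what is placed in the prompt. The helper below sketches that difference; the `Q: ... A:` layout and the `Context:` prefix are assumed templates for illustration, not necessarily the exact prompts used in the paper.

```python
from typing import Optional, Sequence


def build_prompt(question: str, passages: Optional[Sequence[str]] = None) -> str:
    """Build an LM-only prompt, or prepend retrieved passages when provided."""
    query = f"Q: {question} A:"
    if not passages:
        return query                      # prompt alone (parametric memory)
    context = "\n".join(passages)
    return f"Context: {context}\n{query}"  # prompt + retrieved context


print(build_prompt("What is the capital of France?"))
print(build_prompt("What is the capital of France?", ["Paris is the capital of France."]))
```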

Novel Architectural Elements

Popularity-based routing mechanism: Explicitly using external popularity metrics (Wikipedia views) rather than model confidence to gate retrieval

Modeling

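Base Model: Multiple models evaluated: GPT-3 (davinci-002, davinci-003), OPT (1.3B-13B), and the GPT-Neo family (1.3B-20B; in this family the 6B checkpoint is GPT-J and the 20B checkpoint is GPT-NeoX)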

Training Method: Zero-shot or Few-shot prompting (Inference only)

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RAG: Selectively retrieves only for long-tail entities, reducing cost and preventing distraction on popular entities
vs. Parametric-only: Significantly higher accuracy on long-tail data where parametric memory fails
vs. Confidence-based Retrieval [not cited in paper]: Uses external popularity proxy instead of internal model confidence/perplexity to trigger retrieval

Limitations

Relies on entity linking and Wikipedia page view statistics which may not exist for all domains
The threshold is a heuristic and might need tuning for different datasets
Focuses only on simple triple-based factual questions, not complex reasoning
Retrieval can sometimes hurt performance on popular entities due to misleading context (distractors)

Reproducibility

Code: https://github.com/AlexTMallen/adaptive-retrieval

Code and PopQA data are publicly available at https://github.com/AlexTMallen/adaptive-retrieval. The paper uses public models (GPT-Neo, OPT) and APIs (GPT-3). Entity popularity data is derived from Wikipedia page views, for which a dump is available.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using exact match accuracy

Benchmarks:

PopQA (Entity-centric Factoid QA) [New]
EntityQuestions (Entity-centric Factoid QA)

Metrics:

Exact Match Accuracy
Statistical methodology: Not explicitly reported in the paper
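
As a reference point for the metric, here is a minimal sketch of exact-match scoring against a set of gold answers. The normalization steps (lowercasing, stripping punctuation, collapsing whitespace) and the optional `allow_substring` mode, which counts a prediction as correct if a gold answer appears anywhere inside it, are assumptions made for illustration; the summary does not specify the exact matching rule.

```python
import string
from typing import List, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(
    prediction: str, gold_answers: Sequence[str], allow_substring: bool = False
) -> bool:
    pred = normalize(prediction)
    golds = [normalize(g) for g in gold_answers]
    if allow_substring:
        # Lenient mode: any non-empty gold answer contained in the prediction.
        return any(g and g in pred for g in golds)
    return pred in golds


def em_accuracy(
    predictions: List[str],
    gold_lists: List[Sequence[str]],
    allow_substring: bool = False,
) -> float:
    """Percentage of predictions that match at least one gold answer."""
    hits = sum(
        exact_match(p, g, allow_substring) for p, g in zip(predictions, gold_lists)
    )
    return 100.0 * hits / len(predictions)


print(em_accuracy(["Paris.", "Rome"], [["Paris"], ["Berlin"]]))
```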

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (points)
Scaling model size improves memory for popular entities but fails for the long tail.
PopQA (4,000 least popular)	Accuracy (%)	15	16	+1
PopQA (4,000 least popular)	Accuracy (%)	15	19	+4
Retrieval augmentation is highly effective for the long tail, sometimes outperforming much larger non-retrieval models.
PopQA (4,000 least popular)	Accuracy (%)	19	29	+10
Adaptive Retrieval balances performance and cost.
PopQA	Accuracy (%)	35	45	+10

Experiment Figures

Heatmaps and scatter plots showing accuracy vs. entity popularity across different models and relationship types.

Line graphs comparing parametric vs. retrieval-augmented performance across popularity bins.

Main Takeaways

LMs exhibit a 'popularity bias': memorization is strongly correlated with subject entity popularity (page views).
Scaling model size primarily improves accuracy on popular entities; the curve for rare entities remains flat even for GPT-3.
Retrieval is crucial for the long tail but can degrade performance on popular entities (likely due to retrieval noise/distractors).
Adaptive Retrieval based on popularity effectively combines the strengths of parametric memory (for popular facts) and non-parametric memory (for rare facts).
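
The popularity trend described in these takeaways can be checked on any set of scored predictions by grouping questions into order-of-magnitude popularity bins. The sketch below assumes page-view counts and per-question correctness flags are already available; the function name, the base-10 binning, and the toy numbers are illustrative choices, not the paper's analysis code.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def accuracy_by_popularity_bin(
    popularities: Sequence[float], correct: Sequence[bool]
) -> Dict[int, float]:
    """Map each log10 popularity bin to the accuracy of questions in that bin."""
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for pop, ok in zip(popularities, correct):
        # Bin k holds popularities in [10**k, 10**(k+1)); values below 1 go to bin 0.
        bin_id = int(math.floor(math.log10(max(pop, 1))))
        totals[bin_id] += 1
        hits[bin_id] += int(ok)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


views = [5, 50, 500, 700, 20_000]
is_right = [False, False, True, False, True]
per_bin = accuracy_by_popularity_bin(views, is_right)
print(per_bin)
```

Running this separately on parametric and retrieval-augmented predictions gives the two curves needed to see where they cross.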

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and zero-shot/few-shot prompting
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of Knowledge Graphs (Wikidata triples)

Key Terms

Parametric memory: Knowledge stored implicitly in the weights (parameters) of a pre-trained neural network

Non-parametric memory: External knowledge sources (e.g., Wikipedia passages) accessed via retrieval at inference time

PopQA: The new dataset introduced in this paper, consisting of 14k questions derived from Wikidata triples about entities that span a wide range of popularity, including the long tail

Long-tail entities: Entities that appear infrequently in training data or real-world usage, approximated here by low Wikipedia page views

Contriever: A dense information retrieval model trained using contrastive learning to match queries with relevant documents

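BM25: A widely used sparse ranking function for information retrieval that scores documents on exact term matches, weighting term frequency by inverse document frequency and normalizing for document length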

Greedy decoding: A generation strategy where the model selects the highest probability token at each step

EntityQuestions: An existing open-domain QA dataset used as a secondary benchmark, also featuring long-tail distribution