Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

📝 Paper Summary

Factual knowledge evaluation · Parametric knowledge vs. Recall · Inference-time compute (Thinking)

Factual errors in frontier LLMs stem primarily from failures to access encoded knowledge (recall failures) rather than from missing knowledge, and inference-time thinking can partially relieve this bottleneck.

Core Problem

Standard factuality evaluations treat all errors alike, failing to distinguish whether an LLM lacks the knowledge entirely (encoding failure) or simply cannot access it under specific conditions (recall failure).

Why it matters:

Scaling model size improves encoding but leaves a large recall gap: frontier models that encode 95–98% of WikiProfile facts still fail to recall 25–33% of them without thinking
Distinguishing error types is crucial: encoding failures need pre-training interventions, while recall failures require post-training or inference-time solutions like 'thinking'
Current metrics obscure the root causes of phenomena like the reversal curse and long-tail hallucinations

Concrete Example: An LLM might correctly complete 'Oasis played their first gig at the...' with 'Boardwalk club' (encoded), but fail to answer 'Which band played their first gig at the Boardwalk club?' (recall failure). This shows the knowledge exists but is inaccessible via the reverse query.

Key Novelty

Knowledge Profiling Framework & WikiProfile Benchmark

Classifies facts into profiles based on accessibility: 'Direct Recall' (known instantly), 'Recall with Thinking' (needs compute), and 'Recall Failure' (encoded but inaccessible), separating storage from access
Operationalizes 'Encoding' behaviorally by testing if a model can reproduce a fact when primed with its exact pre-training context, bypassing the need for weight access
Demonstrates that 'thinking' (inference-time compute) acts as a recovery mechanism for 'lost keys': memories that are stored but temporarily inaccessible
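The three profiles can be read as a small decision rule over per-fact outcomes. The following is a minimal Python sketch of that rule, assuming each fact has already been reduced to three booleans (encoded, answered correctly without thinking, answered correctly with thinking). The names `Profile` and `classify_fact`, and the extra `NOT_ENCODED` label for facts that fail the encoding probe, are hypothetical and not taken from the paper.

```python
from enum import Enum


class Profile(Enum):
    NOT_ENCODED = "not encoded (empty shelf)"
    DIRECT_RECALL = "direct recall"
    RECALL_WITH_THINKING = "recall with thinking"
    RECALL_FAILURE = "recall failure (lost key)"


def classify_fact(encoded: bool, correct_direct: bool, correct_thinking: bool) -> Profile:
    """Map three behavioral outcomes for one fact to a knowledge profile."""
    if not encoded:
        # Simplification: facts that fail the encoding probe are not split further.
        return Profile.NOT_ENCODED
    if correct_direct:
        return Profile.DIRECT_RECALL
    if correct_thinking:
        return Profile.RECALL_WITH_THINKING
    return Profile.RECALL_FAILURE
```

For example, `classify_fact(True, False, True)` yields `Profile.RECALL_WITH_THINKING`, the case where thinking recovers a stored fact. How the paper aggregates several probes into each boolean is not specified here, so that step is left outside the sketch.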

Evaluation Highlights

Encoding is nearly saturated: GPT-5 and Gemini-3 encode 95–98% of facts in WikiProfile, yet fail to recall 25–33% without thinking
Thinking recovers 40–65% of facts that are encoded but not directly known in thinking-optimized LLMs
The 'reversal curse' is a recall failure: LLMs recognize correct reverse answers in multiple-choice (verification) even when they cannot generate them

Breakthrough Assessment

9/10

Fundamentally reframes the understanding of LLM factuality from a storage problem to an access problem. Provides a rigorous methodology for probing closed models and compellingly links 'thinking' to memory recall.

⚙️ Technical Details

Problem Definition

Setting: Behavioral evaluation of factual knowledge in LLMs without access to model weights

Inputs: A fact f (subject-relation-object) derived from a document

Outputs: Classification of the fact into a knowledge profile (e.g., Recall Failure, Direct Recall) based on response accuracy across different prompting contexts

Pipeline Flow

Fact Extraction: Identify entities/facts from Wikipedia
Question Generation: Create probing tasks (completion, direct/reverse QA, multiple-choice)
Response Generation: Query target LLMs (with/without thinking)
Evaluation: Auto-rate responses to classify facts into profiles
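The last three stages of this flow can be sketched as a loop over facts and probing questions. The Python below is only an illustration under stated assumptions: fact extraction is taken as already done, and `run_profiling` together with the `toy_*` stand-ins are hypothetical placeholders for the question generator, the target LLM, and the autorater, none of which are the paper's actual implementation.

```python
from typing import Callable, Dict, List


def run_profiling(
    facts: List[str],
    make_questions: Callable[[str], List[str]],
    query_model: Callable[[str], str],
    autorate: Callable[[str, str, str], str],
) -> Dict[str, List[str]]:
    """Run question generation, response generation, and grading per fact."""
    grades: Dict[str, List[str]] = {}
    for fact in facts:
        grades[fact] = []
        for question in make_questions(fact):
            response = query_model(question)
            grades[fact].append(autorate(fact, question, response))
    return grades


# Toy stand-ins so the sketch runs without any model access.
def toy_questions(fact: str) -> List[str]:
    return [f"Complete: {fact[:10]}", f"Question about: {fact}"]


def toy_model(question: str) -> str:
    return "Boardwalk club" if question.startswith("Complete") else "unsure"


def toy_autorater(fact: str, question: str, response: str) -> str:
    return "CORRECT" if response in fact else "INCORRECT"


result = run_profiling(
    ["Oasis played their first gig at the Boardwalk club"],
    toy_questions,
    toy_model,
    toy_autorater,
)
```

Passing the stages in as callables keeps the loop independent of any particular model API, which matches the black-box setting described above.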

System Modules

Fact Extractor (Benchmark Creation)

Selects object entities from Wikipedia documents that are non-trivial and unique given the left context

Model or implementation: Gemini-2.5-Pro (as part of automated pipeline)

Question Generator (Benchmark Creation)

Generates 10 distinct tasks per fact: 2 encoding probes (completion), 4 knowledge probes (direct/reverse QA), 4 verification probes (MCQA)

Model or implementation: Gemini-2.5-Pro (grounded by Google Search)

Autorater

Grades model responses as CORRECT, INCORRECT, or OTHER

Model or implementation: Gemini-2.5-Pro

Novel Architectural Elements

Behavioral definition of 'Encoding' using context-primed completion tasks to proxy for parametric storage in closed-weight models
Knowledge Profiling taxonomy that explicitly separates 'Thinking' as a distinct accessibility tier

Modeling

Base Model: Evaluated 13 LLMs including Gemini-3, GPT-5, Gemma3, GPT-4.1

Training Method: Evaluation only (paper proposes a benchmark and framework, does not train a new model)

Compute: Generated ~4.5 million responses across 13 models. Inference costs not explicitly reported.

Comparison to Prior Work

vs. Latent Knowledge Probes: This behavioral framework works on closed-source models via prompting rather than requiring weight access
vs. Reversal Curse studies: Shows models CAN recognize reverse facts (verification), indicating the curse is a recall failure, not a learning failure
vs. Standard QA Benchmarks (e.g., TriviaQA): Shifts unit of analysis from 'questions' to 'facts', probing the same fact across multiple task formats (completion, QA, MCQA) to diagnose the error source

Limitations

Relies on 'Autoraters' (LLM judges), though agreement with human/cross-model judges is high (>98%)
Definition of encoding assumes that a model which cannot complete the exact training context does not encode the fact, which can produce false negatives
Focuses on Wikipedia facts, which may not be representative of other knowledge domains

📊 Experiments & Results

Evaluation Setup

Zero-shot probing of 13 LLMs on 2,150 facts using the WikiProfile benchmark.

Benchmarks:

WikiProfile [New]: Factuality profiling (Completion, QA, MCQA)

Metrics:

Encoding Rate (% of facts reproducible in context)
Direct Recall Rate (% of encoded facts known without thinking)
Thinking Recall Rate (% of encoded facts known ONLY with thinking)
Statistical methodology: Not explicitly reported in the paper
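The three rates above follow directly from per-fact outcomes. The sketch below shows one way to compute them in Python, assuming each fact is summarized as a tuple of three booleans; the function name `profile_rates`, the record layout, and the toy data are illustrative assumptions, not the paper's code or results.

```python
from typing import Dict, List, Tuple

# Each record: (encoded, correct_without_thinking, correct_with_thinking)
Record = Tuple[bool, bool, bool]


def profile_rates(records: List[Record]) -> Dict[str, float]:
    """Compute encoding, direct recall, and thinking-only recall rates in percent."""
    encoded = [r for r in records if r[0]]
    if not records or not encoded:
        raise ValueError("need at least one fact and at least one encoded fact")
    direct = sum(1 for r in encoded if r[1])
    # "ONLY with thinking": wrong without thinking, right with thinking.
    thinking_only = sum(1 for r in encoded if not r[1] and r[2])
    return {
        "encoding_rate": 100.0 * len(encoded) / len(records),
        "direct_recall_rate": 100.0 * direct / len(encoded),
        "thinking_recall_rate": 100.0 * thinking_only / len(encoded),
    }


# Toy data: 5 facts, 4 encoded, 2 recalled directly, 1 recovered by thinking.
toy = [
    (True, True, True),
    (True, True, True),
    (True, False, True),
    (True, False, False),
    (False, False, False),
]
rates = profile_rates(toy)
```

Note that the recall rates use encoded facts as the denominator, not all facts, which is what separates an access failure from a storage failure.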

Key Results

Encoding vs. Recall analysis shows that while frontier models store almost all facts, they struggle to access them.
Benchmark	Metric	Baseline	This Paper	Δ
WikiProfile	Encoding Rate	98.1	98.1	0.0
WikiProfile	Direct Recall Rate	67.3	67.3	0.0
WikiProfile	Recall Gain (Low Popularity)	63.3	84.7	+21.4
WikiProfile	Recall Gap (Direct vs Reverse)	83.0	74.0	-9.0
WikiProfile	Verification Gap (Direct vs Reverse)	86.0	90.7	+4.7
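The Δ column appears to be the second value column minus the first. The short Python check below tests that reading against the rows as printed; the helper name `mismatched_rows` is hypothetical, and the interpretation of the two value columns is an assumption rather than something the summary states.

```python
# (metric, first value column, second value column, reported delta), copied from the table
ROWS = [
    ("Encoding Rate", 98.1, 98.1, 0.0),
    ("Direct Recall Rate", 67.3, 67.3, 0.0),
    ("Recall Gain (Low Popularity)", 63.3, 84.7, 21.4),
    ("Recall Gap (Direct vs Reverse)", 83.0, 74.0, -9.0),
    ("Verification Gap (Direct vs Reverse)", 86.0, 90.7, 4.7),
]


def mismatched_rows(rows, ndigits=1):
    """Return metrics whose reported delta differs from second minus first."""
    return [
        name
        for name, first, second, delta in rows
        if round(second - first, ndigits) != delta
    ]
```

An empty list from `mismatched_rows(ROWS)` means every reported Δ agrees with the subtraction to one decimal place.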

Main Takeaways

Scaling fills 'empty shelves' (encoding) but not 'lost keys' (recall): Larger models encode more, but the gap between what they store and what they can access remains significant.
The 'Reversal Curse' appears to be a generation failure rather than a learning failure: on verification tasks, models recognize the bidirectional associations they fail to generate.
Thinking (inference-time compute) preferentially aids the recall of encoded facts (recovering 40–65%) rather than enabling inference on non-encoded facts (5–15%).
Fact popularity affects recall far more than encoding: rare facts are encoded almost as well as popular ones, but are much harder to retrieve.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and pre-training
Familiarity with the 'Reversal Curse' in LLMs
Basic concepts of Chain-of-Thought (CoT) prompting

Key Terms

Knowledge Profiling: A framework characterizing facts by whether they are encoded (stored) and how accessible they are (recallable with/without thinking)

Encoding: Operationalized as the ability of an LLM to correctly complete a factual proposition given a context mimicking its pre-training data

Recall: The ability of an LLM to correctly answer questions about an encoded fact across different contexts (phrasings, direct/reverse directions)

Thinking: Inference-time computation (like Chain-of-Thought or internal reasoning traces) used before generating a final answer

Reversal Curse: The phenomenon where an LLM that knows 'A is B' fails to produce A when queried in the reverse direction, starting from B (e.g., knows Oasis -> Boardwalk, but not Boardwalk -> Oasis)

WikiProfile: The new benchmark dataset introduced in this paper, containing 2,150 facts extracted from Wikipedia with associated probing questions

Lost Keys: Metaphor for facts that are encoded in the model parameters but inaccessible during standard inference

Empty Shelves: Metaphor for facts that were never learned or encoded by the model during pre-training

Direct Question: A question asking for the object given the subject (matches training order)

Reverse Question: A question asking for the subject given the object (reverses training order)

Autorater: An LLM-based grader used to evaluate the correctness of model responses against gold answers

Fact: Defined as a proposition involving an ordered pair of entities (subject and object) extracted from a source text
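The Fact, Direct Question, and Reverse Question definitions above can be made concrete with a small data structure. This is a minimal Python sketch, assuming a fact is held as a subject-relation-object triple; the class `Fact` and the helper `gold_answer` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str   # e.g. "Oasis"
    relation: str  # e.g. "played their first gig at"
    obj: str       # e.g. "the Boardwalk club"


def gold_answer(fact: Fact, direction: str) -> str:
    """Return the entity a question in the given direction should elicit."""
    if direction == "direct":
        return fact.obj      # subject is given, object is asked for
    if direction == "reverse":
        return fact.subject  # object is given, subject is asked for
    raise ValueError(f"unknown direction: {direction!r}")


fact = Fact("Oasis", "played their first gig at", "the Boardwalk club")
```

With this fact, the direct question expects "the Boardwalk club" and the reverse question expects "Oasis", mirroring the example used throughout the summary.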