
Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
Technion – Israel Institute of Technology, Google Research
arXiv (2026)
Factuality, Benchmark, Reasoning

Paper Summary

Factual knowledge evaluation, Parametric knowledge vs. Recall, Inference-time compute (Thinking)
Factual errors in frontier LLMs stem primarily from failures to access encoded knowledge (recall failures) rather than missing knowledge, a bottleneck that inference-time thinking can partially unlock.
Core Problem
Standard factuality evaluations treat all errors alike, failing to distinguish whether an LLM lacks the knowledge entirely (encoding failure) or simply cannot access it under specific conditions (recall failure).
Why it matters:
  • Scaling model size improves encoding but leaves a massive recall gap, meaning larger models still fail to use what they know
  • Distinguishing error types is crucial: encoding failures need pre-training interventions, while recall failures require post-training or inference-time solutions like 'thinking'
  • Current metrics obscure the root causes of phenomena like the reversal curse and long-tail hallucinations
Concrete Example: An LLM might correctly complete 'Oasis played their first gig at the...' with 'Boardwalk club' (encoded), but fail to answer 'Which band played their first gig at the Boardwalk club?' (recall failure). This shows the knowledge exists but is inaccessible via the reverse query.
Key Novelty
Knowledge Profiling Framework & WikiProfile Benchmark
  • Classifies facts into profiles based on accessibility: 'Direct Recall' (known instantly), 'Recall with Thinking' (needs compute), and 'Recall Failure' (encoded but inaccessible), separating storage from access
  • Operationalizes 'Encoding' behaviorally by testing if a model can reproduce a fact when primed with its exact pre-training context, bypassing the need for weight access
  • Demonstrates that 'thinking' (inference-time compute) acts as a recovery mechanism for 'lost keys': memories that are stored but temporarily inaccessible
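The profile assignment described above can be sketched as a small decision function. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `assign_profile`, the three boolean inputs, the `ENCODING_FAILURE` label for facts that are not encoded (borrowed from the term used in the problem statement), and the order in which the checks are applied are all hypothetical.

```python
from enum import Enum


class Profile(Enum):
    DIRECT_RECALL = "direct recall"                # answered without thinking
    RECALL_WITH_THINKING = "recall with thinking"  # answered only with inference-time compute
    RECALL_FAILURE = "recall failure"              # encoded but not retrievable
    ENCODING_FAILURE = "encoding failure"          # assumed label: fact is not encoded


def assign_profile(encoded: bool,
                   recalled_directly: bool,
                   recalled_with_thinking: bool) -> Profile:
    """Map three behavioral test outcomes for one fact to a profile.

    Assumption: a successful recall outcome takes precedence over the
    encoding test, so the encoding result only matters when both recall
    attempts fail.
    """
    if recalled_directly:
        return Profile.DIRECT_RECALL
    if recalled_with_thinking:
        return Profile.RECALL_WITH_THINKING
    if encoded:
        return Profile.RECALL_FAILURE
    return Profile.ENCODING_FAILURE


# A "lost key": the fact is encoded, but only thinking retrieves it.
print(assign_profile(encoded=True, recalled_directly=False, recalled_with_thinking=True))
```

Keeping the encoding test as a separate input is what lets the sketch distinguish a missing fact from a stored fact that cannot be reached.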
Evaluation Highlights
  • Encoding is nearly saturated: GPT-5 and Gemini-3 encode 95–98% of facts in WikiProfile, yet fail to recall 25–33% without thinking
  • Thinking recovers 40–65% of facts that are encoded but not directly known in thinking-optimized LLMs
  • The 'reversal curse' is a recall failure: LLMs recognize correct reverse answers in multiple-choice (verification) even when they cannot generate them
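One plausible reading of the recovery figure above is the share of encoded-but-not-directly-recalled facts that thinking retrieves. The sketch below computes that ratio; the function name `thinking_recovery_rate`, its parameters, and the counts in the usage line are hypothetical and chosen only for illustration, and the paper's exact definition may differ.

```python
def thinking_recovery_rate(recovered_with_thinking: int, still_unrecalled: int) -> float:
    """Fraction of encoded-but-not-directly-recalled facts that thinking recovers.

    The denominator is the assumed pool of facts that are encoded yet not
    answered directly: those recovered with thinking plus those that stay
    unrecalled.
    """
    pool = recovered_with_thinking + still_unrecalled
    if pool == 0:
        raise ValueError("no encoded-but-not-directly-recalled facts to evaluate")
    return recovered_with_thinking / pool


# Hypothetical counts, not results from the paper.
rate = thinking_recovery_rate(recovered_with_thinking=120, still_unrecalled=80)
print(f"{rate:.0%}")
```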
Breakthrough Assessment
9/10
Fundamentally reframes the understanding of LLM factuality from a storage problem to an access problem. Provides a rigorous methodology for probing closed models and compellingly links 'thinking' to memory recall.