Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory

📝 Paper Summary

Hallucination suppression, Knowledge internalization

LLMs often store correct factual knowledge in their parameters even when generating incorrect or 'unsure' answers, which can be recovered by inspecting top-ranked token probabilities.

Core Problem

Standard QA accuracy metrics underestimate LLM knowledge because models frequently output incorrect answers or 'unsure' responses even when the correct answer is present with high probability in the internal logit distribution.

Why it matters:

Current prompting strategies that encourage models to say 'unsure' to reduce hallucinations inadvertently suppress valid knowledge expression
Evaluating models solely on top-1 generation fails to capture the true extent of factual information encoded in parameters
Deployment strategies relying on surface-level accuracy may falsely conclude a model lacks domain knowledge when it actually suffers from conservative decoding

Concrete Example: When asked 'What is the capital of Washington?', a model might output 'Seattle' or 'unsure', yet the correct answer 'Olympia' appears as the second or third most probable token in the model's internal ranking.

Key Novelty

Hits@k for Latent Knowledge Evaluation & 'Unsure' Filtering

Introduces a metric (Hits@k) that counts a model as 'knowing' a fact if the correct answer appears anywhere in the top-k most probable tokens, revealing latent memory
Identifies a 'memory-masking effect' where safety prompts cause models to output 'unsure' despite having the correct answer as a high-probability candidate
Proposes a decoding strategy that filters out uninformative tokens (like 'unsure') and forces the model to select the next best candidate, often recovering the correct fact
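The Hits@k idea above can be made concrete with a minimal sketch. It assumes each question comes with a list of candidate answers already sorted from most to least probable; the function name `hits_at_k`, the input format, and the toy data are illustrative assumptions, not the paper's code.

```python
from typing import Sequence


def hits_at_k(ranked_candidates: Sequence[Sequence[str]],
              gold_answers: Sequence[str],
              k: int) -> float:
    """Fraction of questions whose gold answer is among the top-k candidates.

    ranked_candidates[i] is the candidate list for question i, sorted from
    most to least probable (hypothetical input format).
    """
    if len(ranked_candidates) != len(gold_answers):
        raise ValueError("one gold answer is required per question")
    if not gold_answers:
        return 0.0
    hits = 0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        # Case-insensitive exact match against the first k candidates.
        top_k = [c.strip().lower() for c in candidates[:k]]
        if gold.strip().lower() in top_k:
            hits += 1
    return hits / len(gold_answers)


# Toy data (made up); the first question mirrors the Washington example.
ranked = [
    ["Seattle", "Olympia", "Tacoma"],
    ["unsure", "Paris", "Lyon"],
    ["Sydney", "Melbourne", "Perth"],
]
gold = ["Olympia", "Paris", "Canberra"]

hits_1 = hits_at_k(ranked, gold, k=1)  # standard accuracy
hits_3 = hits_at_k(ranked, gold, k=3)  # latent knowledge within the top 3
```

On this toy data no question is answered correctly at k=1, while two of the three gold answers appear within the top 3, which is the kind of gap the metric is meant to expose.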

Architecture

Conceptual illustration of the knowledge storage-expression gap. Shows a model outputting 'Seattle' (incorrect) for Washington's capital, while the correct answer 'Olympia' has high probability in the distribution.

Evaluation Highlights

On DBpedia, LLaMA3-8b achieves only 17.2% standard accuracy (Hits@1) but reaches 57.9% latent knowledge retention (Hits@5), a 40.7-point gap between expression and storage
Newer models like LLaMA3-70b reach 92.1% Hits@5 on DBpedia (Head) compared to 70.5% for LLaMA2-70b, showing improved knowledge encoding
Filtering out 'unsure' responses recovers a substantial share of correct answers, effectively turning 'unsure' outputs into correct predictions

Breakthrough Assessment

7/10

Provides strong empirical evidence of the storage-expression gap and challenges the standard practice of encouraging 'unsure' responses without checking latent confidence.

⚙️ Technical Details

Problem Definition

Setting: Open-domain and domain-specific Question Answering (QA)

Inputs: Natural language question q

Outputs: Answer string a (derived from top-ranked tokens)

Pipeline Flow

Input Question -> LLM Forward Pass -> Logit Generation
Logit Analysis (Standard): Select Top-1 Token -> Final Answer
Logit Analysis (Proposed): Inspect Top-k Tokens -> Hits@k Calculation
Recovery (Proposed): If Top-1 is 'Unsure' -> Filter -> Select next best token
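The first three steps of the flow above can be sketched as follows. This is a minimal illustration over a made-up four-token vocabulary, not the paper's implementation; `softmax`, `top_k_tokens`, and the logit values are assumptions, and a real setup would read logits from the model's forward pass.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert raw logits into probabilities (numerically stable form)."""
    max_logit = max(logits.values())
    exps = {tok: math.exp(val - max_logit) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_tokens(logits: Dict[str, float], k: int) -> List[Tuple[str, float]]:
    """Return the k most probable (token, probability) pairs, best first."""
    probs = softmax(logits)
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up logits over a tiny vocabulary, for illustration only.
toy_logits = {"Seattle": 3.1, "Olympia": 2.8, "unsure": 1.5, "Tacoma": 0.4}

# Standard path: keep only the top-1 token as the final answer.
standard_answer = top_k_tokens(toy_logits, k=1)[0][0]

# Proposed path: inspect the top-k tokens and check for the gold answer.
inspected = [tok for tok, _ in top_k_tokens(toy_logits, k=3)]
gold_in_top_k = "Olympia" in inspected
```

Here the standard path returns 'Seattle', while the top-3 inspection still finds 'Olympia', so the question would count as a miss at Hits@1 and a hit at Hits@3.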

System Modules

LLM Backbone

Generate probability distribution over vocabulary given the question

Model or implementation: Various (e.g., LLaMA3-8b, LLaMA2-70b, Mistral-7B)

Token Filter

Identify and remove uninformative tokens (unsure, null, stop words) from top-k candidates

Model or implementation: Heuristic Rule-based

Novel Architectural Elements

Two-stage decoding procedure: generates logits, detects 'unsure' in top-1, filters it out, and selects the next highest probability informative token (novel decoding logic, not model architecture)
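A minimal sketch of the two-stage procedure described above, assuming the candidates are already ranked by probability. The `UNINFORMATIVE` set is a hypothetical stand-in for the heuristic rules, which the summary does not spell out, and the probabilities are made up.

```python
from typing import List, Optional, Tuple

# Hypothetical filter list: the summary names 'unsure', null/empty strings,
# and stop words as uninformative but does not give the exact rules.
UNINFORMATIVE = {"unsure", "", "null", "the", "a", "an", "of"}


def is_uninformative(token: str) -> bool:
    """Heuristic check for tokens that carry no factual answer."""
    return token.strip().lower() in UNINFORMATIVE


def two_stage_decode(ranked: List[Tuple[str, float]]) -> Optional[str]:
    """Pick the top-1 token unless it is uninformative.

    Stage 1: take the highest-probability token.
    Stage 2: if that token is uninformative, fall back to the
    highest-ranked informative token, or None if there is none.
    `ranked` must be sorted from most to least probable.
    """
    if not ranked:
        return None
    top_token = ranked[0][0]
    if not is_uninformative(top_token):
        return top_token.strip()
    for token, _prob in ranked[1:]:
        if not is_uninformative(token):
            return token.strip()
    return None


# Made-up ranking in which 'unsure' masks the correct answer.
masked = [("unsure", 0.46), ("Olympia", 0.31), ("Seattle", 0.18)]
recovered = two_stage_decode(masked)
```

Returning `None` when every candidate is filtered keeps a genuine abstention distinguishable from a recovered answer, which seems a reasonable choice for an analytical probe.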

Modeling

Base Model: LLaMA3-8b, LLaMA3-70b, LLaMA2-13b/70b, Qwen2, Mistral-7B

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard QA: Evaluates latent knowledge (Hits@k) rather than just surface realization
vs. Confidence Calibration: Specifically targets the 'unsure' response masking effect rather than general probability calibration
vs. CAD (Context-Aware Decoding) [not cited in paper]: Focuses on recovering suppressed knowledge in standard QA rather than contrasting context vs. prior

Limitations

The Hits@k metric requires string matching, which can be noisy under subword tokenization
Analysis focuses on the first token of the answer; multi-token answers are harder to evaluate via single-step logits
Unsure filtering is presented as an analytical probe rather than a production-ready system
Domain-specific datasets (IMDB, Goodreads) show higher memory loss than open-domain ones, limiting recovery potential
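The string-matching and first-token limitations above can be illustrated with a small sketch. The prefix rule in `first_token_match` is an assumption chosen for illustration, not the paper's matching procedure, and the subword pieces are hypothetical; real tokenizers split text differently.

```python
def first_token_match(candidate_token: str, gold_answer: str) -> bool:
    """Loose match: the candidate token is a non-empty prefix of the gold answer."""
    cand = candidate_token.strip().lower()
    gold = gold_answer.strip().lower()
    return bool(cand) and gold.startswith(cand)


# Hypothetical subword pieces, for illustration only.
exact = first_token_match("Olympia", "Olympia")    # whole-word token
partial = first_token_match(" Ol", "Olympia")      # subword prefix of the gold answer
ambiguous = first_token_match(" Ol", "Oldenburg")  # same piece fits another entity
miss = first_token_match("Seattle", "Olympia")     # unrelated token
```

The same piece ' Ol' matches both 'Olympia' and 'Oldenburg', so a single decoding step cannot tell which entity the model would go on to produce. That is one way a first-token Hits@k score can be noisy.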

Reproducibility

Code: https://github.com/microsoft/LMOps

📊 Experiments & Results

Evaluation Setup

Few-shot Question Answering on Open-Domain and Specific-Domain Knowledge

Benchmarks:

DBpedia (open-domain QA; Head/Torso/Tail splits)
IMDB (movie-domain QA)
Goodreads (book-domain QA)

Metrics:

Hits@k (k=1, 2, 5, 10, ...)
Standard Accuracy (equivalent to Hits@1)
Recovery Rate (from 'unsure' responses)
Statistical methodology: Not explicitly reported in the paper
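The summary lists a recovery rate but does not give its formula. The sketch below assumes one plausible definition: among questions whose top-1 candidate is 'unsure', the share whose first remaining candidate matches the gold answer. The function name, the `UNSURE_TOKENS` set, and the toy rankings are all illustrative.

```python
from typing import List, Sequence

UNSURE_TOKENS = {"unsure"}  # assumed marker for an abstaining top-1 answer


def recovery_rate(ranked_candidates: Sequence[List[str]],
                  gold_answers: Sequence[str]) -> float:
    """Share of 'unsure' top-1 outputs that become correct after filtering.

    Assumed definition: among questions whose top-1 candidate is 'unsure',
    count those where the first non-'unsure' candidate equals the gold answer.
    """
    unsure_total = 0
    recovered = 0
    for candidates, gold in zip(ranked_candidates, gold_answers):
        if not candidates or candidates[0].strip().lower() not in UNSURE_TOKENS:
            continue  # the model did not abstain on this question
        unsure_total += 1
        remaining = [c for c in candidates
                     if c.strip().lower() not in UNSURE_TOKENS]
        if remaining and remaining[0].strip().lower() == gold.strip().lower():
            recovered += 1
    return recovered / unsure_total if unsure_total else 0.0


# Toy rankings (made up): two 'unsure' outputs, one of which is recoverable.
ranked = [
    ["unsure", "Olympia", "Seattle"],
    ["unsure", "Lyon", "Paris"],
    ["Berlin", "Munich"],
]
gold = ["Olympia", "Paris", "Berlin"]
rate = recovery_rate(ranked, gold)
```

In the toy data, one of the two 'unsure' outputs is recovered, giving a rate of 0.5. The third question is ignored because the model did not abstain on it.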

Key Results

A large gap between standard accuracy (Hits@1) and latent knowledge (Hits@5) is typically observed across models.

Benchmark	Metric	Baseline	This Paper	Δ
DBpedia (Head)	Hits@1 (Accuracy), LLaMA3-8b	17.2	17.2	0.0
DBpedia (Head)	Hits@5, LLaMA3-8b (baseline: its own Hits@1)	17.2	57.9	+40.7
DBpedia (Head)	Hits@5, LLaMA3-70b (baseline: LLaMA2-70b)	70.5	92.1	+21.6
DBpedia (Head)	Hits@50	Not reported in the paper	80	Not reported in the paper
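The two non-zero Δ values in the table measure different things, which the arithmetic makes explicit. Using the figures reported in Evaluation Highlights (the subscripts are notation introduced here for illustration):

```latex
\begin{align*}
\Delta_{\text{within model}} &= \text{Hits@5}_{\text{LLaMA3-8b}} - \text{Hits@1}_{\text{LLaMA3-8b}} = 57.9 - 17.2 = 40.7 \\
\Delta_{\text{across models}} &= \text{Hits@5}_{\text{LLaMA3-70b}} - \text{Hits@5}_{\text{LLaMA2-70b}} = 92.1 - 70.5 = 21.6
\end{align*}
```

Both differences are in percentage points. The first is the storage-expression gap within a single model, where Hits@5 is roughly 3.4 times Hits@1; the second is a same-metric comparison between two model generations.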

Experiment Figures

Comparison of Accuracy vs Hits@k across different model sizes (LLaMA-2 vs LLaMA-3) on DBpedia.

Breakdown of response types (Correct, Wrong, Uninformative) across Head, Torso, and Tail splits of DBpedia.

Main Takeaways

Significant 'storage-expression gap': Models consistently score much higher on Hits@k than Hits@1, indicating that they store knowledge they fail to express.
Newer models (LLaMA3) show substantially higher Hits@k scores than older ones (LLaMA2) of the same size (92.1% vs. 70.5% Hits@5 at 70b on DBpedia Head), suggesting better knowledge encoding.
Domain specificity matters: Open-domain (DBpedia) knowledge is retained better than domain-specific knowledge (IMDB, Goodreads), which is more sensitive to entity popularity.
Unsure responses often mask correct answers: When models say 'unsure', the correct answer is frequently the 2nd or 3rd ranked token, allowing for recovery via filtering.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM token generation (logits, softmax)
Familiarity with standard QA metrics (Accuracy, Exact Match)
Basic knowledge of few-shot prompting

Key Terms

Hits@k: A metric counting a prediction as correct if the ground truth answer appears among the k most probable tokens in the model's output distribution
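Written as a formula, under notation assumed here rather than taken from the paper: let Q be the set of evaluation questions, a_q^* the ground truth answer for question q, and rank_q(a) the position of answer a in the model's probability-sorted candidates for q.

```latex
\text{Hits@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\!\left[\operatorname{rank}_q\!\left(a_q^{*}\right) \le k\right]
```

Setting k = 1 recovers standard accuracy, and the score can only stay the same or increase as k grows.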

logits: Raw, unnormalized scores output by the last layer of a neural network before being converted to probabilities

uninformative tokens: Tokens that do not provide a factual answer, such as 'unsure', empty strings, or stop words

greedy decoding: A generation strategy where the model always selects the single token with the highest probability at each step

hallucination: When an LLM generates information that is factually incorrect or nonsensical

few-shot QA: Providing the model with a few example question-answer pairs in the prompt to guide its generation