Inside-Out: Hidden Factual Knowledge in LLMs

📝 Paper Summary

Knowledge internalization · Hallucination suppression

LLMs encode significantly more factual knowledge in their internal representations than they express in their generated outputs, sometimes perfectly knowing answers they fail to generate even once.

Core Problem

It is unclear whether LLMs store more factual information than they express, and relying on generated outputs (external knowledge) may underestimate the model's true capabilities.

Why it matters:

Undisclosed knowledge poses safety risks if sensitive information surfaces unexpectedly
Knowing that models possess unused knowledge suggests performance could be improved by surfacing it rather than retraining
Current evaluation methods based on single generated answers fail to distinguish between lack of knowledge and failure to retrieve it

Concrete Example: A model asked 'Which company is Volvo B58 produced by?' fails to generate the correct answer ('Volvo Buses') in 1,000 attempts, yet an internal probe correctly ranks 'Volvo Buses' higher than the incorrect generated answer 'BMW Group'.

Key Novelty

Formal framework for 'Hidden Knowledge'

Defines knowledge as the ability to rank correct answer candidates higher than incorrect ones across all pairs
Distinguishes between 'external knowledge' (ranking via token probabilities) and 'internal knowledge' (ranking via probing classifiers on hidden states)
Quantifies 'hidden knowledge' as the gap where internal ranking accuracy significantly exceeds external ranking accuracy
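The pairwise-ranking definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `k_score` and `k_star`, the strict-inequality treatment of ties, and all numeric scores are assumptions made for the example. "Scania" is a hypothetical extra distractor that does not appear in the paper summary.

```python
from itertools import product
from typing import Sequence


def k_score(scores: Sequence[float], is_correct: Sequence[bool]) -> float:
    """Fraction of (correct, incorrect) answer pairs ranked in the right order."""
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    incorrect = [s for s, ok in zip(scores, is_correct) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")
    # Assumption: a tie counts as a ranking failure (strict inequality).
    wins = sum(1 for c, w in product(correct, incorrect) if c > w)
    return wins / (len(correct) * len(incorrect))


def k_star(scores: Sequence[float], is_correct: Sequence[bool]) -> int:
    """1 if every correct answer outranks every incorrect one, else 0."""
    return int(k_score(scores, is_correct) == 1.0)


# Illustrative numbers only; "Scania" is a hypothetical distractor.
answers = ["Volvo Buses", "BMW Group", "Scania"]
labels = [True, False, False]
external_scores = [0.01, 0.60, 0.05]  # stand-in for token-probability scores
internal_scores = [0.90, 0.30, 0.20]  # stand-in for probe scores

hidden_gap = k_score(internal_scores, labels) - k_score(external_scores, labels)
```

With these toy numbers the external scorer ranks the correct answer last (K = 0) while the internal scorer ranks it first (K = 1), which is the pattern the framework labels hidden knowledge.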

Architecture

The conceptual framework for measuring hidden knowledge via pairwise ranking.

Evaluation Highlights

Internal knowledge scores consistently exceed external scores across Llama-3, Mistral, and Gemma-2, with an average relative gap of 40%
In 9% of questions, models perfectly know the correct answer (internal ranking is perfect) despite failing to generate it even once in 1,000 samples
Using internal probes to select from sampled answers improves QA accuracy by 12% over greedy decoding, with potential for 52% improvement if generation constraints were removed
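The selection result can be pictured as a best-of-N choice under a scoring function. The sketch below is a simplified illustration under stated assumptions: `select_answer`, `toy_probe`, and the score values are invented for this example, and "Volvo Cars" is a hypothetical distractor. It shows why selection cannot help when the correct answer is never sampled.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]  # (question, answer) -> score


def select_answer(question: str, candidates: List[str], scorer: Scorer) -> str:
    """Return the highest-scoring candidate (the first one wins on ties)."""
    if not candidates:
        raise ValueError("no candidates to select from")
    return max(candidates, key=lambda a: scorer(question, a))


# Toy probe scores, invented for illustration only.
TOY_PROBE: Dict[str, float] = {
    "Volvo Buses": 0.92,
    "BMW Group": 0.31,
    "Volvo Cars": 0.44,  # hypothetical distractor
}


def toy_probe(question: str, answer: str) -> float:
    return TOY_PROBE.get(answer, 0.0)


question = "Which company is Volvo B58 produced by?"
sampled = ["BMW Group", "Volvo Cars"]  # the correct answer was never sampled

picked_from_samples = select_answer(question, sampled, toy_probe)
picked_with_gold = select_answer(question, sampled + ["Volvo Buses"], toy_probe)
```

Here the probe would rank "Volvo Buses" first, but it is only reachable once the answer is added to the candidate set by hand.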

Breakthrough Assessment

8/10

Provides the first formal definition and quantification of hidden knowledge, revealing a fundamental 'tip-of-the-tongue' limitation in LLM generation capabilities.

⚙️ Technical Details

Problem Definition

Setting: Closed-book Question Answering (QA) assessing factual triplets (subject, relation, object)

Inputs: Natural language question q derived from a factual triplet

Outputs: A score K quantifying the ability to rank correct answers a above incorrect answers ã

Pipeline Flow

Answer Sampling (Generate 1,000 candidates + Gold Answer)
Judge Annotation (Label candidates as correct/incorrect)
Scoring (Apply Internal vs. External functions)
Ranking Evaluation (Calculate K scores)
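The first three pipeline steps can be sketched as a single function that assembles a labeled, scored candidate set; the ranking evaluation step then operates on those labels and scores. Everything here is a hypothetical stand-in: `build_candidates`, the exact-match judge (a crude substitute for an LLM judge), the dictionary-backed scorers, and the choice to deduplicate samples are assumptions of this sketch, not details confirmed by the summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Candidate:
    answer: str
    is_correct: bool
    scores: Dict[str, float]


def build_candidates(
    question: str,
    gold: str,
    sampled: Iterable[str],
    judge: Callable[[str, str, str], bool],
    scorers: Dict[str, Callable[[str, str], float]],
) -> List[Candidate]:
    # Step 1: pool the samples with the gold answer.
    # Assumption: duplicates are collapsed, keeping first-seen order.
    unique = list(dict.fromkeys(list(sampled) + [gold]))
    candidates = []
    for answer in unique:
        label = judge(question, answer, gold)  # Step 2: correctness label
        scores = {name: fn(question, answer) for name, fn in scorers.items()}  # Step 3
        candidates.append(Candidate(answer, label, scores))
    return candidates


def exact_match_judge(question: str, answer: str, gold: str) -> bool:
    """Toy judge: case-insensitive exact match instead of an LLM judge."""
    return answer.strip().lower() == gold.strip().lower()


# Invented scores for illustration only.
toy_scorers = {
    "external": lambda q, a: {"BMW Group": 0.60, "Volvo Buses": 0.01}.get(a, 0.0),
    "internal": lambda q, a: {"BMW Group": 0.30, "Volvo Buses": 0.90}.get(a, 0.0),
}

pool = build_candidates(
    "Which company is Volvo B58 produced by?",
    "Volvo Buses",
    ["BMW Group", "BMW Group"],
    exact_match_judge,
    toy_scorers,
)
```

Scoring every candidate with both functions on the same pool is what makes the internal and external rankings directly comparable.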

System Modules

Generator

Generate candidate answers for a given question

Model or implementation: Llama-3-8B-Instruct / Mistral-7B-Instruct / Gemma-2-9B-Instruct / Qwen3-32B

LLM Judge

Determine if a generated answer is semantically correct relative to the ground truth

Model or implementation: Gemini 1.5 Pro

Probing Classifier (Internal Scorer)

Predict correctness probability based on internal hidden states

Model or implementation: Linear Logistic Regression Classifier

Novel Architectural Elements

Comparative framework assessing 'Hidden Knowledge' by contrasting ranking performance of internal probes vs. external token probabilities on the exact same set of answer candidates

Modeling

Base Model: Llama-3-8B-Instruct, Mistral-7B-Instruct, Gemma-2-9B-Instruct, Qwen3-32B

Training Method: Linear Probing (Logistic Regression) on frozen LLM states

Objective Functions:

Purpose: Train a classifier to distinguish correct/incorrect answers from hidden states.

Formally: Minimize logistic regression loss on labeled (q, a) pairs.

Trainable Parameters: Linear probe weights only (LLM is frozen)
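As a rough picture of what training such a probe involves, the sketch below fits a logistic regression by plain gradient descent on the log loss. It is a minimal stand-in under stated assumptions: the two-dimensional synthetic vectors replace real LLM hidden states, and the optimizer, learning rate, and epoch count are arbitrary choices for the example rather than the paper's settings.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Probe's estimated probability that the answer behind state x is correct."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(
    states: Sequence[Sequence[float]],
    labels: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit logistic regression by full-batch gradient descent on the log loss."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(states, labels):
            err = probe_score(w, b, x) - y  # derivative of log loss w.r.t. the logit
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic stand-ins for frozen hidden states: the first coordinate
# separates correct from incorrect answers, the second carries no signal.
correct_states = [[1.0 + 0.05 * i, 0.5 - 0.02 * i] for i in range(20)]
incorrect_states = [[-1.0 - 0.05 * i, 0.5 - 0.02 * i] for i in range(20)]
states = correct_states + incorrect_states
labels = [1] * 20 + [0] * 20

weights, bias = train_probe(states, labels)
```

Only `weights` and `bias` are updated, which mirrors the frozen-LLM setup: the hidden states are treated as fixed input features.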

Training Data:

EntityQuestions dataset (derived from Wikidata)
Train/Dev/Test splits: 500 questions per relation
Relations: Spouse, Manufacturer, Record Label, Author

Compute: Not reported in the paper

Comparison to Prior Work

vs. Burns et al.: Defines knowledge via pairwise ranking of answers to a *specific* question rather than classifying independent statements
vs. Kadavath et al.: Compares P(True) (external) against internal probes, treating P(True) as a baseline rather than the gold standard
vs. Orgad et al.: Systematically quantifies the *gap* (hidden knowledge) and identifies cases where correct answers are never generated but perfectly known

Limitations

High computational cost due to sampling 1,000 answers per question and running multiple scoring methods
Definition of knowledge is restricted to single-fact triplets and does not account for related facts
K* metric (perfect knowledge) is sensitive to labeling errors by the LLM judge
Scope limited to 7B-32B parameter models due to compute constraints

Reproducibility

Code: https://github.com/zorikg/inside-out

Code is publicly available at the repository above. Uses open-weight models (Llama-3, Mistral, Gemma, Qwen). Dataset is derived from EntityQuestions. Gemini 1.5 Pro is used as the judge (closed API).

📊 Experiments & Results

Evaluation Setup

Closed-book QA on Wikidata relations

Benchmarks:

EntityQuestions (Single-hop factual QA)

Metrics:

K score (Equation 1: fraction of correctly ranked answer pairs)
K* score (Equation 3: binary indicator of perfect ranking)
QA Accuracy (Top-1 selection)
Statistical methodology: Paired t-test with p-value < 0.05
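The two ranking metrics can be written out from their verbal descriptions. The following is a reconstruction, not a copy of the paper's Equations 1 and 3: the symbols $A(q)$ (correct candidates), $\tilde{A}(q)$ (incorrect candidates), and $S$ (a scoring function, internal or external) are notation assumed here, and the paper's exact form may differ, for example in how ties are handled.

```latex
% K: fraction of (correct, incorrect) pairs that S ranks in the right order
K_q(S) = \frac{1}{|A(q)|\,|\tilde{A}(q)|}
         \sum_{a \in A(q)} \sum_{\tilde{a} \in \tilde{A}(q)}
         \mathbb{1}\big[\, S(q, a) > S(q, \tilde{a}) \,\big]

% K*: indicator that every such pair is ranked correctly
K^{*}_q(S) = \mathbb{1}\big[\, K_q(S) = 1 \,\big]
```

Written this way, $K_q(S)$ is the pairwise statistic that underlies AUC for the scorer $S$ on a single question, which is why ROC/AUC intuition carries over to this metric.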

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Internal knowledge (Probe) consistently exceeds external knowledge (P(a|q), Pnorm, P(True)) across all models.
EntityQuestions	Relative Gap in K score	Varies by model	Probe	+57% (Gemma)
Models often have 'perfect knowledge' (K*=1) internally even when they fail to generate the correct answer.
EntityQuestions	Frequency of Perfect Knowledge with Zero Generation	0	0.072	+0.072
Using internal probes for test-time selection improves QA accuracy.
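EntityQuestions (Avg across models)	Accuracy: greedy decoding vs. probe selection from sampled answers	21.2	23.7	+2.5
EntityQuestions (Avg across models)	Accuracy: greedy decoding vs. probe selection with the gold answer added to the candidates	21.2	32.0	+10.8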
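The Δ column in the two averaged-accuracy rows is in absolute points, whereas percentage improvements quoted elsewhere in this summary read as relative gains. The arithmetic below converts one into the other. It is only a worked calculation on the table's averaged numbers; `relative_gain` is a helper name introduced here, and a relative gain computed from averages need not match exactly a figure obtained by averaging per-model relative gains.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


baseline_acc = 21.2  # averaged baseline accuracy from the table

gain_first_row = relative_gain(baseline_acc, 23.7)   # about 0.118
gain_second_row = relative_gain(baseline_acc, 32.0)  # about 0.509

print(f"{gain_first_row:.1%}", f"{gain_second_row:.1%}")
```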

Experiment Figures

Average K (knowledge) scores for different scoring functions across Llama-3, Mistral, and Gemma-2.

Average K* (perfect knowledge) scores comparing scenarios with vs. without manually adding the gold answer.

Main Takeaways

LLMs consistently encode more factual knowledge internally than they express externally, with the gap varying by model (e.g., small for Llama-3, large for Gemma-2).
External verification methods like P(True) outperform generation likelihood P(a|q), confirming models are better at verifying than generating.
A fundamental 'tip-of-the-tongue' phenomenon exists: models can perfectly rank a correct answer above all distractors yet fail to generate it in 1,000 attempts.
Scaling test-time compute via sampling is limited by the generation bottleneck; significant gains are inaccessible because correct answers are simply never sampled.

📚 Prerequisite Knowledge

Prerequisites

Understanding of LLM token probabilities and decoding strategies
Familiarity with linear probing classifiers
Basic knowledge of ROC/AUC concepts for ranking

Key Terms

hidden knowledge: Factual information encoded in a model's parameters that is not expressed in its generated outputs (external signals)

external knowledge: Knowledge measured using observable signals like token-level probabilities (e.g., P(a|q))
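To make the external signals concrete, the sketch below computes two likelihood-based scores from per-token log-probabilities. It assumes that P(a|q) is the product of the answer's token probabilities and that Pnorm is its length-normalized variant (the geometric mean of token probabilities). The second assumption is an interpretation of the name, not something this summary states, and the token probabilities are invented.

```python
import math
from typing import Sequence


def p_answer(token_log_probs: Sequence[float]) -> float:
    """Sequence likelihood: product of token probabilities."""
    return math.exp(sum(token_log_probs))


def p_norm(token_log_probs: Sequence[float]) -> float:
    """Assumed length-normalized likelihood: geometric mean of token probabilities."""
    if not token_log_probs:
        raise ValueError("answer must contain at least one token")
    return math.exp(sum(token_log_probs) / len(token_log_probs))


# Invented token probabilities for a short and a long answer.
short_answer = [math.log(0.2)]      # one token with probability 0.2
long_answer = [math.log(0.5)] * 3   # three tokens with probability 0.5 each

prefers_short = p_answer(short_answer) > p_answer(long_answer)  # 0.2 vs 0.125
prefers_long = p_norm(long_answer) > p_norm(short_answer)       # 0.5 vs 0.2
```

The two scores can rank the same pair of answers in opposite orders, which is one reason to evaluate several external scoring functions on the same candidate set.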

internal knowledge: Knowledge measured using intermediate computations, such as hidden state representations accessed via a probe

K score: A metric quantifying knowledge as the fraction of (correct, incorrect) answer pairs where the correct one is ranked higher by a scoring function

K* score: A binary metric indicating 'perfect knowledge', where the model ranks every correct answer higher than every incorrect answer for a specific question

probing classifier: A simple linear model trained on LLM hidden states to predict properties (here, correctness) of the input

tip of the tongue: A cognitive state where a subject knows a fact but cannot retrieve or produce the word; applied here to LLMs knowing an answer but failing to generate it

greedy decoding: A generation strategy where the model always selects the highest-probability token at each step
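A toy example may help fix the idea. The table below is a hypothetical first-order next-token distribution, a drastic simplification of a real LLM, with probabilities invented for illustration; `greedy_decode` is a name introduced for this sketch.

```python
from typing import Dict, List

# Hypothetical next-token probabilities keyed by the previous token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"BMW": 0.6, "Volvo": 0.4},
    "BMW": {"Group": 0.9, "</s>": 0.1},
    "Group": {"</s>": 1.0},
    "Volvo": {"Buses": 0.7, "</s>": 0.3},
    "Buses": {"</s>": 1.0},
}


def greedy_decode(
    model: Dict[str, Dict[str, float]],
    start: str = "<s>",
    end: str = "</s>",
    max_len: int = 10,
) -> List[str]:
    """Pick the single most probable next token at every step."""
    tokens: List[str] = []
    prev = start
    for _ in range(max_len):
        next_probs = model[prev]
        best = max(next_probs, key=next_probs.get)
        if best == end:
            break
        tokens.append(best)
        prev = best
    return tokens


greedy_output = greedy_decode(TOY_MODEL)
```

In this toy table greedy decoding commits to "BMW" at the first step and therefore never produces "Volvo Buses", even though that sequence has non-zero probability.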

LLM judge: Using a strong language model to evaluate the correctness of answers generated by another model

SFT: Supervised Fine-Tuning, i.e., training a model on labeled examples

inference scaling: Improving model performance at test time by using more compute, typically via sampling multiple outputs and selecting the best one