Towards Reliable Latent Knowledge Estimation in LLMs: Zero-Prompt Many-Shot Based Factual Knowledge Extraction

📝 Paper Summary

Latent Knowledge Estimation (LKE), Hallucination suppression

ZP-LKE (Zero-Prompt Latent Knowledge Estimator) estimates an LLM's factual knowledge by using lists of example facts to implicitly communicate the relationship, avoiding the reliability issues of prompt engineering.

Core Problem

Existing methods for estimating whether an LLM knows a fact rely on prompt templates (e.g., 'X was born in...'), which conflate the model's actual knowledge with its ability to understand specific instructions (meta-linguistic judgment) and are vulnerable to prompt hacking.

Why it matters:

Reliable estimation is crucial to prevent hallucinations; models should not assert facts they do not actually 'know'
Current prompt-based methods introduce side-channels (giving hints) and over-fitting, making estimates unreliable
Model-specific prompt engineering prevents fair large-scale comparisons across different LLM families

Concrete Example: For the relation 'position held', a prompt like 'X is elected Y' performs better than 'X has the position of Y', but the former introduces a side-channel by implicitly ruling out non-elected positions, inflating the knowledge estimate artificially.

Key Novelty

Zero-Prompt Latent Knowledge Estimator (ZP-LKE)

Eliminates explicit instructions entirely; instead of asking 'When was Einstein born?', it provides a sequence like 'Feynman 1918 Heisenberg 1901 Einstein' and checks the completion
Leverages 'task learning' (inferring the relationship from examples) rather than just 'task recognition' (using examples to format an answer), requiring many-shot examples to work effectively

Architecture

Contrast between Prompt-based LKE (Zero-shot and Few-shot) and the proposed Zero-Prompt LKE (ZP-LKE).

Evaluation Highlights

In response testing on the T-Rex dataset, ZP-LKE raises the accuracy of extracted facts by 35% relative to Human-Generated Prompts (0.45 to 0.61) and by 90% relative to Machine-Mined Prompts (0.32 to 0.61)
In multiple-choice testing, ZP-LKE outperforms Human-Generated Prompts by 9.4% (0.71 to 0.78) and Machine-Mined Prompts by 57% (0.50 to 0.78), both as relative gains
For complex relations like 'position played on team', ZP-LKE improves extraction by 152% relative to human prompts (0.17 to 0.43)

Breakthrough Assessment

8/10

Significantly improves reliability of knowledge probing by removing prompt engineering artifacts. Offers a standardized, model-agnostic way to benchmark latent knowledge across diverse LLMs.

⚙️ Technical Details

Problem Definition

Setting: Given a fact triplet f = <x, r, y> (subject, relation, object), determine if the LLM latently stores this fact.

Inputs: A subject x and a relation r (implicitly defined by a set of example triplets)

Outputs: Probability or generation of the object y

Pipeline Flow

Example Selection: Retrieve n example facts sharing relation r from dataset
Prompt Construction: Concatenate subject-object pairs (e.g., 'x1 y1 x2 y2...') ending with target subject x
Inference: Feed sequence to LLM
Evaluation: Check completion for target object y (Response Test) or compare probabilities of candidate objects (Multiple-Choice Test)
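
The first two pipeline steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Fact` type, the function names, the relation label `birth_year`, and the single-space separator are all assumptions made for this example.

```python
import random
from typing import List, NamedTuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str


def select_examples(facts: List[Fact], relation: str, target_subject: str,
                    n: int, seed: int = 0) -> List[Fact]:
    """Pick up to n facts sharing `relation`, never including the target subject."""
    pool = [f for f in facts
            if f.relation == relation and f.subject != target_subject]
    rng = random.Random(seed)  # seeded so the selection is reproducible
    rng.shuffle(pool)
    return pool[:n]


def build_zero_prompt(examples: List[Fact], target_subject: str,
                      sep: str = " ") -> str:
    """Concatenate 'x1 y1 x2 y2 ...' and end with the target subject.

    No instruction text and no relation label is added (separator is assumed).
    """
    parts: List[str] = []
    for f in examples:
        parts.extend([f.subject, f.obj])
    parts.append(target_subject)
    return sep.join(parts)


facts = [
    Fact("Feynman", "birth_year", "1918"),
    Fact("Heisenberg", "birth_year", "1901"),
    Fact("Einstein", "birth_year", "1879"),
    Fact("Paris", "capital_of", "France"),
]
prompt = build_zero_prompt([facts[0], facts[1]], "Einstein")
print(prompt)  # Feynman 1918 Heisenberg 1901 Einstein
```

The resulting string is what would be fed to the LLM in the inference step; a real run would use many more examples than the two shown here.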

System Modules

Prompt Constructor

Create the zero-prompt input string from training facts

Model or implementation: Deterministic algorithm

LLM Inference

Generate completion or compute token probabilities

Model or implementation: Various open-source LLMs (Llama-2, Mistral, Gemma, OPT, Pythia)

Evaluator

Determine if the fact is 'known' based on output

Model or implementation: Algorithmic comparator
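
A rough sketch of what such a comparator might look like for the two test types. The matching rule (case-insensitive prefix match) and the `score` callback are assumptions for illustration; the paper's exact matching and scoring procedure may differ, and `toy_score` merely stands in for a real model's candidate log-probabilities.

```python
from typing import Callable, Sequence


def response_test(completion: str, target: str) -> bool:
    """Response test: the fact counts as known if the generated text
    starts with the target object (case-insensitive prefix match)."""
    return completion.strip().lower().startswith(target.strip().lower())


def multiple_choice_test(
    prompt: str,
    target: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> bool:
    """Multiple-choice test: the fact counts as known if the target is the
    top-scoring candidate. Ties resolve to the earliest candidate."""
    best = max(candidates, key=lambda c: score(prompt, c))
    return best == target


# Toy stand-in for a model: fixed log-probabilities per candidate (hypothetical).
toy_logprobs = {"1879": -0.7, "1901": -2.3, "1918": -3.1}


def toy_score(prompt: str, candidate: str) -> float:
    return toy_logprobs[candidate]


prompt = "Feynman 1918 Heisenberg 1901 Einstein"
known_by_generation = response_test(" 1879 Bohr 1885", "1879")
known_by_ranking = multiple_choice_test(
    prompt, "1879", list(toy_logprobs), toy_score
)
print(known_by_generation, known_by_ranking)
```

Passing the scorer in as a callback keeps the comparator independent of any particular model or library.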

Novel Architectural Elements

Zero-prompt input structure: Strictly sequence completion of <x, y> pairs without any natural language instructions or relation labels

Modeling

Base Model: Evaluated on 49 models including Llama-2-70b, Mistral-7B, Gemma-7B, OPT-66B, Pythia-12B

Compute: Not reported in the paper (Inference-only evaluation)

Comparison to Prior Work

vs. LAMA: ZP-LKE uses no template instructions, relying solely on ICL examples
vs. LPAQA: ZP-LKE avoids optimization of prompt templates which can overfit or introduce side-channels
vs. AutoPrompt: ZP-LKE is interpretable (uses actual facts) rather than opaque optimized tokens and requires no gradient access/training
vs. Few-Shot Prompting (FS-LKE): ZP-LKE uses examples to communicate the *task* (what relation is being tested), whereas FS-LKE uses examples primarily to format the output, a conceptually distinct use of ICL

Limitations

Vulnerable to incorrect examples in the context window (performance drops significantly)
Requires many shots, and therefore a long context window, to work effectively, unlike few-shot prompting, which needs only a handful of examples
Performance depends on the specific choice and ordering of in-context examples (though less so than prompt wording)
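
One way to probe the ordering sensitivity noted above is to query the same target with several permutations of the same examples and see how often the fact is judged known. The sketch below only builds the permuted prompts and aggregates outcomes; the helper names are hypothetical and the model call is left out.

```python
import random
from typing import List, Sequence, Tuple


def permuted_prompts(pairs: Sequence[Tuple[str, str]], target_subject: str,
                     k: int, seed: int = 0) -> List[str]:
    """Build k zero-prompt strings that differ only in example order."""
    rng = random.Random(seed)  # seeded for reproducibility
    prompts = []
    for _ in range(k):
        order = list(pairs)
        rng.shuffle(order)
        tokens = [tok for pair in order for tok in pair]
        prompts.append(" ".join(tokens + [target_subject]))
    return prompts


def known_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of prompt variants under which the fact was judged known."""
    return sum(outcomes) / len(outcomes)


pairs = [("Feynman", "1918"), ("Heisenberg", "1901"), ("Bohr", "1885")]
prompts = permuted_prompts(pairs, "Einstein", k=4)
for p in prompts:
    print(p)
```

A known rate well below 1.0 across permutations would indicate that the estimate for that fact depends on example order.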

Reproducibility

Code: https://github.com/QinyuanWu0710/ZeroPrompt_LKE

Data are drawn from T-Rex and Wikidata (both public). The exact hyperparameters for example selection and the random seeds for distractors are not fully detailed in the text but are likely in the code.

📊 Experiments & Results

Evaluation Setup

Probing factual knowledge using T-Rex dataset relations

Benchmarks:

T-Rex / Wikidata (Knowledge Graph Triplet Completion)

Metrics:

Accuracy (exact match in generation)
Accuracy (multiple-choice probability ranking)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparison of ZP-LKE against Human-Generated Prompts (HGP) and Machine-Mined Prompts (MMP) using generation-based testing (P@1).
Benchmark	Metric	Baseline	This Paper	Δ
T-Rex (Generation)	Accuracy	0.45 (HGP)	0.61	+0.16
T-Rex (Generation)	Accuracy	0.32 (MMP)	0.61	+0.29
Comparison using Multiple-Choice testing (ranking 100 options), which isolates knowledge from formatting issues.
Benchmark	Metric	Baseline	This Paper	Δ
T-Rex (Multiple Choice)	Accuracy	0.71 (HGP)	0.78	+0.07
T-Rex (Multiple Choice)	Accuracy	0.50 (MMP)	0.78	+0.28
T-Rex (Multiple Choice), relation 'position played on team'	Accuracy	0.17 (HGP)	0.43	+0.26
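
The Δ column above reports absolute differences in accuracy, while the Evaluation Highlights quote relative improvements. The short sketch below shows both calculations on the generation-test rows. Values recomputed from these rounded accuracies can differ slightly from the quoted percentages, which were presumably derived from unrounded scores.

```python
def absolute_gain(baseline: float, new: float) -> float:
    """Difference in accuracy, as in the Δ column."""
    return new - baseline


def relative_gain_pct(baseline: float, new: float) -> float:
    """Improvement as a percentage of the baseline accuracy."""
    return 100.0 * (new - baseline) / baseline


rows = [
    ("HGP, generation", 0.45, 0.61),
    ("MMP, generation", 0.32, 0.61),
]
for name, baseline, new in rows:
    print(f"{name}: +{absolute_gain(baseline, new):.2f} absolute, "
          f"+{relative_gain_pct(baseline, new):.1f}% relative")
```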

Main Takeaways

ZP-LKE consistently surfaces more latent knowledge than prompt-based methods, suggesting previous methods underestimated model knowledge due to poor prompt comprehension.
Instruction fine-tuning tends to reduce the amount of extractable factual knowledge compared to base models.
Larger models and models from stronger families (Llama-2, Mistral) consistently encode more factual knowledge.
ZP-LKE relies on 'task learning' (inferring the relation from many examples) rather than 'task recognition', evidenced by its sensitivity to example correctness.

📚 Prerequisite Knowledge

Prerequisites

Understanding of In-Context Learning (ICL) in LLMs
Familiarity with Knowledge Graphs and triplet representation <subject, relation, object>
Basic concepts of prompt engineering (zero-shot vs. few-shot)

Key Terms

ZP-LKE: Zero-Prompt Latent Knowledge Estimator—the proposed method that uses only lists of example facts to probe knowledge, without instructional text

LKE: Latent Knowledge Estimator—methods attempting to determine what facts are stored inside an LLM's parameters

meta-linguistic judgment: The ability of an LLM to understand instructions about language usage (e.g., 'answer in JSON format'), which often interferes with pure knowledge retrieval

side-channel: Information unintentionally leaked to the model via the prompt template (e.g., using 'elected' implies a political role), making the task easier than intended

T-Rex: A large-scale dataset that aligns Wikidata triplets with Wikipedia text, used here for benchmarking knowledge estimation

Wikidata: A structured knowledge base used as the source of ground truth facts

ICL: In-Context Learning—the ability of LLMs to learn tasks from examples provided in the prompt without parameter updates

HGP: Human-Generated Prompts—manual templates written by humans to query the model (e.g., 'X was born in Y')

MMP: Machine-Mined Prompts—templates automatically discovered from large corpora (e.g., Wikipedia) that frequently co-occur with the fact triplets

FS-LKE: Few-Shot Latent Knowledge Estimator—using prompt templates combined with a few examples

ZS-LKE: Zero-Shot Latent Knowledge Estimator—using prompt templates without any examples