Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion

📝 Paper Summary

Modularized RAG pipeline

LLM-based query expansion methods improve zero-shot retrieval primarily when the model has already memorized the target evidence during pre-training, rather than by generating truly hypothetical documents.

Core Problem

LLM-based query expansion (QE) methods assume that generated 'hypothetical' documents help retrieval even if they contain errors, but it is unclear whether the gains stem from genuine reasoning or from reproducing memorized training data.

Why it matters:

If QE relies on memorization, it may fail in real-world scenarios requiring niche or novel knowledge not present in the LLM's training corpus
Benchmarks may artificially inflate the perceived effectiveness of QE methods if the test set knowledge has leaked into the model's pre-training data
Understanding this mechanism is crucial for developing robust retrieval systems that handle unknown or evolving information

Concrete Example: For the claim 'Lunt's star is a high-proper motion star in the constellation of Centaurus,' an LLM might generate a document that matches the Wikipedia evidence nearly verbatim because it saw that text during training. If the claim were instead about a newly discovered star absent from the training set, the LLM would likely fail to generate useful search terms, and retrieval would suffer.

Key Novelty

Knowledge Leakage Hypothesis for Query Expansion

Proposes that LLMs generate 'hypothetical' documents by reproducing memorized text from their pre-training data (knowledge leakage) rather than creating new content.
Uses Natural Language Inference (NLI) to check whether generated documents contain sentences entailed by the ground-truth evidence, and correlates this 'match' with downstream retrieval performance.

Architecture

Contrast between the assumption of 'Hypothetical Documents' and the reality of 'Knowledge Leakage'.

Evaluation Highlights

Retrieval performance consistently improves only when generated documents contain sentences entailed by gold evidence (statistically significant at p < 0.001)
When generated documents do NOT match gold evidence (unmatched), performance is often worse than a simple retrieval baseline (e.g., BM25) without expansion
High rates of potential leakage observed: up to 83.5% of FEVER claims resulted in generated documents containing gold evidence when using GPT-4o-mini

Breakthrough Assessment

7/10

Provides critical empirical evidence challenging the 'hypothetical' nature of HyDE/Query2doc. While it doesn't propose a new method, it fundamentally changes how we interpret QE success on standard benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Fact verification consisting of evidence retrieval from a knowledge store K and verdict prediction for a claim c

Inputs: A textual claim c

Outputs: A veracity label (supported, refuted, etc.) and a retrieved evidence set

Pipeline Flow

Query Expansion: LLM generates document d based on query q
Retrieval: Search system uses expanded query (q + d) to find evidence E
Analysis (Paper specific): NLI checks whether sentences in d are entailed by gold evidence
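
The expansion and retrieval steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate_pseudo_document` is a hypothetical stand-in for the LLM call and returns canned text, the BM25 scorer is a compact textbook variant with assumed parameters k1 = 1.2 and b = 0.75, and the expanded query is formed by plain string concatenation of q and d.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that trims surrounding punctuation."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t]


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return document indices sorted by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        # Repeated query terms contribute once per occurrence in the query.
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def generate_pseudo_document(claim):
    """Hypothetical stand-in for the LLM generator; a real system would prompt a model."""
    return "Lunt's star is a high-proper motion star located in Centaurus."


def expand_and_retrieve(claim, knowledge_store, generator=generate_pseudo_document, top_k=5):
    """Build the expanded query (q + d) and return the top_k documents."""
    pseudo_doc = generator(claim)
    expanded_query = claim + " " + pseudo_doc
    ranking = bm25_rank(expanded_query, knowledge_store)
    return [knowledge_store[i] for i in ranking[:top_k]]
```

If the canned text is swapped for a generator that knows nothing about the claim, the expanded query adds only noise terms, which is the 'unmatched' situation the summary describes.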

System Modules

Generator

Generate hypothetical document or pseudo-document based on the claim

Model or implementation: Various LLMs (GPT-4o-mini, GPT-3.5-Turbo, Llama-3-8B-Instruct, etc.)

Retriever

Retrieve evidence from knowledge store using expanded query

Model or implementation: BM25 (for Query2doc) or Contriever (for HyDE)

Matcher

Determine whether the generated document contains sentences entailed by gold evidence (for analysis, not inference)

Model or implementation: GPT-4o-mini acting as NLI judge
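
For the dense (Contriever-style) retriever listed above, the generated text is used through its embedding rather than through its terms. The sketch below shows one common way to do this: average the query embedding with the embeddings of the generated documents, then rank by cosine similarity. This summary does not spell out the exact combination rule, so the averaging step is an assumption, and the bag-of-words `embed` function is a toy stand-in for a trained dense encoder.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokenizer that keeps apostrophes and hyphens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def build_vocab(docs):
    """Fixed vocabulary over the knowledge store (toy setup only)."""
    return sorted({t for d in docs for t in tokens(d)})


def embed(text, vocab):
    """Toy bag-of-words vector; a real system would call a dense encoder here."""
    counts = Counter(tokens(text))
    return [float(counts[w]) for w in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hyde_query_vector(query, generated_docs, vocab):
    """Average the query embedding with the generated-document embeddings (assumed rule)."""
    vectors = [embed(query, vocab)] + [embed(d, vocab) for d in generated_docs]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dense_rank(query_vector, docs, vocab):
    """Return document indices sorted by descending cosine similarity."""
    sims = [cosine(query_vector, embed(d, vocab)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)
```

Because the generated documents dominate the averaged vector, a generation that restates the gold evidence pulls the query vector directly toward that evidence.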

Novel Architectural Elements

NLI-based leakage detection pipeline: Validates query expansion quality by checking entailment against ground truth evidence
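
A rough sketch of the matched/unmatched labeling follows. It uses the criterion stated under Evaluation Highlights: a generated document counts as 'matched' when at least one of its sentences is entailed by gold evidence. The entailment judge is passed in as a callable; `token_overlap_judge` is a hypothetical placeholder based on token overlap and is NOT a real NLI model (the paper uses GPT-4o-mini as the judge). The sentence splitter and the 0.8 threshold are also assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def word_set(text):
    return set(re.findall(r"[a-z0-9'-]+", text.lower()))


def token_overlap_judge(premise, hypothesis, threshold=0.8):
    """Placeholder judge: fraction of hypothesis tokens found in the premise.

    This only approximates entailment for near-verbatim text; a real
    pipeline would call an NLI model or an LLM judge instead.
    """
    hyp = word_set(hypothesis)
    if not hyp:
        return False
    return len(hyp & word_set(premise)) / len(hyp) >= threshold


def is_matched(generated_doc, gold_evidence, entails=token_overlap_judge):
    """True if any generated sentence is entailed by any gold evidence sentence."""
    return any(
        entails(premise=gold, hypothesis=sentence)
        for sentence in split_sentences(generated_doc)
        for gold in gold_evidence
    )
```

Keeping the judge as a parameter makes it easy to swap the placeholder for a stronger model without touching the labeling logic.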

Modeling

Base Model: Evaluated multiple backbones: GPT-4o-mini, GPT-3.5-Turbo, Gemini-1.5-flash, Llama-3-8B-Instruct, Llama-3-70B-Instruct, Mixtral-8x7B-Instruct, Gemma-2-9B-It

Comparison to Prior Work

vs. HyDE/Query2doc: This paper analyzes why these methods work rather than proposing a new method. It shows that their gains are conditional on knowledge leakage.
vs. General Leakage Studies (e.g., constructing cloze tests): Specifically targets the query expansion mechanism in retrieval pipelines rather than general QA or generation leakage.

Limitations

No causal link established: shows correlation between entailment and performance, but doesn't prove training data caused the generation
Limited to fact verification: findings might not generalize to other retrieval tasks like open-domain QA
Reliance on automated NLI: uses GPT-4o-mini for judging entailment, which may have its own biases (though validated on a small sample)

📊 Experiments & Results

Evaluation Setup

Zero-shot fact verification using external knowledge stores (Wikipedia, etc.)

Benchmarks:

FEVER (Fact Verification)
SciFact (Scientific Claim Verification)
AVeriTeC (Real-world Fact Checking)

Metrics:

Recall@5
NDCG@5
Recall@10
NDCG@10
Macro F1 (Verdict Prediction)
METEOR (for AVeriTeC)
BERTScore (for AVeriTeC)
Statistical methodology: Mann–Whitney U test
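
The two ranking metrics listed above can be computed as follows. This is a minimal sketch assuming binary relevance (a document is either gold evidence or not) and a ranked list without duplicate document IDs; the paper's exact evaluation scripts are not described in this summary.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), and so on.
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0
```

Recall@k ignores where in the top k a hit lands, while NDCG@k rewards placing gold evidence earlier, which is why the two can move differently under query expansion.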

Key Results

Main comparison: QE methods (Query2doc, HyDE) generally outperform baselines across benchmarks.

Benchmark	Metric	Baseline	This Paper	Δ
FEVER	Recall@5	47.76	69.15	+21.39
SciFact	NDCG@5	44.60	59.28	+14.68

Leakage analysis: performance gains depend on whether the generated document contains sentences entailed by the gold evidence ('Matched' vs. 'Unmatched').

Benchmark	Metric	Baseline	This Paper	Δ
FEVER	Recall@5	48.24	73.54	+25.30
SciFact	Recall@5	40.23	89.26	+49.03
AVeriTeC	METEOR	30.14	36.85	+6.71

Main Takeaways

QE methods (HyDE, Query2doc) are highly effective on average, but the gains are driven by 'Matched' claims where the LLM reproduces gold evidence.
For 'Unmatched' claims (where the LLM fails to reproduce evidence), QE performance often drops below simple baselines like BM25 or Contriever.
For a significant portion of claims (up to 83.5% on FEVER), the generated documents contain sentences entailed by gold evidence, suggesting high leakage.
Findings hold across 7 different LLMs (open and closed) and 3 datasets, indicating a systemic issue with evaluating QE on standard benchmarks.
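
The matched-versus-unmatched comparison behind these takeaways can be sketched as a two-group analysis of per-claim scores. The code below computes group means and a Mann-Whitney U statistic with a two-sided p-value from the normal approximation. It is a simplified illustration: it applies no tie or continuity correction, the record fields `matched` and `recall` are hypothetical names, and any numbers fed to it in a demo are toy values rather than results from the paper.

```python
import math


def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys and a two-sided p-value.

    Uses the large-sample normal approximation without tie or continuity
    correction, so it is only a rough guide for small or heavily tied samples.
    """
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    # Count pairs where x beats y; ties count as half a win.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    mean_u = n1 * n2 / 2
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / std_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


def compare_groups(records):
    """Split per-claim records by the 'matched' flag and compare their scores."""
    matched = [r["recall"] for r in records if r["matched"]]
    unmatched = [r["recall"] for r in records if not r["matched"]]
    u, p_value = mann_whitney_u(matched, unmatched)
    return {
        "matched_mean": sum(matched) / len(matched),
        "unmatched_mean": sum(unmatched) / len(unmatched),
        "u": u,
        "p_value": p_value,
    }
```

A rank-based test suits this setting because per-claim retrieval scores are bounded and often far from normally distributed.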

📚 Prerequisite Knowledge

Prerequisites

Understanding of zero-shot retrieval and query expansion
Familiarity with Natural Language Inference (NLI) concepts (entailment vs. contradiction)
Basic knowledge of fact verification tasks (FEVER, SciFact)

Key Terms

Query Expansion (QE): Technique to improve search results by adding relevant terms or generating pseudo-documents to enrich the original query

HyDE: Hypothetical Document Embeddings, a method in which an LLM generates a hypothetical document that answers the query; the document is embedded with a dense encoder, and that embedding is used to retrieve real documents

NLI: Natural Language Inference, the task of determining whether a hypothesis is entailed by, contradicts, or is neutral with respect to a premise

Gold Evidence: The ground-truth sentences or documents required to verify a claim in a dataset

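BM25: A standard probabilistic ranking function that scores documents by query-term matches, weighting each term by its frequency in the document and its rarity in the collection, with document-length normalization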

Contriever: A dense retrieval model trained via contrastive learning to match queries and documents in vector space

Knowledge Leakage: When a model performs well on a test set because it has seen the test data (or related information) during its pre-training

Recall@k: The proportion of all relevant documents that appear within the top k retrieved results

NDCG@k: Normalized Discounted Cumulative Gain—a metric measuring the quality of ranking, giving more weight to relevant items appearing earlier

Fact Verification: The task of assessing whether a claim is true or false based on retrieved evidence