Dense X Retrieval: What Retrieval Granularity Should We Use?

📝 Paper Summary

Modularized RAG pipeline, Retrieval granularity

Indexing retrieval corpora by 'propositions'—atomic, self-contained factoids—significantly improves dense retrieval and downstream QA performance compared to using fixed-length passages or sentences.

Core Problem

Standard retrieval units such as 100-word passages often contain irrelevant details that distract models, while individual sentences often lack the context needed to be understood on their own (for example, pronouns whose referents appear in other sentences).

Why it matters:

Irrelevant details in retrieved passages consume context window space and can confuse generation models in RAG pipelines
Sentence-level retrieval fails when sentences depend on surrounding text for meaning (e.g., 'He did it' is meaningless without knowing who 'He' is)
Dense retrievers often fail to generalize to new domains when indexed on coarse units that dilute the density of relevant information

Concrete Example: A passage about the Leaning Tower of Pisa contains details about restoration and displacement, and a question asks about the 'current lean'. Retrieving the full passage brings in irrelevant history. Retrieving the sentence 'it leans 3.99 degrees' loses the fact that 'it' refers to the tower. The proposition 'The Leaning Tower of Pisa currently leans at about 3.99 degrees' is precise and self-contained.

Key Novelty

Proposition-Level Retrieval

Decompose text into 'propositions': atomic, self-contained statements of fact that resolve coreferences (e.g., replacing 'he' with the entity name) and isolate distinct pieces of meaning
Index the corpus at this fine-grained level, allowing the retriever to match queries directly to precise facts rather than noisy passages or context-poor sentences
Use these compact units in RAG prompts to increase the density of relevant information within a fixed token budget
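
The idea above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Proposition` record, `build_index`, and `pack_prompt` are hypothetical names, the propositions are hand-written, and the budget is counted in whitespace-separated words rather than model tokens.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    text: str        # self-contained statement with coreferences resolved
    passage_id: str  # source passage, kept so hits can be traced back

def build_index(passages: dict[str, list[str]]) -> list[Proposition]:
    """Flatten {passage_id: [proposition, ...]} into one list of retrieval units."""
    return [Proposition(text, pid) for pid, props in passages.items() for text in props]

def pack_prompt(ranked: list[Proposition], max_words: int) -> list[str]:
    """Add whole units in rank order until the next one would exceed the budget."""
    picked, used = [], 0
    for unit in ranked:
        n = len(unit.text.split())
        if used + n > max_words:
            break
        picked.append(unit.text)
        used += n
    return picked

# Hand-written propositions for illustration (assumption, not model output).
index = build_index({
    "pisa-0": [
        "The Leaning Tower of Pisa currently leans at about 3.99 degrees.",
        "The Leaning Tower of Pisa underwent restoration work.",
    ],
})
context = pack_prompt(index, max_words=12)
```

With a 12-word budget only the first proposition fits, which is the point of the smaller unit: the budget is spent on one complete fact rather than on a truncated passage.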

Architecture

Figure: comparison of three retrieval granularities (passage, sentence, proposition), illustrating how a passage is split into propositions.

Evaluation Highlights

+12.0 Recall@5 improvement on average across 5 QA datasets using unsupervised SimCSE when indexing by propositions instead of passages
+10.1 average Recall@20 improvement for unsupervised retrievers and +2.7 for supervised retrievers over passage-based indexing
Achieves higher downstream QA accuracy with fewer tokens: GTR with propositions outperforms passages by +2.8 EM@500 on LLaMA-2-7B

Breakthrough Assessment

7/10

Simple yet highly effective intervention at the data indexing level. Strong empirical gains, particularly for unsupervised retrieval and cross-task generalization, without requiring retriever retraining.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (QA) using dense retrieval over a corpus indexed at different granularities

Inputs: Natural language question q

Outputs: Answer a (for QA task) or set of relevant text units (for retrieval task)

Pipeline Flow

Propositionizer (Offline Data Processing)
Dense Retriever (Inference)
Reader/Generator (Downstream Inference)
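
The three stages above split into one offline step and two query-time steps. The sketch below shows only that wiring; every function in it is a hypothetical stand-in. In particular, `toy_propositionize` is a naive sentence splitter and `toy_retrieve` ranks by word overlap, so neither reflects the models the paper uses.

```python
import re
from typing import Callable, List

def tokens(text: str) -> set:
    """Lowercased word set, used only by the toy retriever below."""
    return set(re.findall(r"\w+", text.lower()))

def build_units(passages: List[str],
                propositionize: Callable[[str], List[str]]) -> List[str]:
    """Offline: run once over the corpus, before any query arrives."""
    units: List[str] = []
    for passage in passages:
        units.extend(propositionize(passage))
    return units

def answer(question: str, units: List[str], retrieve, read, k: int = 5) -> str:
    """Query time: retrieve the top-k units, then pass them to the reader."""
    return read(question, retrieve(question, units, k))

def toy_propositionize(passage: str) -> List[str]:
    # Placeholder: naive sentence split, NOT real proposition decomposition.
    return [s.strip() + "." for s in passage.split(".") if s.strip()]

def toy_retrieve(question: str, units: List[str], k: int) -> List[str]:
    # Placeholder: word-overlap ranking, NOT a dense retriever.
    q = tokens(question)
    return sorted(units, key=lambda u: len(q & tokens(u)), reverse=True)[:k]

def toy_read(question: str, units: List[str]) -> str:
    # Placeholder reader: returns the top-ranked unit verbatim.
    return units[0] if units else ""

units = build_units(
    ["Paris is the capital of France. Berlin is the capital of Germany."],
    toy_propositionize,
)
result = answer("What is the capital of France?", units, toy_retrieve, toy_read, k=1)
```

Because the stages are passed in as callables, swapping the retrieval unit only changes what `build_units` produces; the retriever and reader are untouched.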

System Modules

Propositionizer

Segment and rewrite text into atomic propositions (Offline step)

Model or implementation: Flan-T5-large (fine-tuned on GPT-4 distilled data)

Dense Retriever

Encode query and retrieval units (passages/sentences/propositions) to find best matches

Model or implementation: Various Dual-Encoders (SimCSE, Contriever, DPR, GTR)

Reader / Generator

Generate answer based on retrieved units

Model or implementation: Fusion-in-Decoder (T5-large) OR LLaMA-2-7B (In-context learning)
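
For the Dense Retriever module, the ranking step of a dual encoder can be sketched as below. The vectors are made-up three-dimensional toys standing in for encoder outputs, and `top_k` is a hypothetical helper. Cosine similarity is used here for illustration; some dual encoders score with the raw inner product instead.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float],
          unit_vecs: Dict[str, Sequence[float]],
          k: int) -> List[str]:
    """Rank unit ids by similarity to the query and keep the best k."""
    scored = sorted(
        ((cosine(query_vec, vec), uid) for uid, vec in unit_vecs.items()),
        reverse=True,
    )
    return [uid for _, uid in scored[:k]]

# Toy embeddings (assumption): in practice these come from the query and
# unit encoders, and the unit vectors are computed once at indexing time.
query = [1.0, 0.0, 1.0]
units = {
    "prop-a": [0.9, 0.1, 0.8],
    "prop-b": [0.1, 0.9, 0.0],
    "passage-x": [0.5, 0.5, 0.5],
}
ranking = top_k(query, units, k=2)
```

The same scoring code applies whether the indexed units are passages, sentences, or propositions; only the contents of `units` change.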

Novel Architectural Elements

Use of 'Proposition' as the fundamental unit for indexing and retrieval in dense retrieval systems
FactoidWiki dataset construction pipeline involving a specialized 'Propositionizer' model

Modeling

Base Model: Flan-T5-large (for Propositionizer), various BERT/T5 bases (for Retrievers)

Training Method: Two-step distillation for Propositionizer

Adaptation: Fine-tuning Flan-T5-large

Training Data:

Seed set: 42k passages processed by GPT-4 to generate propositions
Fine-tuning set: The resulting passage-to-propositions pairs from the seed set, used to fine-tune Flan-T5-large
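
One way to picture the distillation data is as sequence-to-sequence pairs: a passage as input and the teacher's list of propositions as target. The sketch below assumes the target is serialized as a JSON list of strings; that format, and the helper names `make_example` and `parse_output`, are assumptions for illustration rather than details stated in this summary.

```python
import json
from typing import Dict, List

def make_example(passage: str, teacher_propositions: List[str]) -> Dict[str, str]:
    """Build one (input, target) pair for fine-tuning the student model."""
    return {"input": passage, "target": json.dumps(teacher_propositions)}

def parse_output(generated: str) -> List[str]:
    """Parse a student generation back into propositions; drop malformed output."""
    try:
        parsed = json.loads(generated)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [p for p in parsed if isinstance(p, str)]

# Hand-written teacher output for illustration (assumption, not real GPT-4 output).
example = make_example(
    "The tower leans. It is in Pisa.",
    ["The tower leans.", "The tower is in Pisa."],
)
recovered = parse_output(example["target"])
```

Parsing defensively matters at corpus scale, since a small student model will occasionally emit text that is not valid JSON.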

Compute: Not reported in the paper

Comparison to Prior Work

vs. DensePhrases: Retrieves complete semantic units (propositions) rather than spans, ensuring self-contained context
vs. ColBERT: Changes the indexing unit itself (data-centric) rather than the retrieval architecture (model-centric)
vs. Sarthi et al.: Focuses on atomic decomposition of existing text rather than abstractive summarization

Limitations

Proposition generation adds an offline computational cost (inference on 6M Wikipedia pages)
Depends on the quality of the Propositionizer; errors in splitting or rewriting could propagate (though analysis showed high faithfulness)
Supervised retrievers (DPR) showed smaller gains or slight regressions on in-domain tasks (NQ, WebQ) compared to unsupervised ones

Reproducibility

Code: https://github.com/chentong0/factoid-wiki

📊 Experiments & Results

Evaluation Setup

Open-domain QA with retrieval from Wikipedia (FactoidWiki)

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (TQA) (Open-domain QA)
Web Questions (WebQ) (Open-domain QA)
SQuAD (Reading Comprehension / QA)
EntityQuestions (EQ) (Entity-centric QA)

Metrics:

Passage Recall@5
Passage Recall@20
Exact Match (EM) @ 100/500 tokens
Statistical methodology: Not explicitly reported in the paper
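
The metrics above can be sketched as follows. Two assumptions are made here and are not confirmed by this summary: that sentence or proposition hits are mapped back to their source passages and deduplicated before Passage Recall@k is computed, and that Exact Match uses SQuAD-style normalization (lowercasing, stripping punctuation and articles). All function names are hypothetical.

```python
import re
import string
from typing import Dict, List

def top_k_passages(ranked_units: List[str],
                   unit_to_passage: Dict[str, str],
                   k: int) -> List[str]:
    """Map ranked units to source passages and keep the first k unique passages."""
    seen, out = set(), []
    for unit_id in ranked_units:
        pid = unit_to_passage[unit_id]
        if pid not in seen:
            seen.add(pid)
            out.append(pid)
            if len(out) == k:
                break
    return out

def recall_at_k(hits: List[bool]) -> float:
    """Percentage of questions whose answer was found in the top-k passages."""
    return 100.0 * sum(hits) / len(hits)

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}

passages = top_k_passages(["p1", "p2", "p3"], {"p1": "A", "p2": "A", "p3": "B"}, k=2)
score = recall_at_k([True, False, True, True])
```

Deduplicating at the passage level keeps the comparison fair: several propositions from one passage count as a single retrieved passage.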

Key Results

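Retrieval performance comparison showing the impact of granularity (Passage vs. Sentence vs. Proposition) on Recall@5 across different retrievers. Baseline is passage-level indexing and This Paper is proposition-level indexing. The rows do not name the retriever; per the Evaluation Highlights, the +12.0 row corresponds to SimCSE.

Benchmark	Metric	Baseline	This Paper	Δ
Average (5 datasets)	Recall@5	34.3	46.3	+12.0
Average (5 datasets)	Recall@5	41.3	50.6	+9.3
EntityQuestions	Recall@5	47.7	59.8	+12.1
Natural Questions	Recall@5	73.5	70.3	-3.2

Downstream QA performance using retrieval-augmented LLaMA-2-7B, comparing the effect of retrieval unit type on Exact Match scores. The rows do not name the retriever; per the Evaluation Highlights, the +2.8 row corresponds to GTR.

Benchmark	Metric	Baseline	This Paper	Δ
Average (5 datasets)	EM @ 500 tokens	42.8	45.6	+2.8
Average (5 datasets)	EM @ 500 tokens	30.3	34.4	+4.1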

Experiment Figures

Passage Recall@5 on EntityQuestions dataset plotted against Entity Frequency (log scale).

Recall of the gold answer within the top-l retrieved words (information density analysis).
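
The information density measurement in the second figure can be approximated as: concatenate the ranked units, keep only the first l words, and check whether a gold answer appears in that window. The sketch below assumes whitespace tokenization and case-insensitive substring matching, and its example texts are invented for illustration.

```python
from typing import List

def first_l_words(ranked_units: List[str], l: int) -> str:
    """Concatenate ranked units and truncate to the first l whitespace words."""
    words: List[str] = []
    for unit in ranked_units:
        words.extend(unit.split())
        if len(words) >= l:
            break
    return " ".join(words[:l])

def answer_within_l(ranked_units: List[str], answers: List[str], l: int) -> bool:
    """True if any gold answer string occurs inside the first l retrieved words."""
    window = first_l_words(ranked_units, l).lower()
    return any(a.lower() in window for a in answers)

# Invented example texts (assumption): one passage-style unit, one proposition.
passage_ranking = [
    "The tower has a long history of restoration work and the tower "
    "currently leans at about 3.99 degrees"
]
proposition_ranking = [
    "The Leaning Tower of Pisa currently leans at about 3.99 degrees."
]
passage_hit = answer_within_l(passage_ranking, ["3.99"], l=11)
proposition_hit = answer_within_l(proposition_ranking, ["3.99"], l=11)
```

In this toy case the proposition surfaces the answer within 11 words, while the passage spends those words on background before reaching it.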

Main Takeaways

Propositions significantly outperform passages and sentences for unsupervised retrievers (SimCSE, Contriever) and out-of-domain supervised settings (EntityQuestions, SQuAD).
Retrieval by proposition offers better cross-task generalization, particularly for long-tail entities where passage-level context might be noisy.
In downstream RAG, propositions allow for a higher density of relevant information within a fixed context window (e.g., 100-500 tokens), consistently improving QA performance.
The 'Propositionizer' approach decouples the retrieval unit from the training unit, allowing performance gains without retraining the dense retriever.

📚 Prerequisite Knowledge

Prerequisites

Understanding of dense retrieval (dual-encoder architectures)
Familiarity with Retrieval-Augmented Generation (RAG)
Basic knowledge of open-domain QA benchmarks

Key Terms

proposition: An atomic, self-contained expression of a distinct factoid within text, generated to stand alone without surrounding context

FactoidWiki: A processed version of English Wikipedia where each page is segmented into passages, sentences, and propositions for retrieval experiments

dense retrieval: Retrieval method using vector embeddings to match queries and documents, as opposed to keyword matching

Recall@k: The percentage of questions for which the correct answer appears in at least one of the top-k retrieved units (reported here at the passage level, as Passage Recall@k)

EM: Exact Match—a metric measuring if the predicted answer string exactly matches the ground truth

SimCSE: A contrastive learning framework for training sentence embeddings

Contriever: An unsupervised dense retriever trained via contrastive learning

DPR: Dense Passage Retriever—a supervised dual-encoder model trained on QA pairs

GTR: Generalizable T5-based dense retriever

FiD: Fusion-in-Decoder—a method where a model encodes retrieved passages independently and fuses them in the decoder to generate an answer

BM25: A probabilistic retrieval function based on term frequency and inverse document frequency (keyword matching)