FiD: Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

📝 Paper Summary

Modularized RAG pipeline, Open Domain Question Answering

Fusion-in-Decoder (FiD) processes retrieved passages independently in the encoder but jointly in the decoder, allowing generative models to scale efficiently to large numbers of supporting documents.

Core Problem

Existing generative approaches for open-domain QA require massive parameters to store knowledge, while extractive approaches struggle to aggregate evidence from multiple passages effectively.

Why it matters:

Purely generative models (like GPT-3) are expensive to train and query because knowledge must be stored in weights
Extractive models (like DrQA) often fail to combine information distributed across multiple documents
Previous retrieval-augmented generative models were computationally expensive (quadratic complexity) when scaling to many retrieved passages

Concrete Example: When answering 'Where was Alan Turing born?', a standard model might need to read 100 passages. If it concatenates them all into one long input, the self-attention mechanism becomes prohibitively slow (quadratic cost), limiting how much evidence can be used.
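
To make the scaling argument concrete, the following back-of-the-envelope sketch counts query-key pairs in encoder self-attention for the two strategies. The function names are illustrative, the 250-token passage length is borrowed from the context length listed under Key Hyperparameters, and the count deliberately ignores attention heads, layers, and the decoder.

```python
def concat_attention_pairs(num_passages: int, tokens_per_passage: int) -> int:
    """Query-key pairs when all passages are concatenated into one input."""
    total_tokens = num_passages * tokens_per_passage
    return total_tokens ** 2


def independent_attention_pairs(num_passages: int, tokens_per_passage: int) -> int:
    """Query-key pairs when each passage is encoded on its own."""
    return num_passages * tokens_per_passage ** 2


for n in (10, 100):
    joint = concat_attention_pairs(n, 250)
    separate = independent_attention_pairs(n, 250)
    # The ratio equals the number of passages: concatenation is n times costlier.
    print(f"{n} passages: joint={joint}, separate={separate}, ratio={joint // separate}")
```

Under these simplifying assumptions, the gap between the two strategies grows in proportion to the number of passages, which is why concatenation stops being practical long before 100 passages.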

Key Novelty

Fusion-in-Decoder (FiD)

Encodes each retrieved passage (paired with the question) independently to keep computational cost linear with the number of passages
Fuses the information only during the decoding step, where the decoder attends to the concatenation of all encoder outputs jointly to generate the answer

Architecture

Figure: The architecture of the Fusion-in-Decoder method.

Evaluation Highlights

Achieves 51.4% Exact Match on NaturalQuestions with the large model, outperforming RAG (44.5%) and T5 (36.6%)
Achieves 67.6% Exact Match on TriviaQA, surpassing state-of-the-art extractive and generative baselines
Scaling the number of retrieved passages from 10 to 100 leads to significant performance gains (+6.5 EM points on TriviaQA), unlike extractive models, which plateau early

Breakthrough Assessment

9/10

This paper introduced the standard architecture for high-performance generative reader models. FiD became the default baseline for evidence fusion in RAG due to its simplicity and scalability.

⚙️ Technical Details

Problem Definition

Setting: Open-domain question answering where the evidence is not given as input

Inputs: A natural language question q and a set of retrieved support passages

Outputs: A generated answer string (text)

Pipeline Flow

Retrieval: Fetch 100 support passages using DPR or BM25
Encoding: Process each (Question + Passage) pair independently
Decoding: Generate answer by attending to all encoder outputs
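
The encoding step can be sketched as building one text input per retrieved passage. This summary does not specify the input template, so the "question:", "title:" and "context:" markers below, as well as the function name and the dictionary keys, are assumptions made for illustration.

```python
from typing import Dict, List


def build_encoder_inputs(question: str, passages: List[Dict[str, str]]) -> List[str]:
    """Pair the question with each passage, giving one encoder input per passage."""
    inputs = []
    for passage in passages:
        # Assumed template: the marker strings are illustrative, not confirmed here.
        inputs.append(
            f"question: {question} title: {passage['title']} context: {passage['text']}"
        )
    return inputs


passages = [
    {"title": "Alan Turing", "text": "Turing was born in Maida Vale, London."},
    {"title": "Maida Vale", "text": "Maida Vale is a district in west London."},
]
encoder_inputs = build_encoder_inputs("Where was Alan Turing born?", passages)
print(len(encoder_inputs))
```

Each string would then be tokenized and encoded separately, so the number of encoder forward passes equals the number of passages.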

System Modules

Retriever

Retrieve relevant support passages from Wikipedia

Model or implementation: DPR (for NQ/TriviaQA) or BM25 (for SQuAD)

Independent Encoder (Reading)

Create vector representations for each passage conditioned on the question

Model or implementation: T5 Encoder (base or large)

Joint Decoder (Reading)

Generate the answer by aggregating evidence from all encoded passages

Model or implementation: T5 Decoder (base or large)

Novel Architectural Elements

Decoupled encoding: Encoding N contexts independently, rather than as one long concatenated sequence or with attention across passages inside the encoder
Fusion-in-Decoder mechanism: Aggregating evidence solely through the decoder's cross-attention over concatenated encoder states
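
The two elements above can be illustrated with a toy, pure-Python sketch: per-passage encoder states are flattened into a single memory, and one decoder query attends over all of it. This is generic single-head dot-product attention with no learned projections, not T5's actual attention, and all names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def fuse_encoder_outputs(encoder_states: List[List[Vector]]) -> List[Vector]:
    """Concatenate per-passage token states into one flat memory for the decoder."""
    return [token for passage in encoder_states for token in passage]


def cross_attention(query: Vector, memory: List[Vector]) -> Vector:
    """One decoder query attending over every token of every passage."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
    top = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    return [sum(w * token[i] for w, token in zip(weights, memory)) for i in range(width)]


# Two passages, two tokens each, hidden width 2 (toy values).
encoder_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.5, 0.5]],
]
memory = fuse_encoder_outputs(encoder_states)
context = cross_attention([1.0, 0.0], memory)
print(len(memory), len(context))
```

The point of the sketch is that passages never interact while being encoded; evidence from different passages is first combined inside the attention-weighted sum computed for the decoder.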

Modeling

Base Model: T5 (Text-to-Text Transfer Transformer)

Training Method: Supervised fine-tuning

Objective Functions:

Purpose: Maximize likelihood of the correct answer.

Formally: Standard sequence-to-sequence cross-entropy loss.
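
As a minimal sketch of that objective, the function below computes the negative log-likelihood of the reference answer tokens given the model's per-step output distributions (teacher forcing). Averaging over tokens is an assumption, as is the function name; real implementations work on logits in batches.

```python
import math
from typing import List, Sequence


def sequence_cross_entropy(
    step_probs: Sequence[Sequence[float]], target_ids: List[int]
) -> float:
    """Mean negative log-likelihood of the target tokens."""
    if len(step_probs) != len(target_ids):
        raise ValueError("one probability distribution is required per target token")
    nll = 0.0
    for probs, token_id in zip(step_probs, target_ids):
        # Penalize the model by the log-probability it gave to the correct token.
        nll -= math.log(probs[token_id])
    return nll / len(target_ids)


# Toy vocabulary of size 2 and a two-token answer.
loss = sequence_cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
print(loss)
```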

Training Data:

NaturalQuestions, TriviaQA, SQuAD v1.1
Wikipedia dumps (Dec 2018 or Dec 2016 depending on dataset)

Key Hyperparameters:

learning_rate: 1e-4
batch_size: 64
optimizer: Adam
dropout_rate: 10%
training_steps: 10,000
passages_count: 100
context_length: 250 word pieces
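
For readers who want to carry these settings into code, one possible way to group them is a small frozen dataclass. The class and field names are illustrative and do not come from the released code; only the values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiDTrainingConfig:
    """Hyperparameters as listed in this summary; names are hypothetical."""

    learning_rate: float = 1e-4
    batch_size: int = 64
    optimizer: str = "adam"
    dropout_rate: float = 0.10
    training_steps: int = 10_000
    passages_count: int = 100
    context_length: int = 250  # word pieces per encoder input

    def tokens_per_example(self) -> int:
        """Upper bound on encoder tokens for one question across all passages."""
        return self.passages_count * self.context_length


config = FiDTrainingConfig()
print(config.tokens_per_example())  # 25000
```

The helper makes the memory pressure visible: a single training example can carry up to 25,000 encoder tokens.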

Compute: 64 Tesla V100 32 GB GPUs (approximately 425 GPU hours for training on 100 passages)

Comparison to Prior Work

vs. RAG: RAG marginalizes over passages or feeds them to the generator one at a time; FiD lets the decoder attend to all passages jointly, so it can use more context
vs. REALM: REALM focuses on pre-training; FiD focuses on scaling the number of contexts during fine-tuning/inference
vs. Extractive models (DPR, DrQA): FiD generates answers rather than extracting spans, allowing it to combine evidence
vs. GPT-3 [not cited in paper]: GPT-3 relies on implicit knowledge in weights; FiD uses explicit retrieval to reduce parameter count while maintaining accuracy

Limitations

Computationally expensive inference compared to extractive models due to processing 100 passages
Requires significant GPU memory to store gradients for 100 contexts during training
Relies on the quality of the external retrieval system (DPR/BM25)
Greedy decoding used for generation (beam search might improve results but adds cost)

Reproducibility

Code: https://github.com/facebookresearch/fid

The code is publicly available and relies on HuggingFace Transformers. The pre-trained T5 weights are the standard public checkpoints. The DPR and BM25 retrievers use FAISS and Apache Lucene, respectively.

📊 Experiments & Results

Evaluation Setup

Open-domain QA using Wikipedia as the knowledge source

Benchmarks:

NaturalQuestions (Open Domain QA)
TriviaQA (Open Domain QA)
SQuAD v1.1 (Reading Comprehension, adapted to Open Domain)

Metrics:

Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
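
A minimal sketch of Exact Match scoring follows. The normalization steps (lowercasing, removing punctuation and English articles, collapsing whitespace) follow the common SQuAD-style convention and are an assumption here, since this summary does not spell out the paper's exact normalization.

```python
import re
import string
from typing import Iterable

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Iterable[str]) -> bool:
    """True if the prediction equals any reference answer after normalization."""
    normalized = normalize_answer(prediction)
    return any(normalized == normalize_answer(ref) for ref in references)


print(exact_match("The Maida Vale, London.", ["Maida Vale, London"]))
```

The reported EM score would then be the percentage of questions for which this check succeeds.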

Key Results

Main comparison against state-of-the-art methods on NaturalQuestions and TriviaQA:

Benchmark	Metric	Baseline	This Paper	Δ
NaturalQuestions	EM	44.5	51.4	+6.9
NaturalQuestions	EM	36.6	51.4	+14.8
TriviaQA	EM	56.1	67.6	+11.5
TriviaQA	EM	57.9	67.6	+9.7

Ablation on scaling the number of retrieved passages, showing consistent improvement for FiD:

Benchmark	Metric	Baseline	This Paper	Δ
TriviaQA	EM	61.1	67.6	+6.5

Main Takeaways

Generative models excel at aggregating evidence: Performance improves significantly as the number of retrieved passages increases (up to 100), whereas extractive models often plateau.
Simplicity wins: Processing passages independently in the encoder allows linear scaling, avoiding the quadratic complexity of full self-attention over concatenated contexts.
Retrieval outperforms model scale: A 770M parameter model with retrieval outperforms an 11B parameter closed-book model (T5) by a large margin on NaturalQuestions.
Training efficiency: Fine-tuning on 100 passages for a short period after training on fewer passages captures most of the performance gains with significantly less compute.

📚 Prerequisite Knowledge

Prerequisites

Sequence-to-sequence (Seq2Seq) models
Transformer architecture (Encoder-Decoder)
Retrieval-Augmented Generation concepts

Key Terms

Fusion-in-Decoder: An architecture where passages are encoded separately but attended to jointly by the decoder

DPR: Dense Passage Retrieval—a method using dual BERT encoders to retrieve relevant documents based on semantic similarity

BM25: A ranking function used in information retrieval to estimate the relevance of documents to a given search query

Exact Match: An evaluation metric that counts a prediction as correct only if it matches one of the ground truth answers exactly after normalization

T5: Text-to-Text Transfer Transformer—a pre-trained language model that treats every NLP problem as a text generation task

Open Domain QA: Answering questions using a large collection of documents (like Wikipedia) without knowing in advance which specific document contains the answer

Extractive models: QA systems that select a specific span of text from a document as the answer

Generative models: QA systems that generate new text for the answer, potentially synthesizing information