Towards Better Generalization in Open-Domain Question Answering by Mitigating Context Memorization

📝 Paper Summary

Modularized RAG pipeline
Factuality and hallucination in LLMs

The paper introduces Corpus-Invariant Tuning (CIT), a training loss that prevents the reader from over-memorizing retrieved documents, thereby forcing reliance on retrieval and improving generalization to new or updated corpora.

Core Problem

Retrieval-augmented models often over-memorize training documents into their parameters instead of relying on the retriever, causing failures when the external corpus is updated or the domain changes.

Why it matters:

Real-world knowledge evolves continually (e.g., Wikipedia updates), but models trained on old data struggle to adapt even when provided with new retrieved documents due to parametric bias.
Adapting to new domains (e.g., biomedical) usually requires extensive retraining if the model has hard-coded general-domain knowledge.
Prior OpenQA methods focus on in-domain accuracy but neglect how much the reader relies on retrieved context versus its internal memory.

Concrete Example: If a model memorizes 'Boris Johnson' as the UK PM from an old corpus, it may ignore a retrieved document stating 'Rishi Sunak is PM' from a new corpus. On Natural Questions, the paper reports Exact Match falling from 62.2 (fresh training) to 56.9 (transfer) when moving from Wiki-2017 to Wiki-2018.

Key Novelty

Corpus-Invariant Tuning (CIT)

Introduces a regularization loss that prevents the reader's likelihood of generating the retrieved document text from increasing during training.
Forces the reader to act as a reasoning module over inputs rather than a memory module, shifting the burden of knowledge storage back to the retriever.

Architecture

Illustration of the Corpus-Invariant Tuning (CIT) strategy.

Evaluation Highlights

+2.1 Exact Match (EM) points on Natural Questions (56.9 to 59.0) when transferring from Wiki-2017 to Wiki-2018, compared to the Atlas-XL baseline.
Achieves state-of-the-art F1 scores among 3B-sized models on the RobustQA benchmark (averaging across 8 domains) by mitigating over-memorization.
Significant gains in 'Life' domain generalization (where overlap with Wikipedia is high), validating that reducing memorization helps when knowledge conflicts or updates occur.

Breakthrough Assessment

7/10

Addresses a critical but often overlooked issue in RAG (parametric vs. non-parametric conflict). The solution is simple and effective, though primarily demonstrated on standard benchmarks.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering (OpenQA) using a retrieval-augmented generation model

Inputs: Natural language question x and a large-scale external knowledge corpus Z

Outputs: Predicted answer y_hat

Pipeline Flow

Retriever: Selects relevant documents C from corpus Z given question x
Reader: Generates answer y using x and C (trained with CIT loss)
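
The two-stage flow above can be sketched with a toy example. This is a minimal illustration, not the paper's implementation: the word-overlap scorer is a stand-in for a dense retriever such as Contriever, and the names retrieve and build_reader_input are hypothetical.

```python
from collections import Counter

def retrieve(question, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the question.

    Assumption: this lexical scorer only stands in for a dense retriever.
    """
    q_tokens = Counter(question.lower().split())

    def score(doc):
        d_tokens = Counter(doc.lower().split())
        # Multiset intersection counts the words shared by question and doc.
        return sum((q_tokens & d_tokens).values())

    return sorted(corpus, key=score, reverse=True)[:k]

def build_reader_input(question, docs):
    """Pair the question x with each retrieved document in C.

    A Fusion-in-Decoder style reader encodes each pair independently.
    """
    return [f"question: {question} context: {d}" for d in docs]

corpus = [
    "Rishi Sunak is the prime minister of the United Kingdom.",
    "The Eiffel Tower is located in Paris.",
    "Natural Questions is an open-domain QA benchmark.",
]
question = "Who is the prime minister of the United Kingdom?"
docs = retrieve(question, corpus, k=1)
reader_inputs = build_reader_input(question, docs)
```

Swapping the corpus for an updated one changes only what retrieve returns; the reader is expected to follow the new context rather than its stored knowledge.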

System Modules

Retriever

Select a subset of documents relevant to the question

Model or implementation: Contriever

Reader

Generate the answer while minimizing memorization of the document text itself

Model or implementation: Atlas-XL (T5-3B based Fusion-in-Decoder)

Novel Architectural Elements

Modification of the training objective to include a 'Corpus-Invariant' loss term that penalizes improvements in the reader's likelihood of generating retrieved document spans

Modeling

Base Model: Atlas-XL (3B parameters, based on T5)

Training Method: Multi-task learning (QA objective + CIT regularization)

Objective Functions:

Purpose: Minimize standard QA loss.

Formally: L_QA (negative log-likelihood of answer y given x and C).

Purpose: Prevent memorization of retrieved contexts.

Formally: L_CIT = max(0, p_phi(c) - p_phi0(c)), where p_phi(c) is the likelihood of the document under current parameters and p_phi0 under initial parameters.
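
The hinge form of L_CIT can be illustrated with plain numbers. The sketch below assumes the document likelihoods are already available as scalars (the summary notes they are approximated with masked span prediction), and it assumes the two objectives are combined as L_QA + alpha * L_CIT, which this summary does not state explicitly. The function names are hypothetical.

```python
def cit_loss(p_current, p_initial):
    """Hinge penalty max(0, p_phi(c) - p_phi0(c)).

    Zero while the reader is no better at generating the document than
    it was at initialization; positive once that likelihood rises.
    """
    return max(0.0, p_current - p_initial)

def total_loss(qa_nll, p_current, p_initial, alpha=0.2):
    """Assumed combination of the two objectives: L_QA + alpha * L_CIT."""
    return qa_nll + alpha * cit_loss(p_current, p_initial)

# Reader has started to memorize the document: likelihood rose above baseline.
penalized = total_loss(qa_nll=1.0, p_current=0.30, p_initial=0.25)

# Likelihood at or below baseline: the penalty vanishes, only L_QA remains.
unpenalized = total_loss(qa_nll=1.0, p_current=0.20, p_initial=0.25)
```

Because the penalty is one-sided, it never rewards the reader for forgetting the document; it only discourages likelihood gains relative to the initial parameters.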

Key Hyperparameters:

alpha (CIT strength): 0.1 to 0.5 (Grid search, 0.2 best for corpus update, 0.3 best for cross-domain)
optimizer: AdamW
learning_rate: 4e-5 (for NQ)
batch_size: 64 (Global batch size)
scheduler: linear decay
warmup_ratio: 0.06
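
To make the scheduler settings concrete, here is a minimal sketch of a linear-decay schedule with warmup. It assumes the warmup phase is linear and that the rate decays to zero at the final step; the total step count used in the example is hypothetical, since the summary does not report it.

```python
def lr_at_step(step, total_steps, peak_lr=4e-5, warmup_ratio=0.06):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from the peak to 0 at the last step.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Hypothetical run length, for illustration only.
TOTAL_STEPS = 1000
schedule = [lr_at_step(s, TOTAL_STEPS) for s in (0, 30, 60, 530, 1000)]
```

With a warmup ratio of 0.06 and 1000 steps, the peak of 4e-5 is reached at step 60 and the rate returns to zero at step 1000.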

Compute: 4 NVIDIA A100 (80GB) GPUs; ~3.5 hours for convergence on one OpenQA dataset

Comparison to Prior Work

vs. Atlas-XL: Adds L_CIT loss term to prevent the reader from memorizing document text, forcing reliance on context
vs. RGF/FiD-KD: Specifically targets the generalization failure mode caused by parametric memorization rather than just improving in-domain accuracy

Limitations

CIT introduces a hyperparameter (alpha) that must be tuned manually; optimal values differ between tasks (0.2 for corpus updates vs 0.3 for cross-domain).
Requires computing the likelihood of retrieved documents during training, which adds computational overhead.
Experiments limited to Wikipedia-based and RobustQA domains; not tested on highly specialized technical or code domains.

Reproducibility

The paper does not indicate a public code release (no URL appears in the abstract or introduction). Grid-search hyperparameters are detailed in tables. Training relies on Atlas-XL and Contriever checkpoints.

📊 Experiments & Results

Evaluation Setup

Open-domain QA with retrieval from different corpus versions (Wiki-2017 vs Wiki-2018) and cross-domain transfer (RobustQA).

Benchmarks:

Natural Questions (NQ) (Open-domain QA)
TriviaQA (Open-domain QA)
RobustQA (Cross-domain OpenQA, 8 domains)

Metrics:

Exact Match (EM)
F1 score
Cross-domain Relative Performance (CRP)
Statistical methodology: Not explicitly reported in the paper
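
The three metrics can be sketched as follows. The answer normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD-style convention and is an assumption, since the exact normalization is not given here. CRP is implemented as the plain ratio of cross-domain to intra-domain performance; whether the paper reports it as a fraction or a percentage is not specified in this summary.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    return float(any(normalize(prediction) == normalize(g) for g in gold_answers))

def f1_score(prediction, gold):
    """Token-level F1 between a prediction and a single gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def crp(cross_domain_score, intra_domain_score):
    """Cross-domain Relative Performance as a plain ratio."""
    if intra_domain_score <= 0:
        raise ValueError("intra-domain score must be positive")
    return cross_domain_score / intra_domain_score

em = exact_match("The Rishi Sunak.", ["Rishi Sunak"])
f1 = f1_score("Rishi Sunak", "Prime Minister Rishi Sunak")
```

A CRP close to 1 would mean the model loses little when moved out of its training domain.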

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Generalization across corpus updates (Wiki-2017 -> Wiki-2018). Models trained on 2017 data are tested on 2018 data.
Natural Questions	EM	56.9	59.0	+2.1
Natural Questions	EM	59.5	61.3	+1.8
TriviaQA	EM	73.2	74.8	+1.6
Cross-domain generalization using RobustQA. Models trained on NQ (Wikipedia) are evaluated on 8 diverse domains.
RobustQA (Average)	F1	47.7	49.6	+1.9
RobustQA (Life domain)	F1	48.2	57.3	+9.1

Experiment Figures

Heatmap of Cross-Domain Relative Performance (CRP) improvements for different CIT strengths (alpha).

Line graph showing the sensitivity of generalization performance to the hyperparameter alpha (CIT strength).

Main Takeaways

Corpus-Invariant Tuning (CIT) consistently improves generalization across both temporal corpus updates and domain shifts compared to standard fine-tuning.
The method is particularly effective for domains with high overlap with the pre-training corpus (e.g., Life domain), where unlearning memorized facts is crucial.
Performance gains are achieved without degrading in-domain performance (Original setting results remain comparable or slightly better).
Analysis of 'overlap rate' confirms that CIT reduces the number of errors where the model outputs an outdated memorized answer instead of the retrieved one.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architectures (Retriever-Reader)
Language Modeling objectives (Masked Span Prediction)
Knowledge Distillation / Regularization techniques

Key Terms

CIT: Corpus-Invariant Tuning—a training strategy that adds a loss term to prevent the reader model from becoming better at generating the retrieved documents themselves

OpenQA: Open-domain Question Answering—answering fact-based questions using a large, unstructured text corpus rather than a specific context provided with the question

Atlas: A state-of-the-art retrieval-augmented language model architecture that uses a Contriever for retrieval and Fusion-in-Decoder (FiD) for reading

EM: Exact Match—a metric measuring the percentage of predictions that match the ground truth answer exactly

FiD: Fusion-in-Decoder—a reader architecture that processes retrieved documents independently in the encoder and fuses representations in the decoder

Contriever: A dense information retrieval model trained using contrastive learning

Masked Span Prediction: A pre-training objective where the model predicts masked-out sequences of text, used here as a proxy for the likelihood of the document

CRP: Cross-domain Relative Performance—a metric defined in this paper as the ratio of cross-domain performance to intra-domain performance