REPLUG: Retrieval-Augmented Black-Box Language Models

📝 Paper Summary

Modularized RAG pipeline, Black-box LLM augmentation

REPLUG enhances frozen, black-box language models by prepending retrieved documents via an ensemble strategy and tuning the retriever using the LLM's own output probabilities as supervision.

Core Problem

Large language models (LLMs) with more than 100B parameters are often accessible only as black boxes via APIs. This rules out traditional white-box retrieval augmentation, which requires access to internal representations or fine-tuning gradients.

Why it matters:

State-of-the-art models like GPT-3 and Codex are commercially closed or too expensive to fine-tune (e.g., BLOOM-176B requires 72 A100 GPUs)
Existing retrieval methods like RETRO or kNN-LM require modifying architectures or accessing internal hidden states, which fails for API-based models
Black-box LLMs still suffer from hallucination and outdated knowledge despite their size

Concrete Example: When predicting the continuation for a text about a rare entity like 'Li Bai', a standard GPT-2 model fails because it lacks specific knowledge. REPLUG retrieves a relevant biography, matches the entity name, and improves the probability of the correct next token by 11%.

Key Novelty

REPLUG (Retrieve and Plug) & REPLUG LSR (LM-Supervised Retrieval)

Treats the LLM as a frozen scoring function: The retriever is optimized to find documents that minimize the perplexity of the ground truth text under the black-box LLM
Parallel Ensemble Inference: Instead of concatenating all documents into one long prompt (which hits context limits), documents are processed in parallel passes and predictions are ensembled based on retrieval scores

Architecture

Overview of the REPLUG inference process and the Black-Box assumption

Evaluation Highlights

6.3% relative improvement in language modeling (bits per byte, lower is better) for GPT-3 Davinci (175B) on the Pile dataset using REPLUG LSR
5.1% relative accuracy improvement on MMLU (Massive Multi-task Language Understanding) for Codex (175B) using 5-shot in-context learning (68.3 → 71.8, i.e. +3.5 points)
Achieves state-of-the-art few-shot performance on Natural Questions (45.5%) and TriviaQA (77.3%) using Codex, outperforming the white-box Atlas model trained on 64 examples

Breakthrough Assessment

8/10

First framework to successfully apply retrieval augmentation to >100B black-box models with retriever tuning, demonstrating that frozen LLMs can supervise their own retrieval modules.

⚙️ Technical Details

Problem Definition

Setting: Retrieval-augmented language modeling where the generator (LM) is a black box (parameters θ fixed, only P(y|x) accessible) and the retriever is tuneable

Inputs: Input context sequence x

Outputs: Next token probability distribution P(y|x)

Pipeline Flow

Document Retrieval (Retrieve top-k documents)
Parallel Encoding (Pass input + each document through frozen LM separately)
Ensemble (Weighted average of output probabilities)
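
The ensemble step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `softmax` and `replug_ensemble` are hypothetical, the toy numbers are made up, and normalizing the retrieval scores with a softmax is an assumption about how the weights are formed. The per-document distributions stand in for the outputs of k separate calls to the frozen LM.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


def replug_ensemble(retrieval_scores, per_doc_next_token_probs):
    """Weighted average of k next-token distributions.

    retrieval_scores: shape (k,), one similarity score per retrieved document.
    per_doc_next_token_probs: shape (k, vocab); row i is the LM's next-token
        distribution when document i is prepended to the input.
    """
    weights = softmax(retrieval_scores)  # assumed normalization of the scores
    probs = np.asarray(per_doc_next_token_probs, dtype=float)
    return weights @ probs  # shape (vocab,)


# Toy example: k = 3 documents, vocabulary of 3 tokens (made-up numbers).
scores = [2.0, 1.0, 0.0]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
ensembled = replug_ensemble(scores, probs)
```

Because the weights sum to one and each row is a distribution, the result is again a valid distribution over the vocabulary.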

System Modules

Retriever

Encode input and documents to find top-k relevant documents

Model or implementation: Contriever (Dual Encoder)
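
As a rough sketch of what the retriever module does, the snippet below ranks documents by cosine similarity to a query embedding. The embeddings here are hand-written toy vectors standing in for Contriever outputs, `top_k_documents` is a hypothetical helper, and a real system would use an approximate-nearest-neighbor index rather than a dense matrix product.

```python
import numpy as np


def top_k_documents(query_vec, doc_matrix, k):
    """Return indices and cosine similarities of the k most similar documents."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]


# Toy 2-D embeddings (placeholders for real encoder outputs).
doc_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query_vec = np.array([1.0, 0.2])
top_idx, top_sims = top_k_documents(query_vec, doc_matrix, k=2)
```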

Black-Box LM (Generation)

Compute probability of next token given context prepended with a single retrieved document

Model or implementation: GPT-3 (Curie/Davinci), Codex, etc. (Frozen)

Ensemble Mechanism (Generation)

Combine probabilities from k passes into final prediction

Model or implementation: Weighted Average

Novel Architectural Elements

Parallel-context ensemble inference: Encoding retrieved documents in independent parallel passes rather than a single long context window to bypass length limits
LM-supervised retriever update loop: Using the frozen black-box LM's perplexity reduction as a supervision signal to update the dense retriever

Modeling

Base Model: Varies (GPT-3 175B, Codex 175B, GPT-2, OPT, BLOOM)

Training Method: KL Divergence minimization between Retriever distribution and LM-likelihood distribution

Objective Functions:

Purpose: Compute probability of retrieving a document based on similarity.

Formally: P_R(d|x) = exp(s(d,x)/γ) / Σ_{d'} exp(s(d',x)/γ)

Purpose: Compute probability of a document being 'good' based on how much it helps the LM predict the ground truth.

Formally: Q(d|x,y) = exp(P_LM(y|d,x)/β) / Σ_{d'} exp(P_LM(y|d',x)/β)

Purpose: Minimize difference between retrieval probability and LM preference.

Formally: L = KL(P_R(d|x) || Q(d|x,y))
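
To make the three objectives concrete, here is a small numeric sketch that evaluates P_R, Q, and the KL loss for one query. It only computes the loss value; a real implementation would use an autodiff framework so gradients reach the retriever. The function name `lsr_kl_loss`, the toy scores, and the value γ = 0.1 are assumptions for illustration (the summary reports β = 0.1 but no value for γ).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    exp = np.exp(x - x.max())
    return exp / exp.sum()


def lsr_kl_loss(sim_scores, lm_likelihoods, gamma, beta):
    """KL(P_R || Q) over the retrieved documents for a single query.

    sim_scores: s(d, x) for each retrieved document d.
    lm_likelihoods: P_LM(y | d, x) for each d, obtained from the frozen LM.
    """
    p_r = softmax(np.asarray(sim_scores, dtype=float) / gamma)    # retrieval likelihood
    q = softmax(np.asarray(lm_likelihoods, dtype=float) / beta)   # LM preference
    return float(np.sum(p_r * (np.log(p_r) - np.log(q))))


# Toy values for three retrieved documents (made up for illustration).
loss = lsr_kl_loss(
    sim_scores=[0.9, 0.5, 0.1],
    lm_likelihoods=[0.30, 0.10, 0.05],
    gamma=0.1,  # assumed value; not reported in the summary
    beta=0.1,
)
# When the retriever's scores already mirror the LM's preferences, the loss is zero.
zero_loss = lsr_kl_loss([0.3, 0.1, 0.05], [0.3, 0.1, 0.05], gamma=0.1, beta=0.1)
```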

Adaptation: Fine-tuning the Contriever (dense retriever) only; LM is frozen

Trainable Parameters: Retriever parameters only

Training Data:

800K sequences from Pile training data
Queries: first 128 tokens; Ground truth: last 128 tokens
Retrieval corpus: 36M documents from Pile
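
The query and ground-truth split described above amounts to slicing each training sequence. The sketch below assumes each sequence holds at least 256 token IDs; `split_query_and_target` is a hypothetical helper, not a function from the paper.

```python
def split_query_and_target(token_ids, query_len=128, target_len=128):
    """Split a token sequence into a retrieval query and a ground-truth continuation."""
    if len(token_ids) < query_len + target_len:
        raise ValueError("sequence is too short for the requested split")
    query = token_ids[:query_len]       # first 128 tokens: used to retrieve documents
    target = token_ids[-target_len:]    # last 128 tokens: scored by the frozen LM
    return query, target


tokens = list(range(256))  # stand-in for real token IDs
query, target = split_query_and_target(tokens)
```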

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 64
retrieval_temperature_gamma: Not explicitly reported in the paper (implied parameter of softmax)
lm_temperature_beta: 0.1
training_steps: 25k
index_update_frequency: Every 3k steps
k_retrieved_documents: 20 for training, 10 for inference
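
For reference, the reported hyperparameters could be collected in a small config object like the one below. The class `LSRConfig` and helper `index_refresh_steps` are hypothetical names, γ is left unset because no value is reported here, and refreshing the index exactly at multiples of 3k steps is an assumption about the schedule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LSRConfig:
    learning_rate: float = 2e-5
    batch_size: int = 64
    lm_temperature_beta: float = 0.1
    retrieval_temperature_gamma: Optional[float] = None  # not reported in the summary
    training_steps: int = 25_000
    index_update_every: int = 3_000
    k_train: int = 20
    k_inference: int = 10


def index_refresh_steps(cfg: LSRConfig) -> List[int]:
    """Steps at which document embeddings would be recomputed (assumed schedule)."""
    return list(range(cfg.index_update_every, cfg.training_steps + 1, cfg.index_update_every))


cfg = LSRConfig()
refresh_steps = index_refresh_steps(cfg)
```

Under this assumed schedule the index is rebuilt eight times during training, at steps 3,000 through 24,000.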

Compute: Not reported in the paper

Comparison to Prior Work

vs. Atlas: REPLUG keeps the LM frozen and black-box, whereas Atlas fine-tunes the LM
vs. RETRO: REPLUG requires no architectural changes or pre-training; works with existing APIs
vs. kNN-LM: REPLUG does not require access to internal vector representations (hidden states), only output probabilities
vs. DSP (Demonstrate-Search-Predict) [not cited in paper]: DSP focuses on multi-hop reasoning via sophisticated prompting pipelines, while REPLUG focuses on improving base perplexity and single-step QA via ensemble probabilities

Limitations

Computational overhead: Requires running the heavy LM k times (once per document) for every inference step, increasing cost and latency linearly with k.
Context window bottleneck: Although ensemble helps, each individual pass is still limited by the LM's context window size.
Lack of interpretability: It is unclear when the model relies on retrieved knowledge versus parametric knowledge.
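
The computational overhead noted above is simple to quantify. The sketch below counts LM forward passes, assuming one pass per retrieved document per prediction and one pass per prediction without retrieval; `lm_forward_passes` is a hypothetical helper and the workload size is arbitrary.

```python
def lm_forward_passes(num_predictions: int, k: int) -> int:
    """Total LM calls when each prediction needs one pass per retrieved document."""
    return num_predictions * k


baseline_calls = lm_forward_passes(num_predictions=1000, k=1)   # no retrieval
replug_calls = lm_forward_passes(num_predictions=1000, k=10)    # k = 10 at inference
overhead = replug_calls / baseline_calls
```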

Reproducibility

No code release is reported. The paper uses public datasets (Pile, MMLU, NQ, TriviaQA) and publicly available APIs/models (GPT-3, Codex). Training relies on calling the GPT-3 Curie API for the supervision signal, which incurs API costs.

📊 Experiments & Results

Evaluation Setup

Language Modeling (perplexity) and Downstream Tasks (QA, MMLU)

Benchmarks:

The Pile (Language Modeling)
MMLU (Massive Multi-task Language Understanding)
Natural Questions (NQ) (Open Domain QA)
TriviaQA (TQA) (Open Domain QA)

Metrics:

BPB (Bits per byte)
Accuracy (5-shot for MMLU)
Exact Match (Few-shot for QA)
Statistical methodology: Not explicitly reported in the paper
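
For readers unfamiliar with the BPB metric listed above, it is the total negative log-likelihood of a text, expressed in bits, divided by the text's length in UTF-8 bytes. A minimal sketch, assuming per-token log-probabilities in nats are available from the LM (`bits_per_byte` is a hypothetical helper and the inputs are toy values):

```python
import math


def bits_per_byte(token_log_probs_nats, text):
    """Negative log-likelihood in bits, normalized by the UTF-8 byte length of the text."""
    total_bits = -sum(token_log_probs_nats) / math.log(2)  # convert nats to bits
    n_bytes = len(text.encode("utf-8"))
    return total_bits / n_bytes


# Toy example: two tokens, each assigned probability 0.5, covering a 4-byte string.
bpb = bits_per_byte([math.log(0.5), math.log(0.5)], "abcd")
```

Normalizing by bytes rather than tokens makes scores comparable across models with different tokenizers.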

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Language modeling results on The Pile showing improvements across GPT-3 sizes.
The Pile	BPB	0.80	0.75	-0.05
The Pile	BPB	0.88	0.82	-0.06
MMLU Benchmark results comparing Codex with REPLUG variants against baselines.
MMLU	Accuracy	68.3	71.8	+3.5
Open Domain QA results (Few-shot setting).
Natural Questions	Exact Match	40.6	45.5	+4.9
TriviaQA	Exact Match	73.6	77.3	+3.7
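
The Δ column above reports absolute differences. The sketch below recomputes relative changes from the table's rounded values, which appears consistent with the headline figures of roughly 6.3% (Pile, first row) and 5.1% (MMLU) if those are read as relative changes. The helper `relative_change_percent` is hypothetical, and the dictionary keys are labels chosen for this example.

```python
def relative_change_percent(baseline, new):
    """Signed relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# (baseline, this paper) pairs copied from the Key Results table.
rows = {
    "Pile BPB (row 1)": (0.80, 0.75),
    "Pile BPB (row 2)": (0.88, 0.82),
    "MMLU accuracy": (68.3, 71.8),
    "Natural Questions EM": (40.6, 45.5),
    "TriviaQA EM": (73.6, 77.3),
}
relative = {name: relative_change_percent(b, n) for name, (b, n) in rows.items()}
```

For BPB a negative change is an improvement, since lower is better.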

Experiment Figures

The REPLUG LSR training process

Perplexity improvement vs Model Size for GPT-2, BLOOM, and OPT

Main Takeaways

REPLUG LSR consistently outperforms standard REPLUG (frozen retriever), indicating that using the black-box LM to supervise the retriever is effective.
The approach scales well to very large models (>100B parameters) and across different families (GPT-3, OPT, BLOOM), which was previously difficult for retrieval augmentation.
Ensembling retrieved documents significantly outperforms ensembling random documents, confirming the value of the retrieval mechanism.
Analysis shows retrieval benefits rare entities (long-tail knowledge) most significantly.

📚 Prerequisite Knowledge

Prerequisites

Language Modeling (perplexity, next-token prediction)
Dense Retrieval (dual-encoders, embeddings)
KL Divergence (for loss calculation)
In-context learning (few-shot prompting)

Key Terms

REPLUG: Retrieve and Plug—the proposed framework for augmenting black-box LMs with retrieved documents

REPLUG LSR: REPLUG with LM-Supervised Retrieval—the training scheme where the retriever is tuned to minimize the black-box LM's perplexity

Contriever: A specific dense information retrieval model based on contrastive learning, used as the base retriever

Perplexity: A measurement of how well a probability model predicts a sample; lower is better

KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution is different from a second, reference probability distribution

FAISS: Facebook AI Similarity Search—a library for efficient similarity search and clustering of dense vectors

Dual Encoder: A retrieval architecture that uses two separate encoders (often sharing weights) to embed queries and documents into the same vector space

MMLU: Massive Multi-task Language Understanding—a benchmark covering 57 tasks including STEM, humanities, and social sciences

Zero-shot/Few-shot: Evaluating a model with no (zero) or very few (few) examples in the prompt