In-context RALM: In-Context Retrieval-Augmented Language Models

📝 Paper Summary

Modularized RAG pipeline, Knowledge internalization

In-Context RALM demonstrates that off-the-shelf language models can achieve significant performance gains by simply prepending retrieved documents to the input context without any architecture changes or fine-tuning.

Core Problem

Standard language models hallucinate facts and lack source attribution, while existing Retrieval-Augmented Language Modeling (RALM) methods require complex architectural modifications and expensive retraining.

Why it matters:

Modifying LM architectures (like RETRO) complicates deployment and prevents using LMs available only via API
Factual inaccuracies and lack of provenance hinder the adoption of generative AI in high-stakes or knowledge-intensive domains
Retraining large models to accommodate retrieval mechanisms is computationally prohibitive for many practitioners

Concrete Example: A standard LM might assert that 'World Cup 2026 will expand to 48 teams' with no evidence to back the claim. When a retrieved news snippet about the 2026 tournament is prepended to the context window, the frozen LM can ground its next-token predictions in that text instead of relying on parametric memory alone.

Key Novelty

In-Context RALM (Retrieval-Augmented Language Modeling)

Leaves the LM frozen and unmodified, using the standard input context window to 'read' retrieved documents
Utilizes off-the-shelf retrievers (like BM25) to select documents based on the current text generation prefix
Introduces LM-oriented reranking where a model scores retrieved documents based on how well they predict the upcoming text tokens

Architecture

Conceptual diagram of In-Context RALM mechanism

Evaluation Highlights

A 345M parameter GPT-2 with In-Context RALM outperforms a 1.5B parameter GPT-2 (4x larger) on WikiText-103
In-Context RALM with BM25 improves a 6.7B parameter OPT model to match the perplexity of a 66B parameter OPT model (10x larger)
Sparse retrieval (BM25) surprisingly outperforms dense retrievers (Contriever, BERT) for the language modeling task in zero-shot settings

Breakthrough Assessment

8/10

Highly impactful for practitioners because it unlocks RAG capabilities for any black-box LLM without training, demonstrating that simple context prepending rivals complex architectural modifications.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive language modeling conditioned on external documents

Inputs: A sequence of tokens x_1, ..., x_n

Outputs: Next token probabilities conditioned on prefix and retrieved document: p(x_i | [R(x_<i); x_<i])

Pipeline Flow

Prefix Input -> Query Formulation -> Retrieval (BM25/Dense) -> Reranking (Optional) -> Context Prepending -> LM Generation
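The flow above can be sketched as a short generation loop. This is a minimal illustration, not the authors' implementation: `generate`, `retrieve`, `next_token`, and `copy_next_token` are hypothetical names, the retriever is a stub that always returns the same passage, and the "LM" is a toy rule that copies whatever followed the last token earlier in the context. Reranking is omitted.

```python
def generate(prefix, n_new_tokens, retrieve, next_token, stride=4, query_length=32):
    """Toy In-Context RALM loop: re-retrieve every `stride` tokens, prepend, predict."""
    tokens = list(prefix)
    doc = []
    for step in range(n_new_tokens):
        if step % stride == 0:
            query = tokens[-query_length:]   # query formulation: last l tokens
            doc = retrieve(query)            # retrieval (stubbed by the caller)
        # context prepending: the frozen LM only ever sees [doc; prefix]
        tokens.append(next_token(doc + tokens))
    return tokens


def copy_next_token(context):
    """Stand-in for a frozen LM: copy the token that followed the last token earlier on."""
    last = context[-1]
    for i, tok in enumerate(context[:-1]):
        if tok == last:
            return context[i + 1]
    return "<unk>"


demo_doc = "the world cup 2026 will feature 48 teams".split()
print(generate(["the", "world"], 4, lambda query: demo_doc, copy_next_token))
```

Because the model is only called on a concatenated token sequence, any LM that accepts a plain input, including one reachable only through an API, can be dropped in for `next_token`.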

System Modules

Query Formulator

Extracts the last l tokens from the current prefix to form a search query

Model or implementation: Deterministic slicing
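As a minimal sketch of this slicing step (the function name `formulate_query` is an illustrative choice, and the default of 32 mirrors the query length listed under Key Hyperparameters):

```python
def formulate_query(prefix_tokens, query_length=32):
    """Return the last `query_length` tokens of the prefix as the retrieval query."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    # Slicing is safe when the prefix is shorter than query_length:
    # the whole prefix is returned in that case.
    return prefix_tokens[-query_length:]


prefix = [f"tok{i}" for i in range(100)]
query = formulate_query(prefix)
print(len(query), query[0], query[-1])
```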

Retriever (Retrieval & Selection)

Retrieves top-k relevant documents from the external corpus

Model or implementation: Off-the-shelf BM25 (Pyserini) or Dense Retriever (Contriever/Spider)
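To make the sparse option concrete, here is a small pure-Python BM25 ranker. It is a sketch for intuition only and does not call Pyserini; the function name `bm25_rank`, the whitespace tokenization, the `k1` and `b` defaults, and the particular IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs, k1=0.9, b=0.4, top_k=1):
    """Rank tokenized documents against a tokenized query with a BM25 score."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for d in docs for term in set(d))
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scored.append((score, idx))
    # highest score first; ties broken by document index
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


docs = [
    "the world cup 2026 will feature 48 teams".split(),
    "bm25 is a bag of words ranking function".split(),
    "language models predict the next token".split(),
]
print(bm25_rank("how many teams in the world cup".split(), docs, top_k=3))
```

The score depends only on term overlap, which is why the query built from the most recent prefix tokens matters so much for what gets retrieved.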

Reranker (Retrieval & Selection)

Re-scores candidate documents to select the one most helpful for next-token prediction

Model or implementation: Zero-shot LM or Fine-tuned RoBERTa-base
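The zero-shot variant can be sketched as follows: hold out the most recent prefix tokens and keep the candidate under which a frozen LM assigns them the highest likelihood. The names `zero_shot_rerank`, `log_prob`, and `s_prime` are hypothetical, and `toy_log_prob` is a smoothed unigram stand-in for a real LM scorer, used only so the example runs without model weights.

```python
import math
from collections import Counter


def zero_shot_rerank(candidates, prefix_tokens, log_prob, s_prime=16):
    """Return the candidate doc maximizing log p(last s' prefix tokens | [doc; earlier prefix])."""
    if s_prime <= 0:
        raise ValueError("s_prime must be positive")
    target = prefix_tokens[-s_prime:]    # tokens the LM is asked to explain
    earlier = prefix_tokens[:-s_prime]   # remaining prefix stays in the context
    return max(candidates, key=lambda doc: log_prob(target, doc + earlier))


def toy_log_prob(target, context):
    """Add-one smoothed unigram log-likelihood of `target` estimated from `context`."""
    counts = Counter(context)
    vocab_size = len(set(context) | set(target))
    total = len(context)
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in target)


candidates = [
    "cats sleep a lot".split(),
    "the world cup has 48 teams".split(),
]
best = zero_shot_rerank(candidates, "in 2026 the world cup".split(), toy_log_prob, s_prime=3)
print(" ".join(best))
```

Since the score needs only likelihoods, `log_prob` could be backed by a smaller proxy LM rather than the model used for generation.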

Reader (Language Model)

Generates next tokens conditioned on the concatenation of the selected document and the current prefix

Model or implementation: GPT-2, GPT-Neo, OPT, or LLaMA (Frozen)
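A minimal sketch of the concatenation step, assuming a fixed context window. The helper name `build_lm_input` and both limits (`max_context=1024`, `max_doc_tokens=256`) are illustrative defaults, not values taken from this summary. The document is capped first, then the oldest prefix tokens are dropped so the most recent text always fits.

```python
def build_lm_input(doc_tokens, prefix_tokens, max_context=1024, max_doc_tokens=256):
    """Concatenate [doc; prefix], truncating so the result fits the context window."""
    doc = doc_tokens[:max_doc_tokens]
    budget = max_context - len(doc)
    if budget <= 0:
        raise ValueError("document leaves no room for the prefix")
    # keep the most recent prefix tokens; older ones fall out of the window
    prefix = prefix_tokens[-budget:]
    return doc + prefix


lm_input = build_lm_input(list(range(300)), list(range(1000, 2000)))
print(len(lm_input), lm_input[255], lm_input[256], lm_input[-1])
```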

Novel Architectural Elements

Zero-modification reading mechanism: Prepending documents directly to the input context buffer rather than using cross-attention or specialized layers
LM-supervised reranker: Training a bidirectional reranker using the frozen LM's perplexity reduction as the supervision signal

Modeling

Base Model: Range of models: GPT-2 (110M-1.5B), GPT-Neo/J (1.3B-6B), OPT (125M-66B), LLaMA (7B-33B)

Training Method: Predictive Reranker training only (LM remains frozen)

Objective Functions:

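Purpose: Minimize the negative log of the marginal likelihood of the ground-truth continuation y, so that the reranker shifts probability toward the documents under which the frozen LM predicts y best.

Formally: -log( sum_{i=1..k} p_rank(d_i|x) * p_theta(y|[d_i; x]) ), where d_1, ..., d_k are the retrieved candidates, p_rank is the trainable reranker, and p_theta is the frozen LM.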
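A small numeric sketch of this objective, under the assumption that the product is summed over the k retrieved candidates before the log is taken (a marginal-likelihood reading). The function name and inputs are hypothetical: `rerank_logits` are unnormalized reranker scores, and `lm_log_likelihoods` are the frozen LM's log p_theta(y|[d_i; x]) values, which act as fixed constants during training.

```python
import math


def predictive_rerank_loss(rerank_logits, lm_log_likelihoods):
    """-log sum_i softmax(rerank_logits)_i * exp(lm_log_likelihoods_i), in log space."""
    # log-softmax over candidates gives log p_rank(d_i | x)
    m = max(rerank_logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in rerank_logits))
    joint = [(logit - log_z) + ll for logit, ll in zip(rerank_logits, lm_log_likelihoods)]
    # log-sum-exp over candidates, then negate
    j = max(joint)
    return -(j + math.log(sum(math.exp(v - j) for v in joint)))


lm_ll = [math.log(0.5), math.log(0.1)]   # the first document helps the LM more
print(predictive_rerank_loss([2.0, 0.0], lm_ll))   # reranker favors the helpful doc
print(predictive_rerank_loss([0.0, 2.0], lm_ll))   # reranker favors the unhelpful doc
```

The loss is lower when the reranker puts more mass on the document that makes the continuation more likely, which is the training signal the frozen LM provides.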

Trainable Parameters: RoBERTa-base reranker weights (approx 125M parameters)

Training Data:

300,000 examples from WikiText-103 training set
Labels derived from LM perplexity on ground truth continuation

Key Hyperparameters:

learning_rate: 1e-5 (peak)
batch_size: 32
training_steps: 10000
retrieval_stride_s: 4
query_length_l: 32
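To illustrate how the stride and query length interact, the sketch below lists when retrieval would fire over a sequence and which prefix span feeds each query. The function name is hypothetical, and skipping retrieval at position 0 (where the prefix is empty) is an assumption of this example.

```python
def retrieval_schedule(n_tokens, stride=4, query_length=32):
    """Yield (position, query_start, query_end) for each retrieval call."""
    for pos in range(stride, n_tokens, stride):
        # the query covers the last `query_length` tokens before `pos`
        yield pos, max(0, pos - query_length), pos


print(list(retrieval_schedule(12)))
for s in (4, 64):
    n_calls = sum(1 for _ in retrieval_schedule(1024, stride=s))
    print(f"stride={s}: {n_calls} retrieval calls over 1024 tokens")
```

A smaller stride keeps the retrieved document aligned with the text being generated, at the price of many more retrieval calls and LM re-encodings of the prepended document.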

Compute: Not reported in the paper

Comparison to Prior Work

vs. kNN-LM: Does not require building a datastore of token representations, which avoids that datastore's storage and construction cost
vs. RETRO: Does not require pre-training or fine-tuning the LM; works with off-the-shelf models
vs. REPLUG [not cited in paper]: Similar prepending approach, but REPLUG trains the retriever while In-Context RALM focuses on off-the-shelf retrievers and rerankers

Limitations

Runtime cost increases because retrieval runs frequently (every 4 tokens) and the LM must recompute attention over the prepended document each time it changes
Context window limitations restrict the number/length of documents that can be prepended
BM25 outperforms dense retrievers here, which contradicts trends in other IR tasks, suggesting dense retrievers may need specific tuning for LM tasks
Experiments are limited to a single retrieved document in most LM settings (though up to two are used for ODQA)

Reproducibility

Code: https://github.com/AI21Labs/in-context-ralm

The code is publicly available at the repository above. Experiments use standard datasets (WikiText-103, The Pile) and open-source models (HuggingFace). Dense retrieval is implemented via DPR/FAISS; sparse retrieval via Pyserini.

📊 Experiments & Results

Evaluation Setup

Language modeling (next token prediction) and Open-Domain QA

Benchmarks:

WikiText-103 (Language Modeling)
RealNews (Language Modeling)
The Pile (ArXiv, StackExchange, FreeLaw) (Language Modeling)
Natural Questions (NQ) (Open-Domain QA)
TriviaQA (Open-Domain QA)

Metrics:

Perplexity (Word-level and Token-level)
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
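For reference, Exact Match is usually computed after light answer normalization. The sketch below follows the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, collapsing whitespace); whether this exact normalization matches the paper's evaluation script is an assumption.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """Return 1.0 if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return float(any(pred == normalize_answer(gold) for gold in gold_answers))


print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))
print(exact_match("Paris", ["London", "Rome"]))
```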

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Language modeling results on WikiText-103 showing massive gains from simple BM25 retrieval, with further gains from reranking.
WikiText-103	Word-level Perplexity	20.0	16.6	-3.4
WikiText-103	Word-level Perplexity	37.5	29.6	-7.9
WikiText-103	Word-level Perplexity	16.6	15.4	-1.2
Results on large-scale models (OPT) demonstrate that In-Context RALM bridges the gap between model sizes.
WikiText-103	Word-level Perplexity	10.0	10.0	0.0
Open-Domain QA results validating the approach on downstream tasks.
Natural Questions	Exact Match	12.0	31.0	+19.0
TriviaQA	Exact Match	54.8	60.1	+5.3

Experiment Figures

Perplexity scaling curves for OPT models (125M to 66B) on WikiText-103 and RealNews

Comparison of different retrievers (BM25 vs. BERT vs. Contriever vs. Spider) on WikiText-103

Main Takeaways

BM25 consistently outperforms off-the-shelf dense retrievers (Contriever, BERT) for language modeling tasks, contrary to trends in semantic search.
High retrieval frequency (small stride s=4) significantly improves performance compared to infrequent retrieval (s=64), suggesting 'high resolution' grounding is key.
There is a 'sweet spot' for query length (approx 32 tokens); too short lacks context, too long dilutes local relevance.
A smaller proxy model can be used for zero-shot reranking with minimal performance degradation compared to using the main model.

📚 Prerequisite Knowledge

Prerequisites

Autoregressive language modeling (next-token prediction)
Information Retrieval (BM25 vs. Dense Retrieval)
In-context learning (prompting)
Perplexity as an evaluation metric

Key Terms

RALM: Retrieval-Augmented Language Modeling—conditioning a language model on relevant documents during generation

In-Context RALM: The proposed method of prepending retrieved documents to the LM's input context without updating LM weights

Retrieval Stride: The interval (number of tokens) between consecutive retrieval operations during text generation

Retrieval Query Length: The number of recent tokens from the current prefix used to formulate the search query

BM25: Best Matching 25—a standard bag-of-words ranking function used for sparse retrieval

Perplexity: A measurement of how well a probability model predicts a sample; lower values indicate better performance
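As a small worked example, perplexity is the exponential of the average negative log-likelihood. The word-level and token-level variants differ only in the normalizer: the same total log-likelihood is divided by the number of words or by the number of subword tokens. The function name and the `n_units` parameter are illustrative.

```python
import math


def perplexity(token_log_probs, n_units=None):
    """exp(-total log-likelihood / n_units); n_units defaults to the token count."""
    n = n_units if n_units is not None else len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)


# four subword tokens, each predicted with probability 0.25
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))              # token-level: normalize by 4 tokens
print(perplexity(log_probs, n_units=2))   # word-level, if those tokens spell 2 words
```

Because a word often spans several subword tokens, word-level perplexity is at least as large as token-level perplexity for the same predictions, so the two should not be compared directly.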

Predictive Reranking: A proposed method where a reranker is trained to select documents that maximize the likelihood of the ground-truth continuation text

Zero-shot Reranking: Using a frozen LM to score retrieved documents based on the likelihood of the immediate context prefix given the document

Contriever: A dense retrieval model trained using contrastive learning

ODQA: Open-Domain Question Answering—answering questions based on a large collection of documents