Self-RAG: Learning to retrieve, generate, and critique through self-reflection

📝 Paper Summary

Agentic RAG pipeline, Reflective generation

Self-RAG trains a single language model to adaptively retrieve information and self-critique its own outputs using generated reflection tokens, enabling controllable trade-offs between factuality and creativity.

Core Problem

Standard RAG approaches retrieve indiscriminately (even when unnecessary) and lack mechanisms to verify whether the retrieved content is relevant or whether the final generation is supported by evidence.

Why it matters:

Indiscriminate retrieval can introduce noise or off-topic information, hurting versatility for creative tasks
Models often ignore retrieved context or hallucinate answers that contradict the evidence
Current methods lack fine-grained control over generation behaviors (e.g., forcing high citation accuracy versus fluent continuation) without retraining

Concrete Example: When asked a personal essay prompt like 'Write about your best vacation,' standard RAG might force retrieval of irrelevant travel guides, lowering quality. Conversely, for a factual query, standard RAG might retrieve a relevant document but the model still hallucinates an unsupported answer.

Key Novelty

Self-Reflective Retrieval-Augmented Generation (Self-RAG)

Trains an arbitrary LM to generate 'reflection tokens' (Retrieve, IsRel, IsSup, IsUse) alongside text, allowing the model to self-assess the need for retrieval and the quality of its output
Uses a 'critic' model to annotate a training corpus offline with these reflection tokens, then distills this capability into the generator via standard next-token prediction
Enables inference-time control (e.g., 'cite more frequently') by weighting reflection tokens during beam search without requiring model retraining

Architecture

Overview of the Self-RAG inference framework compared to standard RAG. Shows the generation of reflection tokens at each step.

Evaluation Highlights

Outperforms ChatGPT (retrieval-augmented) and Llama2-chat (retrieval-augmented) on Open-domain QA, reasoning, and fact verification tasks
Significantly improves factuality and citation accuracy for long-form generation tasks (e.g., biography generation) compared to standard RAG baselines
Achieves higher performance with 7B/13B parameters than larger proprietary models on diverse tasks like PubHealth and PopQA

Breakthrough Assessment

9/10

Introduces a flexible, training-efficient paradigm (reflection tokens) that addresses major RAG pain points (indiscriminate retrieval, lack of self-verification) and enables controllable inference.

⚙️ Technical Details

Problem Definition

Setting: Conditional text generation where the model generates output y segment-by-segment given input x, optionally retrieving passages D

Inputs: Input prompt x

Outputs: Sequence of text tokens and special reflection tokens y = [y_1, ..., y_T]

Pipeline Flow

Retrieval Decision: Input → Generator decides [Retrieve]
Retrieval (Conditional): If [Retrieve]=Yes → Retriever fetches documents
Parallel Generation: Generator processes (Input + Doc) for multiple docs in parallel
Self-Reflection: Generator appends [IsRel], [IsSup], [IsUse] tokens to each candidate segment
Selection: Beam search selects best segment based on weighted reflection token scores
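
The five steps above can be sketched as a single decoding step. This is a minimal illustration under stated assumptions, not the authors' implementation: `Candidate`, `self_rag_step`, and the callables `decide_retrieve`, `retrieve`, `generate`, and `score` are hypothetical names standing in for the generator, the retriever, and the selection rule. The sketch also loops over passages sequentially, where the summary describes parallel processing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    """One candidate segment plus the reflection-token scores attached to it."""
    text: str
    passage: Optional[str]
    reflection: Dict[str, float]  # hypothetical, e.g. {"IsRel": 0.9, "IsSup": 0.8}


def self_rag_step(
    prompt: str,
    decide_retrieve: Callable[[str], bool],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, Optional[str]], Candidate],
    score: Callable[[Candidate], float],
) -> Candidate:
    """Run one segment of the pipeline: decide, retrieve, generate, select."""
    # Step 1: the generator itself decides whether retrieval is needed.
    if not decide_retrieve(prompt):
        return generate(prompt, None)

    # Step 2: fetch passages only when [Retrieve]=Yes.
    passages = retrieve(prompt)
    if not passages:
        return generate(prompt, None)

    # Steps 3-4: one candidate per passage, each carrying its own critique scores.
    # (Sequential here for clarity; a real system would batch these calls.)
    candidates = [generate(prompt, p) for p in passages]

    # Step 5: keep the candidate with the best reflection-based score.
    return max(candidates, key=score)
```

Passing the components in as callables keeps the control flow visible on its own: the retrieval branch is skipped entirely when the generator decides it is not needed, which is the behavior that separates this flow from always-retrieve RAG.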

System Modules

Generator (M)

Generates task text AND reflection tokens to control the process

Model or implementation: Llama 2 (7B or 13B)

Retriever (R)

Retrieves relevant passages when requested by Generator

Model or implementation: Not explicitly specified in the inference pipeline section (a standard dense retriever is implied)

Critic (C) [Training Only]

Annotates training corpus with reflection tokens to supervise the Generator

Model or implementation: Llama 2-7B fine-tuned on GPT-4 distilled data

Novel Architectural Elements

Integration of self-reflection tokens directly into the generator's vocabulary
On-demand retrieval mechanism controlled by the generator itself via [Retrieve] token
Segment-level beam search using reflection token probabilities as soft rewards for ranking
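
To make the "soft reward" idea concrete, the sketch below ranks candidate segments by adding a weighted sum of reflection-token scores to each segment's log-probability. The exact scoring rule, the weight values, the candidate numbers, and the names `desirable_prob` and `segment_score` are assumptions for illustration; the summary only states that reflection token probabilities are weighted during beam search.

```python
from typing import Dict


def desirable_prob(token_probs: Dict[str, float], desirable: str) -> float:
    """Normalized probability of the desirable value within one reflection-token group."""
    total = sum(token_probs.values())
    return token_probs[desirable] / total if total > 0 else 0.0


def segment_score(
    log_prob: float,
    critique: Dict[str, float],
    weights: Dict[str, float],
) -> float:
    """Segment log-probability plus a weighted sum of critique scores (assumed form)."""
    return log_prob + sum(weights.get(name, 0.0) * s for name, s in critique.items())


# Two hypothetical candidates: A is more fluent, B is better supported by evidence.
cand_a = {"IsRel": 0.9, "IsSup": 0.3, "IsUse": 0.8}
cand_b = {"IsRel": 0.9, "IsSup": 0.9, "IsUse": 0.6}
log_prob_a, log_prob_b = -1.0, -1.6

# Hypothetical weight presets; only the weights change, never the model.
balanced = {"IsRel": 1.0, "IsSup": 1.0, "IsUse": 1.0}
cite_heavy = {"IsRel": 1.0, "IsSup": 3.0, "IsUse": 0.5}

a_wins_balanced = (
    segment_score(log_prob_a, cand_a, balanced)
    > segment_score(log_prob_b, cand_b, balanced)
)
b_wins_cite_heavy = (
    segment_score(log_prob_b, cand_b, cite_heavy)
    > segment_score(log_prob_a, cand_a, cite_heavy)
)
```

With these made-up numbers, candidate A ranks first under the balanced weights and candidate B ranks first once the support weight is raised, which is the kind of inference-time control that needs no retraining.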

Modeling

Base Model: Llama 2 (7B and 13B)

Training Method: Supervised Fine-Tuning (SFT) on augmented corpus

Objective Functions:

Purpose: Train generator to predict next tokens including text and reflection tokens.

Formally: Standard next-token prediction loss $\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid x, y_{<t})$ over the augmented vocabulary (text tokens plus reflection tokens).
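
As a rough illustration of this objective, the sketch below computes the summed negative log-likelihood over a toy vocabulary in which reflection tokens sit next to ordinary text tokens. It uses only the standard library; the vocabulary entries, the token surface forms such as `[Retrieve=Yes]`, and the function names are hypothetical and do not come from the paper's code.

```python
import math
from typing import List, Sequence

# Hypothetical augmented vocabulary: text tokens and reflection tokens share one index space.
VOCAB = ["the", "answer", "is", "Paris", "[Retrieve=Yes]", "[IsSup=Fully]"]


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def next_token_loss(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> float:
    """Sum of -log p(y_t | x, y_<t) over all target positions."""
    return -sum(
        log_softmax(logits)[target]
        for logits, target in zip(step_logits, target_ids)
    )
```

Because reflection tokens are just additional vocabulary entries, no separate loss term is needed for them: a target such as `[Retrieve=Yes]` contributes to the sum in exactly the same way as a text token.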

Training Data:

Critic training data: 4k-20k supervised examples per token type collected from GPT-4
Generator training data: 150k instructions from Alpaca and diverse reasoning/QA datasets, augmented with reflection tokens by the Critic model
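
The offline augmentation step can be pictured as interleaving critic-predicted reflection tokens with the original output segments. The sketch below is a hedged illustration only: `augment_example`, the shape of the critic's return value, the `<p>...</p>` passage wrapper, and the token surface forms are all hypothetical and are not taken from the released data format.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical critic output for one segment: a passage (or None) plus critique labels.
CriticLabels = Dict[str, Optional[str]]


def augment_example(
    instruction: str,
    segments: List[str],
    critic: Callable[[str, str], CriticLabels],
    utility: int,
) -> str:
    """Interleave critic-predicted reflection tokens with the output segments."""
    parts: List[str] = []
    for segment in segments:
        labels = critic(instruction, segment)
        passage = labels.get("passage")
        if passage is None:
            # The critic judged that this segment needs no retrieval.
            parts += ["[Retrieve=No]", segment]
        else:
            # Retrieval case: passage, relevance label, segment, support label.
            parts += [
                "[Retrieve=Yes]",
                f"<p>{passage}</p>",
                f"[IsRel={labels['IsRel']}]",
                segment,
                f"[IsSup={labels['IsSup']}]",
            ]
    # A single overall utility label closes the example.
    parts.append(f"[IsUse={utility}]")
    return " ".join(parts)
```

The resulting string is an ordinary training sequence, so the generator can be fine-tuned on it with plain next-token prediction and the critic is no longer needed at inference time.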

Key Hyperparameters:

learning_rate: 2e-5
batch_size: 128 (global)
epochs: 3
max_length: 2048
optimizer: AdamW
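
For reference, the listed hyperparameters can be collected in a small configuration object. This is a sketch rather than the authors' training script: the class name `TrainConfig` and the helper `total_steps` are hypothetical, and the step count assumes roughly 150k generator training examples with no gradient-accumulation or packing effects.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in the summary."""
    learning_rate: float = 2e-5
    global_batch_size: int = 128
    epochs: int = 3
    max_length: int = 2048
    optimizer: str = "AdamW"

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps for the full run, rounding the last partial batch up."""
        return math.ceil(num_examples / self.global_batch_size) * self.epochs


config = TrainConfig()
approx_steps = config.total_steps(150_000)  # 1172 steps per epoch * 3 epochs = 3516
```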

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard RAG: Self-RAG retrieves adaptively (only when needed) and filters irrelevant context via self-critique
vs. Toolformer: Self-RAG integrates critique of the *output* quality ([IsSup], [IsUse]) not just the tool call itself
vs. Reinforcement Learning from Human Feedback (RLHF): Self-RAG achieves control/alignment via supervised training on offline-generated tokens, avoiding the complexity/instability of PPO training

Limitations

Inference latency is higher than standard generation due to segment-level beam search and multiple critiques
Relies on the accuracy of the Critic model during data creation; if the Critic is flawed, the Generator learns flawed reflection
Requires carefully curated instruction tuning data to learn the reflection behaviors

Reproducibility

Code: https://selfrag.github.io/

Code and trained models (7B/13B) are publicly available at https://selfrag.github.io/. The Critic training data was collected via GPT-4 (proprietary), but the main training uses the distilled Critic model, which reduces the final pipeline's dependency on GPT-4.

📊 Experiments & Results

Evaluation Setup

Evaluated on diverse tasks including Open-domain QA, Reasoning, and Fact Verification

Benchmarks:

PubHealth (Fact Verification)
PopQA (Open-domain QA)
ARC-Challenge (Reasoning)
Biography Generation (Long-form generation)

Metrics:

Accuracy
F1 Score
Citation Precision
Citation Recall

Statistical methodology: Not explicitly reported in the paper

Key Results

Self-RAG (7B/13B) consistently outperforms much larger baselines (Llama2-chat, ChatGPT) and standard RAG baselines on knowledge-intensive and reasoning tasks.

Benchmark	Metric	Baseline	This Paper	Δ
PubHealth	Accuracy	56.4	70.4	+14.0
PopQA	Accuracy	47.7	52.2	+4.5
Biography Generation	FactScore	73.4	81.4	+8.0
Biography Generation	Citation Precision	40.3	93.0	+52.7

Main Takeaways

Self-RAG 7B and 13B models significantly outperform state-of-the-art LLMs (including ChatGPT) and standard retrieval-augmented models on diverse tasks.
The model learns to cite evidence only when it supports the output, raising Citation Precision on biography generation from 40.3 (baseline) to 93.0.
Adaptive retrieval allows the model to perform well on both knowledge-intensive tasks (where it retrieves) and reasoning/creative tasks (where it minimizes retrieval), preserving versatility.

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Language Model fine-tuning / Instruction tuning
Beam search decoding strategies

Key Terms

reflection tokens: Special vocabulary tokens generated by the model to signal decisions (e.g., [Retrieve]) or assessments (e.g., [IsRel], [IsSup], [IsUse])

critic model: An auxiliary model (initialized from Llama 2-7B) trained on GPT-4 distilled data to annotate the training corpus with reflection tokens

Retrieve token: A decision token indicating whether external documents are needed to answer the current query

IsRel token: A critique token indicating if a retrieved document provides useful information for the input

IsSup token: A critique token indicating if the generated response is fully supported by the retrieved evidence

IsUse token: A critique token indicating the overall utility/quality of the response

segment-level beam search: A decoding strategy where the model generates a full sentence/segment, evaluates it using reflection token probabilities, and selects the best path

control tokens: Special tokens used to guide generation style or content; here, reflection tokens serve as dynamic control tokens

knowledge distillation: Transferring capabilities from a large model (teacher, here GPT-4) to a smaller model (student, here the Critic model)