SR-RAG: Binding Selective Retrieval with Knowledge Verbalization

📝 Paper Summary

Modularized RAG pipeline

SR-RAG reformulates selective retrieval as a routing problem in which an LLM dynamically chooses between retrieving external documents and verbalizing its own parametric knowledge before answering.

Core Problem

Existing selective retrieval methods fall back to standard generation whenever they skip retrieval, so they never exploit the LLM's ability to explicitly articulate (verbalize) its internal knowledge.

Why it matters:

Direct-answer fallbacks cap achievable performance whenever retrieval is skipped, because LLMs perform better when explicitly reasoning or reciting knowledge
Training labels for selective retrieval are often inaccurate because, without explicit knowledge elicitation, they underestimate the LLM's internal capabilities
Standard RAG systems suffer from high latency and distraction from low-quality retrieved documents

Concrete Example: For the query 'Who succeeded the first President of Namibia?', a standard LLM might fail directly. However, if prompted to verbalize its knowledge first ('Namibia has had four presidents...'), it can answer correctly without external retrieval. SR-RAG captures this capability to avoid unnecessary retrieval.

Key Novelty

Self-Routing Retrieval-Augmented Generation (SR-RAG)

Reformulate selective retrieval as a 'knowledge source selection' problem where the model chooses between external sources (Wikipedia) and internal sources (Self)
Incorporate explicit 'knowledge verbalization' (generating background context from memory) when retrieval is skipped, rather than just generating the answer directly
Use a nearest-neighbor (kNN) inference policy to dynamically adjust the selection decision based on the hidden states of similar past queries

Architecture

Overview of SR-RAG inference pipeline showing the decision branch between External Retrieval and Knowledge Verbalization

Evaluation Highlights

Outperforms vanilla selective retrieval by 7.9% (PopQA), 2.1% (TriviaQA), and 4.7% (PubHealth) while performing significantly fewer retrievals
Reduces retrieval volume by 26% to 40% compared to strong selective retrieval baselines while maintaining or improving accuracy
Achieves better accuracy-latency trade-offs: as verbalization increases, system latency decreases linearly while maintaining high accuracy

Breakthrough Assessment

7/10

Strong conceptual advance in treating 'self' as a distinct RAG source, with significant efficiency gains. However, the method relies on existing verbalization techniques (GenRead) and standard dense retrieval.

⚙️ Technical Details

Problem Definition

Setting: Knowledge-Intensive Question Answering with optional retrieval

Inputs: User query q

Outputs: Response generated by LLM M based on selected knowledge source s(q)

Pipeline Flow

Input Processing: Query → <EOQ>
Source Selection: LLM predicts source token + kNN adjustment → Select Source
Knowledge Acquisition: Retrieve external docs OR Verbalize internal knowledge
Generation: <EOK> → Final Answer
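The four stages above can be sketched as a single routing function. This is a minimal illustration, not the paper's implementation: the callables `select_source`, `retrieve`, `verbalize`, and `generate` are hypothetical stand-ins for the shared LLM and the retriever, the source-token spellings `<Wiki>` and `<Self>` are assumed, and the exact prompt layout is a guess.

```python
from typing import Callable, List

EOQ, EOK = "<EOQ>", "<EOK>"        # markers named in the pipeline flow
WIKI, SELF = "<Wiki>", "<Self>"    # assumed spellings of the source tokens


def sr_rag_answer(
    query: str,
    select_source: Callable[[str], str],   # hypothetical: returns WIKI or SELF
    retrieve: Callable[[str], List[str]],  # hypothetical: external retriever
    verbalize: Callable[[str], str],       # hypothetical: LLM writes background
    generate: Callable[[str], str],        # hypothetical: LLM writes the answer
) -> str:
    # 1. Input processing: close the query with the end-of-query marker.
    prompt = f"{query} {EOQ}"
    # 2. Source selection: the selector picks one knowledge source.
    source = select_source(prompt)
    # 3. Knowledge acquisition: only one branch is executed per query.
    if source == WIKI:
        knowledge = "\n".join(retrieve(query))
    elif source == SELF:
        knowledge = verbalize(f"{prompt} {source}")
    else:
        raise ValueError(f"unknown source token: {source!r}")
    # 4. Generation: answer conditioned on the query and acquired knowledge.
    return generate(f"{prompt} {source} {knowledge} {EOK}")
```

Because only the selected branch runs, a query routed to `<Self>` never touches the retriever, which is where the latency saving would come from.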

System Modules

Source Selector (Retrieval & Selection)

Decide whether to retrieve external data or generate internal knowledge

Model or implementation: Llama-2-7B-Chat / Phi-3.5 / Qwen2.5 (fine-tuned)

Knowledge Verbalizer (Retrieval & Selection)

Generate background context from parametric knowledge if Internal source selected

Model or implementation: Same shared LLM

External Retriever (Retrieval & Selection)

Fetch documents from Wikipedia if External source selected

Model or implementation: Dense Retriever (DPR-based)

Response Generator

Generate final answer conditioned on query and acquired knowledge

Model or implementation: Same shared LLM

Novel Architectural Elements

Dynamic inference routing via kNN: Augmenting the LLM's token probability for source selection with a k-nearest neighbor search over a datastore of hidden states from training examples
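One way to picture the kNN adjustment is as an interpolation between the model's own source-token probability and a vote among the nearest stored hidden states. The sketch below is an assumption-heavy illustration: the summary does not give the combination formula, so the linear mix, the Euclidean distance, and the parameter names `k`, `lam`, and `threshold` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def knn_adjusted_route(
    p_model_self: float,                           # model probability of the "self" source token
    query_state: Sequence[float],                  # hidden state of the current query
    datastore: List[Tuple[Sequence[float], str]],  # (hidden state, best source) pairs
    k: int = 3,
    lam: float = 0.5,                              # assumed weight on the kNN vote
    threshold: float = 0.5,                        # assumed decision threshold
) -> str:
    """Return "self" or "external" from a kNN-adjusted probability."""
    # Find the k stored states closest to the query state (Euclidean distance).
    neighbours = sorted(
        datastore, key=lambda item: math.dist(query_state, item[0])
    )[:k]
    if neighbours:
        # Fraction of neighbours whose recorded best source was "self".
        p_knn_self = sum(label == "self" for _, label in neighbours) / len(neighbours)
        p_combined = lam * p_knn_self + (1.0 - lam) * p_model_self
    else:
        # With an empty datastore, fall back to the model probability alone.
        p_combined = p_model_self
    return "self" if p_combined >= threshold else "external"
```

In this form, a query whose neighbours were all answered better from internal knowledge can be routed to "self" even when the model's own token probability is slightly below the threshold.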

Modeling

Base Model: Llama-2-7B-Chat, Phi-3.5-mini-instruct, Qwen2.5-7B-Instruct

Training Method: Two-stage Fine-tuning: Behavior Cloning (Stage 1) + Direct Preference Optimization (Stage 2)

Objective Functions:

Purpose: Learn to select the correct knowledge source.

Formally: L_src = -log p(<s>|q)

Purpose: Learn to verbalize useful knowledge (only when self is preferred).

Formally: L_verb = -log p(c_i+|q)

Purpose: Learn to generate the answer given knowledge.

Formally: L_ans = -log p(a|q, c)

Purpose: Align verbalization to generate high-quality knowledge over low-quality.

Formally: L_DPO_verb = -log σ(β log (p(c+|q)/p_ref(c+|q)) - β log (p(c-|q)/p_ref(c-|q)))
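To make the DPO objective concrete, the sketch below evaluates it for one preference pair. It assumes the inputs are sequence-level log-probabilities of the preferred (c+) and rejected (c-) verbalizations under the policy and the frozen reference model; the function name and argument names are illustrative, and the default β = 0.3 is the dpo_beta value listed under Key Hyperparameters.

```python
import math


def dpo_loss(
    logp_pos: float,      # log p(c+|q) under the policy
    logp_pos_ref: float,  # log p_ref(c+|q) under the reference model
    logp_neg: float,      # log p(c-|q) under the policy
    logp_neg_ref: float,  # log p_ref(c-|q) under the reference model
    beta: float = 0.3,
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (pos log-ratio - neg log-ratio))."""
    margin = beta * ((logp_pos - logp_pos_ref) - (logp_neg - logp_neg_ref))
    # -log sigmoid(m) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both verbalizations the margin is zero and the loss is log 2; the loss falls below that as the policy raises the preferred verbalization relative to the rejected one.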

Adaptation: Full fine-tuning

Trainable Parameters: All parameters of the LLM

Training Data:

Mixture of 6 datasets (WoW, NQ, FEVER, OBQA, ARC-Easy, ASQA)
53,042 total instances
Labels derived by comparing likelihood of answer given External Retrieval vs. Verbalization (GenRead)
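The labeling rule in the last item can be illustrated as a comparison of answer log-likelihoods under the two kinds of context. This is a hedged sketch: the function name, taking the maximum over candidate contexts, and breaking ties in favor of "self" are all assumptions, since the summary only states that the two likelihoods are compared.

```python
from typing import Sequence


def label_source(
    ext_logliks: Sequence[float],   # log p(answer | query, retrieved context), one per candidate
    self_logliks: Sequence[float],  # log p(answer | query, verbalized context), one per candidate
) -> str:
    """Label a training query with the source that makes the gold answer more likely."""
    if not ext_logliks or not self_logliks:
        raise ValueError("need at least one candidate context per source")
    # Assumption: compare the best candidate from each source; ties go to "self".
    return "self" if max(self_logliks) >= max(ext_logliks) else "external"
```

For example, `label_source([-2.3], [-1.1])` labels the query "self", because the verbalized context gives the gold answer a higher log-likelihood than the retrieved one.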

Key Hyperparameters:

learning_rate_stage1: 1e-5
learning_rate_stage2: 5e-7
batch_size: 64
epochs_stage1: 1
dpo_beta: 0.3

Compute: Training: ~10 hours on 8x A800 (80GB) GPUs

Comparison to Prior Work

vs. Self-RAG: SR-RAG explicitly verbalizes knowledge when retrieval is skipped (vs. direct answering) and uses kNN for robust routing [Self-RAG cited]
vs. GenRead: SR-RAG dynamically routes between GenRead and Retrieval, rather than always using one [GenRead cited]
vs. Standard Selective Retrieval (e.g. pivoting on uncertainty): SR-RAG uses performance-based labeling (which source yields better answer likelihood) rather than just query difficulty or uncertainty [cited]
vs. SKR (Self-Knowledge Guided Retrieval): SR-RAG integrates the verbalization step into the training loop via DPO, rather than just using it for labeling [cited]

Limitations

Relies on the availability of a high-quality external datastore (Wikipedia) and retriever performance
Inference requires maintaining a kNN datastore for the routing policy, adding memory overhead
The threshold for the combined kNN/model probability requires manual tuning (though the paper claims robustness)
Verbalization latency can still be significant compared to direct answering (though faster than retrieval)

Reproducibility

Paper: https://knowledge-nlp.github.io/naacl2025/papers/26.pdf

Code availability is not stated in the paper text. Dataset construction details (GenRead prompts, mixture statistics) are provided. Inference uses a standard dense retriever (Karpukhin et al., 2020).

📊 Experiments & Results

Evaluation Setup

Open-domain QA and Fact Checking

Benchmarks:

PopQA (Long-tail Entity QA)
TriviaQA (Open-domain QA)
PubHealth (Fact Checking)
ARC Challenge (Science QA)

Metrics:

Accuracy (Exact Match or Accuracy depending on task)
Retrieval Frequency (%RAG)
Statistical methodology: Not explicitly reported in the paper

Key Results

SR-RAG consistently achieves comparable or better accuracy than Always RAG while significantly reducing the number of retrievals needed.

Benchmark	Metric	Baseline	This Paper	Δ
PopQA	Accuracy	0.565	0.566	+0.001
PopQA	% Retrieval	98	96	-2
PubHealth	Accuracy	0.589	0.715	+0.126
PubHealth	% Retrieval	100	40	-60
TriviaQA	Accuracy	0.641	0.664	+0.023

Ablation studies confirm the importance of kNN routing and DPO training.

Benchmark	Metric	Baseline	This Paper	Δ
Average (All 4 datasets)	Accuracy	0.634	0.644	+0.010
Average (All 4 datasets)	Accuracy	0.616	0.644	+0.028

Experiment Figures

Bar chart comparing answer likelihoods under Direct Answering vs. RAG vs. GenRead (Verbalization)

Trade-off curves between Latency and Accuracy as verbalization frequency increases

Main Takeaways

Binding selective retrieval with knowledge verbalization significantly outperforms standard selective retrieval (which falls back to direct answering).
Knowledge verbalization creates a higher 'performance upper bound' for the non-retrieval path, allowing the system to skip retrieval more often without accuracy loss.
The kNN-based inference policy is effective at adapting to the model's capability shifts after fine-tuning, improving routing accuracy.
SR-RAG adapts to dataset difficulty automatically: high retrieval for long-tail tasks (PopQA) and high verbalization for common knowledge/reasoning (PubHealth).

📚 Prerequisite Knowledge

Prerequisites

Retrieval-Augmented Generation (RAG) architecture
Knowledge Distillation / Behavior Cloning
Direct Preference Optimization (DPO)
Nearest Neighbor Search (kNN)

Key Terms

Knowledge Verbalization: The process of prompting an LLM to generate relevant background information from its internal weights (parametric knowledge) before answering a question

Selective Retrieval: An inference strategy where the system dynamically decides whether to perform retrieval or rely on the model's internal knowledge

Parametric Knowledge: Information stored implicitly in the neural network weights of an LLM, acquired during pre-training

GenRead: A specific prompting method used to elicit (verbalize) knowledge from an LLM by asking it to generate a background document

DPO: Direct Preference Optimization—a method to align language models to preferences without a separate reward model

kNN: k-Nearest Neighbors—an algorithm used here to find similar past queries in embedding space to help decide the best knowledge source

Behavior Cloning: A supervised learning approach where a model is trained to mimic the actions (here, source selection and generation) of an expert policy

Dense Retrieval: Retrieval based on semantic vector similarity (embeddings) rather than keyword matching