Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity

📝 Paper Summary

Modularized RAG pipeline

Adaptive-RAG uses a small classifier to predict query complexity and dynamically selects the most efficient strategy—no retrieval, single-step retrieval, or multi-step retrieval—for each query.

Core Problem

Existing RAG approaches use a 'one-size-fits-all' strategy: simple questions waste compute on unnecessary retrieval steps, while complex multi-hop questions fail with simple retrieval methods.

Why it matters:

Real-world user queries vary widely in complexity, from simple fact lookups to complex reasoning chains
Applying multi-step retrieval to every query adds repeated retriever and LLM calls even when one pass, or none, would suffice
Applying single-step or no retrieval to complex queries results in incorrect answers

Concrete Example: For the simple query 'Paris is the capital of what?', a multi-step RAG system wastes resources searching documents. Conversely, for 'When did the people who captured Malakoff come to the region where Philipsburg is located?', a single-step RAG fails because it cannot connect the four necessary reasoning steps.

Key Novelty

Complexity-Based Adaptive RAG Strategy Selection

Classify incoming queries into three complexity levels (A: answerable by LLM, B: single-step retrieval, C: multi-step retrieval) using a smaller language model
Dynamically route the query to the most appropriate solver based on the predicted complexity, avoiding unnecessary computation for simple queries and ensuring sufficiency for complex ones
Automatically generate training labels for the classifier using model predictions and dataset inductive biases (e.g., multi-hop datasets imply complexity C)

Architecture

Conceptual diagram comparing the proposed Adaptive-RAG against 'Simple' (panel A) and 'Complex' (panel B) approaches; the panel letters are distinct from the complexity labels A/B/C. It shows the Classifier directing queries to one of three paths.

Evaluation Highlights

Achieves higher accuracy than adaptive baselines like 'Adaptive Retrieval' (+5.5 EM points on the multi-hop average) while maintaining efficiency
Reduces computational cost significantly compared to always-on multi-step methods (e.g., 40-50% faster inference than Iter-Retgen)
Outperforms single-step RAG by ~12-14% on complex multi-hop benchmarks like HotpotQA and 2WikiMultihopQA

Breakthrough Assessment

7/10

A practical, effective approach to the efficiency-accuracy trade-off in RAG. While the core idea of adaptive retrieval isn't new, the specific implementation of a classifier trained on 'silver' labels from model outcomes is a solid engineering contribution.

⚙️ Technical Details

Problem Definition

Setting: Open-domain Question Answering handling queries of varying complexities (single-hop to multi-hop)

Inputs: Natural language query q

Outputs: Predicted answer a

Pipeline Flow

Complexity Classifier (predicts label A, B, or C)
Strategy Selector (routes to Non-RAG, Single-RAG, or Multi-RAG based on label)
Selected QA Model (executes the chosen strategy to produce answer)
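
The three stages above can be sketched as a small dispatch routine. This is a minimal illustration, not the authors' code: the solver functions, `route`, and `toy_classifier` are hypothetical stand-ins, and the word-count rule only exists so the example runs without a trained classifier.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the three solvers. Real ones would call an LLM
# and, for labels B and C, a retriever.
def answer_no_retrieval(query: str) -> str:
    return f"[no-retrieval] {query}"

def answer_single_step(query: str) -> str:
    return f"[single-step] {query}"

def answer_multi_step(query: str) -> str:
    return f"[multi-step] {query}"

# Strategy selector: one solver per complexity label.
SOLVERS: Dict[str, Callable[[str], str]] = {
    "A": answer_no_retrieval,
    "B": answer_single_step,
    "C": answer_multi_step,
}

def route(query: str, classify: Callable[[str], str]) -> str:
    """Classify the query, then run only the solver chosen for its label."""
    label = classify(query)
    if label not in SOLVERS:
        raise ValueError(f"unknown complexity label: {label!r}")
    return SOLVERS[label](query)

def toy_classifier(query: str) -> str:
    """Placeholder rule (word count), NOT the trained classifier."""
    n_words = len(query.split())
    if n_words <= 6:
        return "A"
    if n_words <= 12:
        return "B"
    return "C"
```

Passing the classifier in as a callable keeps the routing logic independent of how the label is produced, so a fine-tuned model could replace `toy_classifier` without touching `route`.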

System Modules

Complexity Classifier

Predict the complexity level of the input query (A, B, or C)

Model or implementation: Smaller LM (e.g., T5-Large or T5-Base)

Non-Retrieval Solver (Strategy A) (Execution)

Answer simple queries using only parametric knowledge

Model or implementation: LLM (e.g., FLAN-T5-XXL, GPT-3.5)

Single-Step Retrieval Solver (Strategy B) (Execution)

Answer moderate queries using one round of retrieval

Model or implementation: LLM + Retriever

Multi-Step Retrieval Solver (Strategy C) (Execution)

Answer complex queries using iterative retrieval and generation

Model or implementation: LLM + Retriever (iterative)
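
The difference between the three solvers is mainly how many retrieval rounds feed the LLM. The sketch below shows that shape under stated assumptions: `Retriever` and `Generator` are hypothetical callables, the multi-step loop stops after a fixed `max_steps`, and the next search query is formed by appending the intermediate output. The paper's actual multi-step procedure and stopping rule may differ.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]        # search query -> passages
Generator = Callable[[str, List[str]], str]   # (question, passages) -> text

def solve_no_retrieval(query: str, generate: Generator) -> str:
    # Strategy A: parametric knowledge only, no passages.
    return generate(query, [])

def solve_single_step(query: str, retrieve: Retriever, generate: Generator) -> str:
    # Strategy B: one retrieval round, then one generation call.
    return generate(query, retrieve(query))

def solve_multi_step(query: str, retrieve: Retriever, generate: Generator,
                     max_steps: int = 3) -> str:
    # Strategy C: alternate retrieval and generation, accumulating passages.
    passages: List[str] = []
    search_query = query
    answer = ""
    for _ in range(max_steps):
        for passage in retrieve(search_query):
            if passage not in passages:
                passages.append(passage)
        answer = generate(query, passages)
        # Assumption: the intermediate output steers the next retrieval round.
        search_query = f"{query} {answer}"
    return answer
```

Each extra round in `solve_multi_step` costs one more retriever call and one more LLM call, which is the overhead the classifier tries to avoid for simple queries.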

Novel Architectural Elements

A pre-execution classifier that routes queries to one of three distinct architectural strategies (No-RAG, Single-RAG, Multi-RAG) rather than a single dynamic architecture

Modeling

Base Model: FLAN-T5-XXL or GPT-3.5 (turbo-instruct) for the solver; T5-Base/Large/XL for the classifier

Training Method: Supervised Fine-Tuning (SFT) of the classifier

Objective Functions:

Purpose: Minimize classification error against silver labels.

Formally: Cross-entropy loss between predicted complexity logits and silver labels.
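
For a single query, the loss is the negative log-probability that the classifier assigns to the silver label. A minimal sketch, assuming the classifier exposes one logit per class in the order A, B, C (the names `LABELS` and `cross_entropy` are illustrative, not from the paper):

```python
import math
from typing import Sequence

LABELS = ("A", "B", "C")

def cross_entropy(logits: Sequence[float], silver_label: str) -> float:
    """Negative log-probability of the silver label under softmax(logits)."""
    target = LABELS.index(silver_label)
    # Log-sum-exp with the max subtracted for numerical stability.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]
```

With uniform logits the loss equals log(3), the value for a classifier that has learned nothing; training pushes the logit of the silver label up and the loss toward zero.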

Training Data:

Labels constructed via two steps:
1. Model-prediction based: Run all 3 strategies (No-RAG, Single, Multi) and assign the label of the simplest strategy that answers correctly. If No-RAG is correct -> Label A. If No-RAG fails but Single is correct -> Label B. If only Multi is correct -> Label C.
2. Dataset-bias based (for queries left unlabeled in step 1 because all three strategies failed): Single-hop dataset queries -> Label B. Multi-hop dataset queries -> Label C.
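
The two labeling steps can be written as one function. This is a sketch under the assumption that ties go to the simplest strategy that succeeds; the function name, argument names, and the `"single-hop"` / `"multi-hop"` strings are hypothetical.

```python
from typing import Optional

def silver_label(no_rag_correct: bool,
                 single_correct: bool,
                 multi_correct: bool,
                 dataset_type: Optional[str] = None) -> Optional[str]:
    """Return 'A', 'B', 'C', or None when no rule applies."""
    # Step 1: label from model outcomes, preferring the cheapest success.
    if no_rag_correct:
        return "A"
    if single_correct:
        return "B"
    if multi_correct:
        return "C"
    # Step 2: every strategy failed, so fall back to the dataset's bias.
    if dataset_type == "single-hop":
        return "B"
    if dataset_type == "multi-hop":
        return "C"
    return None
```

For example, a query that only the multi-step solver answers gets label C, while a query from a multi-hop dataset that no solver answers also gets C through the fallback.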

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: Not reported in the paper
epochs: Not reported in the paper

Compute: Not explicitly reported in the paper

Comparison to Prior Work

vs. Adaptive Retrieval: Adaptive-RAG handles 3 levels (including multi-step) rather than just binary retrieve/no-retrieve based on entity popularity
vs. Self-RAG: Uses a separate, lightweight classifier before generation rather than generating special tokens during generation
vs. Iter-Retgen: Selectively applies iterative steps only when necessary, saving compute

Limitations

Relies on the quality of the 'silver labels' which are derived from model correctness; if the base models are weak, the classifier learns poor boundaries.
Requires running inference on training data with multiple models to generate labels, which is a one-time cost.
The three complexity classes (A, B, C) are coarse-grained; real-world complexity might be more continuous.

Reproducibility

Code: https://github.com/starsuzi/Adaptive-RAG

The method relies on off-the-shelf retrievers and LLMs, so it should be relatively reproducible given the silver-label construction heuristic described.

📊 Experiments & Results

Evaluation Setup

Open-domain QA on datasets of varying complexity

Benchmarks:

SQuAD (Single-hop QA)
Natural Questions (NQ) (Single-hop QA)
TriviaQA (Single-hop QA)
HotpotQA (Multi-hop QA)
2WikiMultihopQA (Multi-hop QA)
MuSiQue (Multi-hop QA)

Metrics:

Exact Match (EM)
F1 score
Statistical methodology: Not explicitly reported in the paper
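
Both metrics compare a predicted answer string with a reference answer. The sketch below follows the widely used SQuAD-style convention (lowercase, strip punctuation and English articles, then compare tokens); whether the paper applies exactly this normalization is an assumption.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact answers, while F1 gives partial credit: a prediction of "Paris France" against the reference "Paris" scores 0 on EM but has full recall and half precision under F1.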

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison results showing Adaptive-RAG's performance against baselines on aggregated Multi-hop datasets (HotpotQA, 2WikiMultihopQA, MuSiQue).
Multi-hop Avg (HotpotQA, 2Wiki, MuSiQue)	EM	31.0	36.5	+5.5
Multi-hop Avg (HotpotQA, 2Wiki, MuSiQue)	EM	24.5	36.5	+12.0
Multi-hop Avg (HotpotQA, 2Wiki, MuSiQue)	EM	35.2	36.5	+1.3
Performance on Single-hop datasets (SQuAD, NQ, TriviaQA) showing Adaptive-RAG does not regress on simpler queries.
Single-hop Avg (SQuAD, NQ, TriviaQA)	EM	46.1	46.3	+0.2
Single-hop Avg (SQuAD, NQ, TriviaQA)	EM	34.0	46.3	+12.3

Experiment Figures

Efficiency vs. Accuracy plot. X-axis: Inference Time (Efficiency), Y-axis: Accuracy (EM).

Main Takeaways

Adaptive-RAG balances efficiency and accuracy: it matches the performance of expensive multi-step methods on complex data while avoiding their cost on simple data.
The 'silver label' construction strategy (using model outcomes + dataset bias) is effective for training the complexity classifier without human annotation.
The classifier (T5-Large) is lightweight enough that its overhead is negligible compared to the savings from avoiding unnecessary retrieval steps.
Existing adaptive baselines (like Mallen et al.) that rely on heuristics like entity popularity struggle to generalize across the full spectrum of query complexities.
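
The efficiency argument above can be made concrete with a small expected-cost calculation: routing pays the classifier cost on every query, plus each strategy's cost weighted by how often it is selected. The function name and every number in the usage note are made up for illustration; none of them come from the paper.

```python
from typing import Dict

def expected_cost(label_share: Dict[str, float],
                  strategy_cost: Dict[str, float],
                  classifier_cost: float) -> float:
    """Average per-query cost when queries are routed by predicted label.

    label_share: fraction of queries routed to each label (must sum to 1).
    strategy_cost: relative cost of running each label's solver once.
    classifier_cost: relative cost of one classifier call.
    """
    if abs(sum(label_share.values()) - 1.0) > 1e-9:
        raise ValueError("label shares must sum to 1")
    routed = sum(share * strategy_cost[label]
                 for label, share in label_share.items())
    return classifier_cost + routed
```

With hypothetical shares of 0.2 / 0.5 / 0.3 for A / B / C, relative costs of 1 / 2 / 10, and a classifier cost of 0.1, the routed pipeline averages about 4.3 cost units per query, compared with 10 for always running the multi-step solver. The saving shrinks as the share of label C grows.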

📚 Prerequisite Knowledge

Prerequisites

Understanding of Retrieval-Augmented Generation (RAG)
Familiarity with multi-hop vs. single-hop QA tasks
Basic knowledge of classifier training with distillation or silver labels

Key Terms

RAG: Retrieval-Augmented Generation, an approach in which a language model answers a question by first retrieving relevant documents and then generating an answer conditioned on them

single-hop QA: Questions where the answer can be found in a single document or reasoning step

multi-hop QA: Questions requiring reasoning across multiple documents (e.g., bridging entity A to B to C)

Iter-Retgen: Iterative Retrieval-Generation—a method that interleaves generation and retrieval steps multiple times

silver labels: Training labels generated automatically (e.g., by checking which model answers correctly) rather than by humans

inductive bias: Assumptions built into a learning algorithm or data (e.g., assuming all questions in a 'multi-hop' dataset are complex)

FLAN-T5: A family of instruction-tuned language models based on the T5 architecture

Retriever: A module (like DPR or Contriever) that finds relevant documents from a large corpus

Reader: The LLM component that processes the query and retrieved documents to generate an answer