Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities

📝 Paper Summary

Agentic RAG pipeline
Compact Language Models (SLMs)

DGPO enables compact language models (0.5–1B parameters) to achieve sophisticated agentic RAG behaviors by combining cold-start distillation from a teacher with selective reinforcement learning that corrects only wrong reasoning paths.

Core Problem

Applying Reinforcement Learning (RL) to compact models for agentic RAG fails because their poor initial capabilities lead to sparse rewards and unstable training, while standard distillation suffers from exposure bias.

Why it matters:

Agentic RAG systems currently rely on massive LLMs, making them inaccessible for resource-constrained environments.
Smaller models (0.5-1B) struggle with the 'cold-start' problem in RL; they rarely generate correct search trajectories on their own to learn from.
Standard distillation methods fail to transfer the complex decision-making process (when to search vs. answer) needed for autonomous agents.

Concrete Example: When asked a complex multi-hop question like 'Whose album was Red?', a naive compact model might guess directly or search literally for keywords. A larger teacher model knows to rewrite the query to 'Red album artist'. Without guidance, the compact model rarely discovers this strategy on its own, leading to zero reward and no learning.

Key Novelty

Distillation-Guided Policy Optimization (DGPO)

Initializes the student via cold-start distillation on teacher-generated correct trajectories to establish a baseline capability.
During RL, uses a selective 'mimic if wrong, reward if right' mechanism: the student is rewarded for correct answers but penalized with KL divergence toward the teacher only when it fails.
Introduces Agentic RAG Capabilities (ARC), a fine-grained evaluation metric decomposing performance into thinking, query rewriting, and source referencing components.

Architecture

The DGPO training framework illustrating the two-phase process: Cold-Start Initialization and Distillation-Guided RL.

Evaluation Highlights

DGPO with a 0.5B student outperforms the 3B teacher model on NQ (48.1 vs 47.9), PopQA (45.3 vs 44.2), and HotpotQA (44.6 vs 43.5) datasets.
Achieves the highest average Exact Match score (40.0) across 7 QA datasets with Qwen2.5-0.5B, ahead of KD (38.8), PPO (36.0), and GKD (31.1).
Generalizes across model families: a Llama-3-1B student trained via DGPO outperforms standard PPO by 4.8 points on average (37.3 vs 32.5).

Breakthrough Assessment

8/10

Demonstrates that extremely small models (0.5B) can outperform significantly larger teachers in agentic tasks, challenging the assumption that agentic RAG requires massive parameters.

⚙️ Technical Details

Problem Definition

Setting: Agentic RAG where an LLM functions as a policy making sequential decisions (thought, search, answer) at each timestep

Inputs: User question x and an external retrieval system R

Outputs: Sequence of actions y containing structured tokens <think>, <search>, <information>, <answer>

Pipeline Flow

Agent (Student Policy) receives question
Action Generation (Think/Search/Answer)
Environment Interaction (Search Engine)
Reward/Penalty Calculation (Selective KL + EM Reward)
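
The first three steps above can be sketched as a simple rollout loop. This is a minimal illustration, not the paper's implementation: `policy` and `retrieve` are hypothetical callables standing in for the student model and the search engine, `max_turns` is an arbitrary cap, and the closing tags (`</search>`, `</answer>`, `</information>`) are an assumed convention, since only the opening tokens are listed in this summary. Reward calculation is left out of this sketch.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed tag convention: each action is wrapped in an open/close pair.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_episode(
    question: str,
    policy: Callable[[str], str],          # hypothetical: context -> next generated segment
    retrieve: Callable[[str], List[str]],  # hypothetical: query -> retrieved passages
    max_turns: int = 4,                    # arbitrary cap chosen for illustration
) -> Tuple[Optional[str], str]:
    """Roll out one think/search/answer episode and return (answer, full trajectory)."""
    context = question
    for _ in range(max_turns):
        step = policy(context)
        context += step

        # A final answer ends the episode.
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1).strip(), context

        # A search action calls the retriever and appends the results.
        query = SEARCH_RE.search(step)
        if query:
            passages = retrieve(query.group(1).strip())
            context += "<information>" + "\n".join(passages) + "</information>"

    # No answer was produced within the turn budget.
    return None, context
```

In this sketch the trajectory string returned alongside the answer is what a reward function would later score.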

System Modules

Student Agent

Generates thought, search queries, or final answers

Model or implementation: Qwen2.5-0.5B-instruct (also tested 7B, Llama-3-1B, Llama-3-8B)

Retriever

External search engine returning documents for queries

Model or implementation: E5 retriever + Wikipedia 2018 dump

Teacher Model

Provides reference distribution for KL penalty when student is incorrect

Model or implementation: Search-R1-PPO-3B (based on Qwen2.5-3B-instruct)

Novel Architectural Elements

Selective KL penalty mechanism: Applies teacher regularization (KL divergence) ONLY to incorrect student trajectories, while allowing pure RL exploration for correct ones

Modeling

Base Model: Qwen2.5-0.5B-instruct

Training Method: Distillation-Guided Policy Optimization (DGPO)

Objective Functions:

Purpose: Cold-start initialization.

Formally: L_distill = L_CE(pi_g, pi_theta) + lambda * D_KL[pi_g || pi_theta] on filtered correct Teacher-Generated Outputs, where pi_g is the teacher policy and pi_theta is the student policy.

Purpose: RL optimization with selective guidance.

Formally: Maximize the PPO objective with reward r_phi(x,y).

Purpose: Define reward signal.

Formally: r_phi(x,y) = 1 if y=y* (correct), else -beta * D_KL[pi_theta || pi_g] (mimic teacher if wrong).
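
The cold-start loss and the selective reward above can be sketched on toy per-token distributions. This is a hedged illustration under several assumptions that are not stated in this summary: L_CE is read as the negative log-likelihood of the teacher's tokens under the student, the sequence-level KL is taken as the mean of per-token KL values, all probabilities are strictly positive, and the default values of `lam` and `beta` are placeholders rather than the paper's settings.

```python
import math
from typing import Sequence

Dist = Sequence[float]  # one probability distribution over the vocabulary


def kl_divergence(p: Dist, q: Dist) -> float:
    """D_KL[p || q] for discrete distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(
    teacher_token_ids: Sequence[int],
    teacher_probs: Sequence[Dist],
    student_probs: Sequence[Dist],
    lam: float = 1.0,  # placeholder weight, not the paper's value
) -> float:
    """Mean over tokens of CE(teacher token, student) + lam * D_KL[teacher || student]."""
    total = 0.0
    for tok, p_g, p_theta in zip(teacher_token_ids, teacher_probs, student_probs):
        cross_entropy = -math.log(p_theta[tok])
        total += cross_entropy + lam * kl_divergence(p_g, p_theta)
    return total / len(teacher_token_ids)


def selective_reward(
    is_correct: bool,
    student_probs: Sequence[Dist],
    teacher_probs: Sequence[Dist],
    beta: float = 0.1,  # placeholder weight, not the paper's value
) -> float:
    """Return 1 for a correct answer, else -beta * mean D_KL[student || teacher]."""
    if is_correct:
        return 1.0  # correct trajectories get pure task reward, no teacher penalty
    mean_kl = sum(
        kl_divergence(p_theta, p_g) for p_theta, p_g in zip(student_probs, teacher_probs)
    ) / len(student_probs)
    return -beta * mean_kl
```

Note that the two functions use KL in opposite directions, matching the formulas above: teacher-to-student during cold start, student-to-teacher in the reward.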

Training Data:

Training sets of NQ (Natural Questions) and HotpotQA
Teacher generates trajectories; only correct ones are kept for initialization
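
The filtering step above, keeping only teacher trajectories whose final answer is correct, can be sketched as follows. The record layout (`qid`, `answer`) and the function names are hypothetical, and correctness here is a simple case-insensitive string comparison; the matching rule actually used in the paper may differ.

```python
from typing import Dict, Iterable, List


def is_correct(prediction: str, gold_answers: Iterable[str]) -> bool:
    """Case- and whitespace-insensitive comparison against any gold answer (assumed rule)."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(prediction) in {norm(g) for g in gold_answers}


def keep_correct_trajectories(
    trajectories: List[Dict[str, str]],  # hypothetical records: {"qid": ..., "answer": ...}
    gold: Dict[str, List[str]],          # hypothetical mapping: qid -> accepted answers
) -> List[Dict[str, str]]:
    """Return only the teacher trajectories whose final answer matches a gold answer."""
    return [t for t in trajectories if is_correct(t["answer"], gold[t["qid"]])]
```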

Key Hyperparameters:

computational_requirements: 8 x NVIDIA H200 GPUs
retrieved_passages: 3

Comparison to Prior Work

vs. Search-R1 (PPO): DGPO uses active teacher guidance (KL on errors) rather than just a frozen reference model, stabilizing training for small models.
vs. SeqKD/Standard KD: DGPO transitions to RL to allow exploration beyond the teacher's distribution, whereas KD is limited to supervised imitation.
vs. GKD: DGPO uses a selective penalty mechanism (conditional on correctness) and a distinct cold-start phase, preventing collapse from noisy early student outputs.
vs. RAFT: Focuses on optimizing the RAG process (search/reasoning) rather than just robustness to distractors [not cited in paper].

Limitations

Over-optimization for simple queries: RL phase sometimes degrades performance on complex multi-hop reasoning (MuSiQue) compared to pure KD.
Dependence on teacher quality: Requires a capable teacher model to generate initial trajectories and guidance.
Evaluation metric limited to Exact Match (EM), which may not capture partial correctness in long-form answers.

Reproducibility

The paper does not provide code. The datasets (NQ, HotpotQA, etc.) and base models (Qwen, Llama) are publicly available.

📊 Experiments & Results

Evaluation Setup

Open-domain QA with external retrieval (Wikipedia 2018) via E5 retriever

Benchmarks:

NQ, Natural Questions (General QA, single-hop)
TriviaQA (General QA)
PopQA (General QA)
HotpotQA (Multi-hop QA)
2WikiMultiHopQA (Multi-hop QA)
MuSiQue (Multi-hop QA)
Bamboogle (Multi-hop QA)

Metrics:

Exact Match (EM)
Hit Ratio (for retrieval)
Agentic RAG Capabilities (ARC) metrics

Statistical methodology: Not explicitly reported in the paper
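
Exact Match, the primary metric listed above, can be sketched as below. The normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) follows a common open-domain QA convention and is an assumption here; this summary does not state which normalization the paper applies.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace (assumed convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)
```

Because the comparison is all-or-nothing, a prediction such as "Paris, France" scores zero against the gold answer "Paris", which is the partial-correctness gap noted under Limitations.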

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main comparison across 7 datasets showing DGPO superiority over baselines and sometimes the teacher.
Average (7 datasets)	Exact Match (EM)	36.0	40.0	+4.0
HotpotQA	Exact Match (EM)	43.5	44.6	+1.1
NQ	Exact Match (EM)	41.8	48.1	+6.3
Cross-architecture generalization results.
Average (7 datasets)	Exact Match (EM)	32.5	37.3	+4.8
Ablation study demonstrating the necessity of each component.
Average (7 datasets)	Exact Match (EM)	32.2	40.0	+7.8
Average (7 datasets)	Exact Match (EM)	39.1	40.0	+0.9

Experiment Figures

Comparison of Prompt-based vs. RL-based performance across different model sizes (0.5B to 32B).

Main Takeaways

DGPO enables 0.5B models to surpass the 3B teacher model on specific datasets (NQ, PopQA, HotpotQA), showing that compact models can be effective agents.
Pure RL (PPO) and on-policy distillation (GKD) underperform on compact models (36.0 and 31.1 average EM vs. 40.0 for DGPO), which the paper attributes to the cold-start problem: the initial success rate is extremely low.
ARC evaluation reveals DGPO excels at 'Source Referencing' (extracting answers) but sometimes over-optimizes 'Thinking' steps for simpler queries, occasionally performing slightly worse on complex reasoning (MuSiQue) than pure KD.
The 'selective KL penalty' is crucial: it allows the student to diverge from the teacher when the student is correct (exploration) but forces mimicry when the student fails (correction).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Knowledge Distillation
Retrieval-Augmented Generation (RAG)

Key Terms

SGO: Student-Generated Outputs—sequences generated by the model currently being trained

TGO: Teacher-Generated Outputs—sequences generated by a larger, more capable frozen model

Agentic RAG: A framework where LLMs autonomously coordinate retrieval, query reformulation, and evidence integration using special action tokens

Cold-start problem: In RL, when a model is too weak to ever generate a correct solution, it never receives a positive reward signal and thus cannot learn

ARC: Agentic RAG Capabilities—a metric proposed in this paper analyzing reasoning, search coordination, and response synthesis separately

Exact Match (EM): A metric checking if the generated answer string exactly matches the ground truth

PPO: Proximal Policy Optimization, an RL algorithm that keeps training stable by clipping how far each update can move the policy

KL divergence: An asymmetric measure of how one probability distribution differs from a reference distribution; D_KL[p || q] generally differs from D_KL[q || p]

Exposure bias: A problem in training where a model learns from ground-truth history during training but must generate its own history during inference, leading to error accumulation
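
The direction of the KL divergence matters in this method: the cold-start loss uses D_KL[pi_g || pi_theta], while the RL penalty uses D_KL[pi_theta || pi_g]. The toy example below, with made-up two-outcome distributions, shows that swapping the arguments gives a different value.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL[p || q] for discrete distributions (assumes q > 0 wherever p > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up distributions for illustration only.
peaked = [0.9, 0.1]
uniform = [0.5, 0.5]

forward = kl_divergence(peaked, uniform)  # D_KL[peaked || uniform]
reverse = kl_divergence(uniform, peaked)  # D_KL[uniform || peaked]

print(f"forward={forward:.4f} reverse={reverse:.4f} equal={math.isclose(forward, reverse)}")
```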