Distilling LLM Agent into Small Models with Retrieval and Code Tools

📝 Paper Summary

Agentic RAG pipeline, Reasoning Distillation

Agent Distillation transfers interactive tool-use capabilities from large language models to smaller ones by training them on reason-act-observe trajectories rather than static reasoning traces.

Core Problem

Standard chain-of-thought (CoT) distillation teaches small models to mimic reasoning traces but fails to impart the ability to interact with external tools or verify computations.

Why it matters:

Small models often hallucinate facts or fail at precise arithmetic when relying solely on internal weights via CoT
Naive distillation of reasoning traces does not generalize well to out-of-distribution tasks requiring new knowledge or complex calculations
Existing methods focus on distilling static reasoning, missing the dynamic 'agentic' behavior (acting and observing) crucial for solving complex real-world problems

Concrete Example: For the question 'What would $100 invested in Apple stock in 2010 be worth by 2020?', a CoT-distilled model might hallucinate the stock price or miscalculate the multiplication. An agent-distilled model would generate Python code to retrieve the stock history and perform the calculation exactly.

Key Novelty

Agent Distillation with First-Thought Prefix and Self-Consistent Action Generation

Trains small models on interactive 'Thought -> Action -> Observation' trajectories (using code/retrieval tools) instead of static text-only reasoning traces
Aligns the teacher's agentic behavior with its internal reasoning strengths by forcing the first agent thought to match a standard Chain-of-Thought step (First-Thought Prefix)
Enhances test-time robustness by sampling multiple action candidates and selecting the one that executes successfully and produces consistent outputs (Self-Consistent Action Generation)

Architecture

Overview of the two proposed methods: First-Thought Prefix (FTP) for teacher trajectory generation and Self-Consistent Action Generation (SAG) for student inference.

Evaluation Highlights

Distilled 0.5B, 1.5B, and 3B agent models achieve performance comparable to next-tier larger (1.5B, 3B, 7B) CoT-distilled models on average
7B Agent model outperforms the teacher-sized 32B CoT model on average across 8 reasoning benchmarks
Significant gains on out-of-domain mathematical tasks: 1.5B agent matches 3B CoT performance on benchmarks like GSM-Hard and OlymMATH

Breakthrough Assessment

8/10

Strong empirical evidence that distilling interactive behavior (agency) is more parameter-efficient than distilling static reasoning (CoT), enabling very small models to solve complex tasks.

⚙️ Technical Details

Problem Definition

Setting: Distilling an interactive teacher policy (agent) into a student policy via supervised fine-tuning on generated trajectories

Inputs: Natural language query x

Outputs: A sequence of interleaved thoughts, actions (code/search), and observations leading to a final answer

Pipeline Flow

Input Query -> Student Agent Model -> [Loop: Thought -> Action Generation -> Code Interpreter/Environment -> Observation] -> Final Answer
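The flow above can be sketched as a short loop. This is a minimal illustration, not the paper's implementation: the `policy` and `execute` callables, the step tuple format, and the toy stand-ins are all hypothetical; only the step cap of 5 comes from the hyperparameters reported in this summary.

```python
from typing import Callable, List, Optional, Tuple

# One policy step: (thought, action_code, final_answer).
# final_answer stays None until the policy decides to stop (hypothetical format).
Step = Tuple[str, str, Optional[str]]
History = List[Tuple[str, str, str]]  # (thought, action, observation) triples


def run_agent(
    query: str,
    policy: Callable[[str, History], Step],
    execute: Callable[[str], str],
    max_steps: int = 5,  # inference_max_steps reported in the summary
) -> Optional[str]:
    """Run Thought -> Action -> Observation until an answer or the step cap."""
    history: History = []
    for _ in range(max_steps):
        thought, action, final_answer = policy(query, history)
        if final_answer is not None:
            return final_answer
        observation = execute(action)
        history.append((thought, action, observation))
    return None  # step budget exhausted without a final answer


# Toy stand-ins so the loop can run without a model or a sandbox.
def scripted_policy(query: str, history: History) -> Step:
    if not history:
        return ("Compute the product with code.", "6 * 7", None)
    return ("The tool returned the product.", "", history[-1][2])


def toy_execute(action: str) -> str:
    # Evaluates plain arithmetic expressions only; not a real interpreter.
    return str(eval(action, {"__builtins__": {}}))


answer = run_agent("What is 6 times 7?", scripted_policy, toy_execute)
```

In a real system, `policy` would wrap the student model and parse its generated text into a thought and a code action.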

System Modules

Student Agent

Generates thoughts and code actions based on query and interaction history

Model or implementation: Qwen2.5-Instruct (0.5B, 1.5B, 3B, or 7B)

Code Interpreter

Executes Python code generated by the agent and returns output or errors

Model or implementation: Python Interpreter
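A rough sketch of such an interpreter module, assuming the observation is simply the captured standard output or the final line of the traceback. The function name `run_python` and the result format are hypothetical, and this version runs code in-process with no sandboxing, which a real deployment would need.

```python
import contextlib
import io
import traceback


def run_python(code: str) -> dict:
    """Execute code and return captured stdout, or an error message on failure."""
    buffer = io.StringIO()
    namespace: dict = {}  # fresh globals for each call
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # WARNING: unsandboxed, illustration only
    except Exception:
        # Keep only the last traceback line, e.g. "ZeroDivisionError: division by zero".
        error = traceback.format_exc().strip().splitlines()[-1]
        return {"ok": False, "output": buffer.getvalue(), "error": error}
    return {"ok": True, "output": buffer.getvalue(), "error": None}
```

Returning the error text rather than raising lets the agent see the failure as an observation and attempt a fix on the next step.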

Self-Consistent Action Generator (SAG)

Samples multiple actions, filters errors, and selects consistent outcome

Model or implementation: Heuristic / Voting Mechanism
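The sample, filter, and vote procedure might look like the sketch below. It assumes voting is done over the observation strings and that the first action producing the winning observation is kept; `sample_action`, `execute`, and the `None` fallback when every candidate fails are hypothetical choices, not details confirmed by the summary. Sampling temperature (0.4 in the reported setup) would live inside `sample_action`.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def self_consistent_action(
    sample_action: Callable[[], str],
    execute: Callable[[str], Tuple[bool, str]],
    n_samples: int = 8,  # sag_samples_N reported in the summary
) -> Optional[Tuple[str, str]]:
    """Sample actions, drop those that fail, and keep the majority observation."""
    successes: List[Tuple[str, str]] = []
    for _ in range(n_samples):
        action = sample_action()
        ok, observation = execute(action)
        if ok:
            successes.append((action, observation))
    if not successes:
        return None  # assumed fallback: caller decides how to recover
    votes = Counter(obs for _, obs in successes)
    winner, _count = votes.most_common(1)[0]
    # Return the first sampled action that produced the winning observation.
    return next((a, o) for a, o in successes if o == winner)


def toy_execute(action: str) -> Tuple[bool, str]:
    # Toy executor: evaluates arithmetic expressions, reports failures.
    try:
        return True, str(eval(action, {"__builtins__": {}}))
    except Exception as exc:
        return False, repr(exc)
```

With candidates `"2 + 2"`, `"2 +"`, `"2 * 2"`, `"2 + 3"`, `"1 / 0"`, `"4"`, `"5"`, `"2 ** 2"`, two candidates fail, four produce `4`, two produce `5`, and the selected pair is `("2 + 2", "4")`.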

Novel Architectural Elements

Integration of First-Thought Prefix (FTP) during data generation to stabilize teacher trajectories
Integration of Self-Consistent Action Generation (SAG) during student inference to filter invalid code
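As an illustration of the First-Thought Prefix idea, the sketch below seeds the agent's first thought with the opening step of a CoT response, so the teacher continues from that prefix in agent format. The prompt layout, the function names, and the choice to treat the first non-empty line as "the first step" are all assumptions made for this example.

```python
def first_step(cot_response: str) -> str:
    """Return the first non-empty line of a CoT response (assumed step boundary)."""
    for line in cot_response.splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_ftp_prompt(agent_prompt: str, question: str, cot_response: str) -> str:
    """Build an agent prompt whose first thought is pre-filled with the CoT step."""
    prefix = first_step(cot_response)
    # The teacher would continue generating from the end of this string.
    return f"{agent_prompt}\n\nQuestion: {question}\nThought: {prefix}"


prompt = build_ftp_prompt(
    "You are an agent that can run Python code.",  # hypothetical system text
    "What is 17 * 23?",
    "\nFirst, break 23 into 20 + 3.\nThen multiply each part by 17.",
)
```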

Modeling

Base Model: Qwen2.5-Instruct (0.5B, 1.5B, 3B, 7B)

Training Method: Supervised Fine-Tuning (Distillation) on Teacher Trajectories

Objective Functions:

Purpose: Minimize difference between student and teacher distributions on valid trajectories.

Formally: standard next-token prediction loss $\mathcal{L}_{\text{distill}} = -\sum_{t} \log p_S(y_t \mid x, y_{<t})$, summed over the tokens of valid teacher trajectories and excluding observation tokens.
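A small numeric sketch of the observation masking, assuming the student's per-token probabilities are already available as a list. The function name and the boolean mask format are hypothetical; real training code would compute this over logits in a deep learning framework.

```python
import math
from typing import Sequence


def distill_loss(token_probs: Sequence[float], is_observation: Sequence[bool]) -> float:
    """Negative log-likelihood summed over non-observation tokens only."""
    if len(token_probs) != len(is_observation):
        raise ValueError("token_probs and is_observation must have equal length")
    return -sum(
        math.log(p)
        for p, masked in zip(token_probs, is_observation)
        if not masked  # observation tokens come from the tool, not the student
    )


# Two thought/action tokens contribute; the observation token is skipped.
loss = distill_loss([0.5, 0.25, 0.1], [False, False, True])
```

Here the loss equals `-(log 0.5 + log 0.25) = log 8`, and the third token has no effect because it belongs to an observation.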

Adaptation: LoRA (rank 64) on all linear layers

Trainable Parameters: Not reported in the paper
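Although the count is not reported, the number of LoRA parameters can be estimated from layer shapes: standard LoRA adds two matrices of sizes r x d_in and d_out x r to each adapted linear layer. The sketch below applies that formula with rank 64; the layer shapes are made-up placeholders, not the actual Qwen2.5 dimensions, so the result is illustrative only.

```python
from typing import Iterable, Tuple


def lora_param_count(layer_shapes: Iterable[Tuple[int, int]], rank: int = 64) -> int:
    """Trainable LoRA parameters: rank * (d_in + d_out) per adapted linear layer."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


# Hypothetical (d_in, d_out) shapes for three linear layers.
example_shapes = [(1024, 1024), (1024, 4096), (4096, 1024)]
count = lora_param_count(example_shapes)  # 64 * (2048 + 5120 + 5120) = 786432
```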

Training Data:

1,000 HotPotQA examples (factual)
2,000 MATH examples (mathematical)
Trajectories generated by Qwen2.5-32B-Instruct teacher
Filtered for correctness, resulting in ~2,000 total trajectories
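The correctness filter could be sketched as follows, assuming a trajectory is kept when its final answer matches the gold answer after light normalization. The `Trajectory` fields and the normalization rule are hypothetical, and the summary does not say which matching criterion the authors used for filtering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    question: str
    steps: List[str]  # interleaved thoughts, actions, observations
    predicted_answer: str
    gold_answer: str


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(answer.lower().split())


def filter_correct(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only trajectories whose final answer matches the gold answer."""
    return [
        t for t in trajectories
        if normalize(t.predicted_answer) == normalize(t.gold_answer)
    ]


candidates = [
    Trajectory("What is 2 + 2?", ["Thought: add", "print(2 + 2)", "4"], "4", "4"),
    Trajectory("Capital of France?", ["Thought: recall"], "Lyon", "Paris"),
    Trajectory("Capital of Japan?", ["Thought: search"], " tokyo ", "Tokyo"),
]
kept = filter_correct(candidates)
```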

Key Hyperparameters:

learning_rate: 2e-4
batch_size: 8
epochs: 2
lora_rank: 64
inference_max_steps: 5
sag_samples_N: 8
sag_temperature: 0.4

Compute: Four NVIDIA A100 80GB GPUs

Comparison to Prior Work

vs. CoT Distillation: Distills interactive CodeAct trajectories (Thought-Action-Observation) instead of static text traces
vs. FireAct: Focuses on distilling into much smaller models (0.5B-3B) rather than >=7B, and introduces trajectory improvement methods (FTP, SAG)
vs. Tora: Uses a general CodeAct framework rather than specialized math tool formats and emphasizes small-model generalization

Limitations

Agentic behavior can be out-of-distribution for pre-trained sLMs, potentially degrading performance on tasks well-suited for pure CoT
sLMs (especially 0.5B) struggle to produce syntactically correct code, requiring heavy reliance on error filtering (SAG)
Larger models (3B/7B) sometimes benefit less from tools on standard tasks (MATH500) where their internal knowledge is already sufficient

Reproducibility

Code: https://github.com/Nardien/agent-distillation

Publicly available: code (repository linked above). Missing: exact training time. Dependencies: Qwen2.5-32B-Instruct as the teacher and a 2018 Wikipedia dump for retrieval.

📊 Experiments & Results

Evaluation Setup

Evaluated on Factual (RAG-style) and Mathematical reasoning tasks, testing both In-Domain and Out-of-Domain (OOD) generalization.

Benchmarks:

HotPotQA (Factual Reasoning (In-Domain))
MuSiQue (Factual Reasoning (OOD))
Bamboogle (Factual Reasoning (OOD))
FEVEROUS (Factual Reasoning (OOD))
MATH500 (Mathematical Reasoning (In-Domain))
GSM-Hard (Mathematical Reasoning (OOD))
AIME (Mathematical Reasoning (OOD))
OlymMATH (Mathematical Reasoning (OOD))

Metrics:

Exact Match (Math)
LLM-as-a-judge accuracy (Factual, using gpt-4o-mini)
Statistical methodology: Not explicitly reported in the paper

Key Results

Comparative analysis across model scales shows Agent Distillation consistently outperforming CoT Distillation, with small agents matching larger CoT models.

Benchmark	Metric	Baseline (CoT Distillation)	This Paper (Agent Distillation)	Δ
Average (8 tasks)	Accuracy	26.3	31.9	+5.6
Average (8 tasks)	Accuracy	41.8	49.6	+7.8
Average (8 tasks)	Accuracy	47.6	56.4	+8.8

Ablation on Out-of-Domain Mathematical Reasoning shows the specific benefit of agentic tools for harder problems.

Benchmark	Metric	Baseline (CoT Distillation)	This Paper (Agent Distillation)	Δ
GSM-Hard	Exact Match	35.2	48.4	+13.2
OlymMATH	Exact Match	38.2	49.8	+11.6

Experiment Figures

Performance comparison (Average Accuracy) across model sizes (0.5B to 7B) for CoT Distillation vs. Agent Distillation.

Main Takeaways

Agent Distillation lets sLMs (0.5B-3B) match, on average, CoT-distilled models of the next size tier (1.5B, 3B, and 7B respectively), i.e. models roughly 2-3x their size.
The proposed First-Thought Prefix (FTP) improves teacher trajectory quality by anchoring agentic reasoning to instruction-tuned CoT patterns.
Self-Consistent Action Generation (SAG) is crucial for small models, filtering out frequent syntax/execution errors common in sLM code generation.
Gains are most pronounced in Out-of-Domain settings (unseen math/factual tasks), suggesting that learning how to use tools generalizes better than imitating static reasoning traces.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Knowledge Distillation
Language Agents (ReAct, CodeAct frameworks)
Rejection Sampling / Majority Voting

Key Terms

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

sLM: Small Language Model—typically models with fewer than 7 billion parameters

Agentic Behavior: The ability of an AI to autonomously reason, plan, and execute actions (like running code or searching) to solve a task

Trajectory: A sequence of interactions consisting of thoughts, actions, and environmental observations (outputs from tools)

CodeAct: A framework where LLMs use executable code (e.g., Python) as their primary form of action/tool use

First-Thought Prefix (FTP): A proposed method where the initial reasoning step from a CoT prompt is forced as the prefix for an agent's first thought to align behavior

Self-Consistent Action Generation (SAG): A test-time inference method that samples multiple action sequences, filters out execution errors, and selects the most consistent result via voting

RAG: Retrieval-Augmented Generation—providing models with external documents to ground their answers

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices added to selected layers