Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL

📝 Paper Summary

End-to-end Agent Foundation Models · Multi-Agent Systems · Tool-Integrated Reasoning

Chain-of-Agents (CoA) distills the collaborative capabilities of multi-agent systems into a single end-to-end model, enabling it to dynamically orchestrate role-playing and tool-use without complex prompt engineering.

Core Problem

Existing multi-agent systems depend on manual prompt engineering and rigid workflows, which adds computational overhead and prevents data-centric learning. Standard Tool-Integrated Reasoning (TIR) models, in turn, cannot support diverse role-playing or complex collaboration.

Why it matters:

Traditional multi-agent systems suffer from high token costs due to redundant inter-agent communication and struggle to generalize without extensive reconfiguration
Current LLMs are not natively trained to support multi-turn, multi-agent, and multi-tool workflows, relying instead on fragile prompt engineering
Bridging the gap between the flexibility of multi-agent systems and the efficiency of end-to-end models is crucial for scalable complex problem solving

Concrete Example: In a standard multi-agent system solving a deep research task, agents might exchange repetitive messages like 'Reviewing...' or 'Handing off to...', consuming tokens without progressing the state. CoA internalizes this handover, allowing a single model to switch from a 'Plan Agent' role to a 'Search Agent' role seamlessly within one generation stream.

Key Novelty

Chain-of-Agents (CoA) Paradigm & Agent Foundation Models (AFMs)

Interleaves reasoning thoughts, tool actions, and 'role' tokens within a single model's context window to simulate multi-agent collaboration end-to-end
Uses 'Multi-Agent Distillation' to convert execution trajectories from expert multi-agent systems (like OAgents) into linear training data for a single model
Employs progressive filtering and agentic reinforcement learning on verifiable tasks to refine tool orchestration and error correction
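
As an illustration of the distillation step, the sketch below serializes a recorded multi-agent run into one linear string that a single model could be fine-tuned on. The `AgentStep` schema, the tag names (`<plan>`, `<web_search>`, `<observation>`, `<answer>`), and `flatten_trajectory` are all hypothetical; the paper's actual serialization format is not given in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentStep:
    """One step recorded from a multi-agent run (hypothetical schema)."""
    role: str                          # e.g. "think", "plan", "web_search"
    content: str                       # text the agent produced
    observation: Optional[str] = None  # tool output, if the step called a tool


def flatten_trajectory(query: str, steps: List[AgentStep], answer: str) -> str:
    """Serialize a multi-agent run into one linear training string.

    Each agent turn becomes a role-tagged segment; tool outputs follow
    as <observation> segments. Tag names are assumptions for illustration.
    """
    parts = [f"<query>{query}</query>"]
    for step in steps:
        parts.append(f"<{step.role}>{step.content}</{step.role}>")
        if step.observation is not None:
            parts.append(f"<observation>{step.observation}</observation>")
    parts.append(f"<answer>{answer}</answer>")
    return "\n".join(parts)


if __name__ == "__main__":
    run = [
        AgentStep("plan", "1. Search for the release year. 2. Answer."),
        AgentStep("web_search", "release year of X", observation="X was released in 1999."),
    ]
    print(flatten_trajectory("When was X released?", run, "1999"))
```

Inter-agent handoff messages have no slot in this format: only role-tagged content and observations are kept, which is one plausible reading of how the linearized data avoids communication overhead.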

Evaluation Highlights

+3.8% improvement on GAIA (Level 3) over RL-enhanced WebDancer using a stronger backbone, achieving state-of-the-art 55.3% with Qwen-2.5-32B
Reduces inference cost (token consumption) by 84.6% compared to traditional multi-agent systems while maintaining competitive performance
Achieves 59.8% solve rate on AIME 2025, outperforming previous TIR methods like SimpleTIR and ReTool by at least 10.5 percentage points
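
To put the 84.6% token reduction in perspective, a quick conversion follows. It assumes the figure is measured as tokens consumed by the single model relative to the multi-agent baseline on the same tasks, which this summary does not spell out; the symbols T_CoA and T_MAS are introduced here only for illustration.

```latex
\[
\frac{T_{\text{CoA}}}{T_{\text{MAS}}} = 1 - 0.846 = 0.154,
\qquad
\frac{T_{\text{MAS}}}{T_{\text{CoA}}} = \frac{1}{0.154} \approx 6.5
\]
```

Under that reading, the multi-agent baseline spends roughly 6.5 times as many tokens as the end-to-end model.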

Breakthrough Assessment

9/10

Significantly advances agentic AI by successfully distilling complex multi-agent dynamics into a single efficient model, achieving SOTA across diverse web and code benchmarks while drastically reducing compute costs.

⚙️ Technical Details

Problem Definition

Setting: Complex query resolution via dynamic module orchestration within a unified model

Inputs: Natural language query q

Outputs: Final answer A derived from a trajectory of thoughts, tool actions, observations, and role transitions

Pipeline Flow

Thinking Agent (orchestrates reasoning)
Plan Agent (decomposes tasks)
Tool Agents (Search, Crawl, Code Generate)
Reflection/Verification Agents (critique and validate)

System Modules

Thinking Agent (Reasoning & Orchestration)

Orchestrate reasoning pipeline, activate specialized agents, maintain solution state

Model or implementation: Qwen-2.5-32B-Instruct (AFM)

Plan Agent (Reasoning & Orchestration)

Decompose query into structured task sequences

Model or implementation: Qwen-2.5-32B-Instruct (AFM)

Tool Agents (Search, Crawl, Code)

Execute domain-specific actions (e.g., formulate search queries, execute python code)

Model or implementation: Qwen-2.5-32B-Instruct (AFM)

Reflection/Verification Agent

Conduct self-critique, resolve inconsistencies, validate reasoning against formal criteria

Model or implementation: Qwen-2.5-32B-Instruct (AFM)

Novel Architectural Elements

Native multi-agent simulation within single-model decoding: The model outputs specific 'role' tokens to switch internal context, effectively running a multi-agent system in a single inference pass without external framework overhead.
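
A minimal sketch of how an inference loop might read role switches out of a single generation stream. It assumes roles are emitted as XML-style tags such as `<think>`, `<plan>`, and `<web_search>`; these tag names, `split_roles`, and `next_tool_call` are hypothetical and not taken from the paper.

```python
import re
from typing import List, Optional, Tuple

# Matches <role>body</role>, with the closing tag forced to equal the opening one.
ROLE_TAG = re.compile(r"<(?P<role>[a-z_]+)>(?P<body>.*?)</(?P=role)>", re.DOTALL)


def split_roles(stream: str) -> List[Tuple[str, str]]:
    """Split one generation stream into ordered (role, text) segments."""
    return [(m.group("role"), m.group("body").strip()) for m in ROLE_TAG.finditer(stream)]


def next_tool_call(
    stream: str,
    tool_roles: Tuple[str, ...] = ("web_search", "crawl_page", "code"),  # assumed names
) -> Optional[Tuple[str, str]]:
    """Return the last segment if it is a tool role awaiting an observation."""
    segments = split_roles(stream)
    if segments and segments[-1][0] in tool_roles:
        return segments[-1]
    return None


if __name__ == "__main__":
    text = (
        "<think>I need a plan first.</think>\n"
        "<plan>1. Search. 2. Verify.</plan>\n"
        "<web_search>capital of France</web_search>"
    )
    print(split_roles(text))
    print(next_tool_call(text))
```

In such a loop, the caller would execute the returned tool call, append the result to the context as an observation, and resume decoding, so the role switch itself costs no extra framework round-trip.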

Modeling

Base Model: Qwen-2.5-Instruct family (3B, 7B, 32B variants)

Training Method: Two-stage process: Agentic Supervised Fine-Tuning (SFT) followed by Agentic Reinforcement Learning (RL)

Objective Functions:

Purpose: Minimize the negative log-likelihood of the agent trajectory during SFT.

Formally: L_SFT = -∑_t log π_θ(τ_t | τ_<t, q)

Purpose: Optimize the policy for correct answers using RL.

Formally: R_web(τ) = score_answer (binary correctness via LLM-judge)

Purpose: Optimize the policy for code/math with format constraints.

Formally: R_code(τ) = score_answer * score_format
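
The three objectives above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the LLM judge is replaced by a boolean passed in by the caller, and the format check in `score_format` (exactly one `<answer>` block) is a hypothetical stand-in for whatever format criterion the authors use.

```python
import re
from typing import Sequence


def sft_loss(token_logprobs: Sequence[float]) -> float:
    """L_SFT = -sum_t log pi_theta(tau_t | tau_<t, q), given per-token log-probs."""
    return -sum(token_logprobs)


def reward_web(judge_says_correct: bool) -> float:
    """R_web = score_answer: binary correctness (the judge verdict is supplied by the caller)."""
    return 1.0 if judge_says_correct else 0.0


def score_format(trajectory: str) -> float:
    """Hypothetical format check: the trajectory has exactly one <answer> block."""
    answers = re.findall(r"<answer>.*?</answer>", trajectory, re.DOTALL)
    return 1.0 if len(answers) == 1 else 0.0


def reward_code(answer_correct: bool, trajectory: str) -> float:
    """R_code = score_answer * score_format: zero unless both terms are 1."""
    return (1.0 if answer_correct else 0.0) * score_format(trajectory)


if __name__ == "__main__":
    print(sft_loss([-1.0, -2.5]))
    print(reward_code(True, "<think>2+2</think><answer>4</answer>"))
    print(reward_code(True, "4"))
```

Because R_code is a product, a correct answer in a malformed trajectory earns nothing, which pushes the policy to keep its output parseable.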

Training Data:

SFT: ~16.4k high-quality trajectories distilled from OAgents (Web Agent + MHQA datasets)
RL: ~180k prompts (MHQA + Web Agent), filtered to exclude trivial questions (rq > 0.3) and very hard questions
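
A rough sketch of the RL prompt filter, with two loud assumptions. First, `rq` is read here as the fraction of pre-RL rollouts that answer a prompt correctly; the summary does not define it. Second, the cutoff for "very hard" is not stated, so `hard_at_or_below` is a placeholder parameter (defaulting to "never solved") rather than a reported value.

```python
from typing import Sequence


def pass_rate(rollout_correct: Sequence[bool]) -> float:
    """Fraction of rollouts that produced a correct answer (assumed meaning of rq)."""
    if not rollout_correct:
        raise ValueError("need at least one rollout")
    return sum(rollout_correct) / len(rollout_correct)


def keep_for_rl(
    rollout_correct: Sequence[bool],
    trivial_above: float = 0.3,     # from the summary: rq > 0.3 is dropped as trivial
    hard_at_or_below: float = 0.0,  # placeholder: the summary gives no hard cutoff
) -> bool:
    """Keep prompts that are neither trivial nor (by the placeholder rule) too hard."""
    rq = pass_rate(rollout_correct)
    return hard_at_or_below < rq <= trivial_above


if __name__ == "__main__":
    print(keep_for_rl([True] + [False] * 7))      # rq = 0.125
    print(keep_for_rl([True] * 4 + [False] * 4))  # rq = 0.5
    print(keep_for_rl([False] * 8))               # rq = 0.0
```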

Key Hyperparameters:

learning_rate: 1.4e-5
batch_size: 256
epochs: 2.5
optimizer: AdamW with cosine decay
rl_rollouts: 8 rollouts per prompt
rl_max_steps: 24 steps
rl_max_tokens: 32k tokens

Compute: Not reported in the paper

Comparison to Prior Work

vs. Search-R1/WebThinker: CoA supports dynamic multi-agent role-playing (Plan, Reflect, Verify) rather than just a fixed ReAct loop
vs. OAgents/OWL: CoA runs as a single model end-to-end, removing communication overhead and enabling direct gradient optimization of the 'system' behavior
vs. Agent-FLAN [not cited in paper]: Agent-FLAN focuses on alignment data composition for general agent skills; CoA focuses specifically on distilling multi-agent trajectories into a single model architecture

Limitations

Relies on the quality of the 'teacher' multi-agent system (OAgents) for distillation data
Web agent reward function depends on LLM-as-a-Judge, which may introduce bias or instability
RL training filters out 'trivial' questions, potentially affecting performance on simple tasks if not balanced
Requires large context windows (32k+) to handle the verbose multi-agent trajectories

Reproducibility

Code: https://github.com/Opus-Force-Agent/Chain-of-Agents (publicly available)

The release includes model weights, code, and training data. The distillation source (OAgents) is also open-source. RL training uses the VeRL framework.

📊 Experiments & Results

Evaluation Setup

Evaluated on both Web Agent tasks (QA, browsing) and Code Agent tasks (coding, math) using the Qwen-2.5 family.

Benchmarks:

GAIA (General AI Assistants; web and tool use)
BrowseComp (Advanced web navigation)
HLE (Humanity's Last Exam; frontier academic problem-solving)
AIME 2025 (Mathematical reasoning)
LiveCodeBench v5 (Code generation)

Metrics:

Pass@1 (Success Rate)
Exact Match (EM)
LLM-as-a-Judge Accuracy
Statistical methodology: Not explicitly reported in the paper
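
For reference, Pass@1 and Exact Match from the metrics list can be computed as in the sketch below. The normalization in `normalize` (lowercasing and whitespace collapsing) is an assumption; the benchmarks' official scorers may normalize answers differently.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold answer are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def pass_at_1(first_attempt_correct: Sequence[bool]) -> float:
    """Percentage of problems whose first attempt is correct."""
    if not first_attempt_correct:
        raise ValueError("need at least one problem")
    return 100.0 * sum(first_attempt_correct) / len(first_attempt_correct)


if __name__ == "__main__":
    preds = ["  Paris ", "Berlin", "Rome", "madrid"]
    golds = ["paris", "Bonn", "Rome", "Madrid"]
    results = [exact_match(p, g) for p, g in zip(preds, golds)]
    print(pass_at_1(results))
```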

Key Results

AFM establishes new SOTA performance on web agent benchmarks, outperforming both tool-integrated baselines and traditional multi-agent frameworks.
Benchmark	Metric	Baseline	This Paper	Δ
GAIA	Average Score	53.2	55.3	+2.1
BrowseComp	Success Rate	10.5	11.1	+0.6
HLE	Success Rate	15.8	18.0	+2.2
AIME 2025	Solve Rate	49.3	59.8	+10.5
Inference Cost Analysis	Token Consumption	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main Takeaways

AFM consistently outperforms specialized Tool-Integrated Reasoning (TIR) methods across diverse benchmarks (GAIA, HLE, BrowseComp) using the same backbone size.
The Multi-Agent Distillation framework effectively transfers the capabilities of complex systems (like OAgents) into a single model, overcoming the 'Tool Coordination Dilemma'.
Agentic RL further enhances performance, particularly on difficult queries where tool-based reasoning provides substantial value.
The approach is highly efficient, drastically reducing token usage compared to multi-agent systems by eliminating redundant inter-agent communication overhead.

📚 Prerequisite Knowledge

Prerequisites

Understanding of ReAct (Reasoning + Acting) prompting
Familiarity with Multi-Agent Systems (MAS) and inter-agent communication
Knowledge of Reinforcement Learning (RL) for LLMs, specifically PPO-style policy optimization variants such as DAPO

Key Terms

CoA: Chain-of-Agents—a paradigm where a single model simulates multi-agent collaboration by dynamically activating different agent roles (e.g., Plan Agent, Search Agent) within one inference stream

AFM: Agent Foundation Model—the resulting model trained via CoA that supports native end-to-end complex problem solving

TIR: Tool-Integrated Reasoning—models trained to explicitly use tools (think-action-observation) but typically limited to a single agent perspective

Multi-Agent Distillation: The process of recording trajectories from a complex multi-agent system (like OAgents) and converting them into a linear sequence for supervised fine-tuning

DAPO: Decoupled Clip and Dynamic Sampling Policy Optimization, an RL algorithm used here to optimize the agent policy

ReAct: Reasoning and Acting—a framework where LLMs generate reasoning traces and task-specific actions in an interleaved manner

OAgents: A state-of-the-art open-source multi-agent framework used as the 'teacher' system for generating distillation data

Pass@1: A metric measuring the percentage of problems where the model's first attempt is correct

GRM: Generative Reward Model—used here to assess credibility scores for error-correction filtering