AOrchestra: Automating Sub-Agent Creation for Agentic Orchestration

📝 Paper Summary

Sub-agents-as-tools paradigm, Agentic Orchestration

AOrchestra employs a central orchestrator that dynamically creates specialized sub-agents on the fly using a unified 4-tuple abstraction (Instruction, Context, Tools, Model) to solve complex long-horizon tasks.

Core Problem

Existing agentic systems rely on static sub-agent roles or rigid context-isolation threads, which lack the flexibility to handle the dynamic variety of subtasks in open-ended environments.

Why it matters:

Fixed roles require heavy human engineering and cannot cover emergent subtasks in open environments
Simple context isolation fails to specialize agent capabilities (tools and models) for specific subtasks
Lack of control over context routing leads to noisy over-sharing or harmful omission of critical information

Concrete Example: In a coding task that requires both file navigation and code editing, a static 'Coder' role might be overwhelmed by the context of a huge codebase. AOrchestra instead spawns a 'Navigator' sub-agent with only file-system tools and the relevant context, then spawns an 'Editor' sub-agent with only the necessary file content and editing tools.

Key Novelty

On-demand Sub-Agent Specialization via 4-Tuple Abstraction

Models any agent as a dynamically instantiable tuple of <Instruction, Context, Tools, Model>, treating agents as compositional recipes rather than fixed roles
Decouples orchestration from execution: the Orchestrator does not execute tasks but focuses solely on synthesizing this 4-tuple to spawn disposable, specialized executors
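
As a rough illustration of the 4-tuple idea, the sketch below models a sub-agent recipe as an immutable record and instantiates the 'Navigator' and 'Editor' agents from the concrete example. The class name AgentSpec, the field names, the tool names, and the model identifiers are all hypothetical and are not taken from the AOrchestra codebase.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AgentSpec:
    """One sub-agent recipe: <Instruction, Context, Tools, Model> (hypothetical type)."""
    instruction: str          # what the sub-agent must accomplish
    context: str              # curated information handed down by the orchestrator
    tools: Tuple[str, ...]    # names of the environment tools it may call
    model: str                # identifier of the backing model


# Illustrative values only; none of these strings come from the paper.
navigator = AgentSpec(
    instruction="Locate the file that defines the failing function.",
    context="Repository root: ./project; failing test: test_parse",
    tools=("list_dir", "read_file"),
    model="small-cheap-model",
)
editor = AgentSpec(
    instruction="Fix the bug in parse().",
    context="Contents of ./project/parser.py",
    tools=("read_file", "write_file"),
    model="large-strong-model",
)

# Each recipe carries only what its subtask needs.
navigator_can_write = "write_file" in navigator.tools
editor_can_write = "write_file" in editor.tools
```

Treating the recipe as plain data is what makes agents "compositional": the same executor code can be reused while only the four fields change.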

Architecture

Comparison of AOrchestra's on-demand specialization vs. static roles/context isolation, and the detailed workflow of the Orchestrator delegating to Executors.

Evaluation Highlights

+16.28% relative improvement over the strongest baseline (OpenHands) on GAIA, SWE-Bench Verified, and Terminal-Bench 2.0 when paired with Gemini-3-Flash
Supervised Fine-Tuning (SFT) of the Orchestrator improves pass@1 on GAIA by 11.51 percentage points over the base model (46.34 to 57.85)
Cost-aware routing optimization via in-context learning reduces average cost on GAIA by 18.5% while improving pass@1 by 3.03 percentage points (46.34 to 49.37)

Breakthrough Assessment

8/10

A strong conceptual shift from static roles to dynamic agent instantiation. The paper reports significant performance gains on top-tier benchmarks (GAIA, SWE-Bench) and demonstrates that orchestration can be learned to improve cost-efficiency.

⚙️ Technical Details

Problem Definition

Setting: Multi-turn agentic task solving where an orchestrator delegates subtasks to sub-agents via tool calls to maximize success and minimize cost

Inputs: User goal G and environment state s_t

Outputs: Actions a_t (either environment interactions via sub-agents or final answer)
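
One hedged way to write this setting down as an objective is shown below. The scalarization with a trade-off weight \lambda, the success indicator, and the trajectory symbol \tau are assumptions introduced here for illustration; the summary only states that success should be maximized and cost minimized.

```latex
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\Big[\, \mathbb{1}\big[\mathrm{success}(G, \tau)\big] \;-\; \lambda \, \mathrm{Cost}(\tau) \,\Big],
\qquad \tau = (s_0, a_0, s_1, a_1, \dots)
```

Here \pi is the orchestrator policy mapping (G, s_t) to a_t, and \mathrm{Cost}(\tau) would be the total USD spent by the orchestrator and all sub-agents along the trajectory.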

Pipeline Flow

Orchestrator analyzes state -> Calls Delegate(Instruction, Context, Tools, Model)
System spawns Sub-Agent based on tuple -> Sub-Agent executes task -> Returns result
Orchestrator integrates result -> Repeats or calls Finish()
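
The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the Delegate and Finish classes mirror the tool names in the text, while run_orchestration, the policy and spawn callables, and the toy functions are hypothetical stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Delegate:
    """System tool call that creates a sub-agent from a 4-tuple."""
    instruction: str
    context: str
    tools: List[str]
    model: str


@dataclass
class Finish:
    """System tool call that ends the episode with a final answer."""
    answer: str


Action = Union[Delegate, Finish]


def run_orchestration(
    goal: str,
    policy: Callable[[str, List[str]], Action],   # orchestrator: (goal, results so far) -> action
    spawn: Callable[[Delegate], str],              # runs one disposable sub-agent, returns its result
    max_steps: int = 10,
) -> str:
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(goal, history)
        if isinstance(action, Finish):
            return action.answer
        # The orchestrator never touches the environment; the sub-agent does.
        history.append(spawn(action))
    raise RuntimeError("step budget exhausted without Finish()")


# Toy stand-ins so the sketch runs without any model.
def toy_policy(goal: str, history: List[str]) -> Action:
    if len(history) == 0:
        return Delegate("find file", goal, ["list_dir"], "model-a")
    if len(history) == 1:
        return Delegate("edit file", history[-1], ["write_file"], "model-b")
    return Finish(answer=" | ".join(history))


def toy_spawn(call: Delegate) -> str:
    return f"{call.model} did '{call.instruction}'"


answer = run_orchestration("fix bug", toy_policy, toy_spawn)
```

Note that the second Delegate call passes only the previous result as context, which is one simple form of the context curation the Orchestrator is responsible for.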

System Modules

Orchestrator

Decomposes global goal, curates context, selects tools/models, and creates sub-agents

Model or implementation: Gemini-3-Flash (base), also tested with GPT-4o

Executor (Sub-Agent)

Executes specific subtasks using assigned tools and context

Model or implementation: Dynamically selected (M_t) from available models

Novel Architectural Elements

Dynamic 4-tuple instantiation mechanism: Agents are created on-the-fly via arguments to a 'Delegate' tool rather than pre-loaded
Strict separation of System Tools (Orchestration) vs. Environment Tools (Execution) enforced by the architecture
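
The separation of system tools from environment tools could be enforced with a simple check at delegation time. The sketch below is an assumption about how such a guard might look; validate_tool_grant and the environment tool names are hypothetical, and only the system tool names Delegate and Finish come from the text.

```python
from typing import Iterable, List

# System tools are reserved for the Orchestrator.
SYSTEM_TOOLS = frozenset({"Delegate", "Finish"})


def validate_tool_grant(requested: Iterable[str], environment_tools: Iterable[str]) -> List[str]:
    """Return the tools a sub-agent may receive, or raise if the grant is invalid."""
    requested_set = set(requested)
    leaked = requested_set & SYSTEM_TOOLS
    if leaked:
        # A sub-agent must not be able to spawn agents or end the episode.
        raise PermissionError(f"system tools cannot be delegated: {sorted(leaked)}")
    unknown = requested_set - set(environment_tools)
    if unknown:
        raise KeyError(f"unknown environment tools: {sorted(unknown)}")
    return sorted(requested_set)


# Hypothetical environment tool registry.
ENV_TOOLS = ["list_dir", "read_file", "write_file", "run_shell"]
granted = validate_tool_grant(["read_file", "list_dir"], ENV_TOOLS)
```

Rejecting Delegate in a sub-agent's tool list keeps orchestration in a single place; whether AOrchestra allows nested delegation is not stated in this summary.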

Modeling

Base Model: Gemini-3-Flash (primary), GPT-4o, Claude-3.5-Sonnet (for comparisons)

Training Method: Supervised Fine-Tuning (SFT) for task orchestration; In-Context Learning for instruction optimization

Training Data:

Collected expert orchestration trajectories (s_t, a*_t) for SFT
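
Pairs of the form (s_t, a*_t) are the usual input to a behavior-cloning objective. As a sketch, and assuming the standard SFT negative log-likelihood loss (the summary does not state the exact loss used), with \pi_\theta denoting the Orchestrator policy and \mathcal{D} the set of expert trajectories (both symbols introduced here for illustration):

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(s_t,\, a^*_t) \sim \mathcal{D}}
    \left[ \log \pi_\theta\!\left(a^*_t \mid s_t\right) \right]
```

Under this reading, a*_t is the expert's Delegate(Instruction, Context, Tools, Model) or Finish() call at step t, so the model is trained to reproduce the full 4-tuple rather than only the subtask description.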

Compute: Not reported in the paper

Comparison to Prior Work

vs. Claude Code: AOrchestra creates agents dynamically based on task needs rather than using fixed pre-defined specialists [not cited in paper]
vs. THREAD: AOrchestra emphasizes full specialization of the 4-tuple (including model routing and explicit tool selection), not just recursive decomposition
vs. MetaGPT: AOrchestra creates workflows on-the-fly via the orchestrator, avoiding rigid standard operating procedures (SOPs)

Limitations

Heavy reliance on the Orchestrator's ability to accurately decompose tasks; failure in delegation leads to task failure
Cost-performance trade-off relies on accurate routing, which may be unstable in zero-shot settings without learning
Latency may increase due to the overhead of context curation and sub-agent instantiation steps

Reproducibility

Code: https://github.com/FoundationAgents/AOrchestra

Specific hyperparameters for SFT and instruction optimization steps are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Agentic task solving across coding, digital world, and shell environments

Benchmarks:

GAIA: General AI Assistants (digital world tasks)
SWE-Bench Verified: Software Engineering (coding)
Terminal-Bench 2.0: Bash/Shell environment tasks

Metrics:

Pass@1 (Success Rate)
Cost (USD)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main training-free results comparing AOrchestra against baselines using Gemini-3-Flash.
Benchmark	Metric	Baseline	This Paper	Δ
GAIA	Pass@1	39.88	46.34	+6.46
SWE-Bench Verified	Pass@1	43.60	50.40	+6.80
Terminal-Bench 2.0	Pass@1	39.60	46.20	+6.60

Learnable Orchestrator results showing improvements from Supervised Fine-Tuning (SFT) and In-Context Learning (ICL).
Benchmark	Metric	Baseline	This Paper	Δ
GAIA (SFT)	Pass@1	46.34	57.85	+11.51
GAIA (ICL)	Pass@1	46.34	49.37	+3.03
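
The Δ column reports absolute differences in Pass@1 (percentage points), which is easy to confuse with a relative improvement. The short calculation below derives both quantities from the training-free rows of the table; the variable names are illustrative, and no claim is made about how the paper aggregates its headline relative figure.

```python
# (baseline, this paper) Pass@1 values copied from the training-free rows.
results = {
    "GAIA": (39.88, 46.34),
    "SWE-Bench Verified": (43.60, 50.40),
    "Terminal-Bench 2.0": (39.60, 46.20),
}

absolute_gain = {}   # percentage points
relative_gain = {}   # percent of the baseline score

for benchmark, (baseline, ours) in results.items():
    absolute_gain[benchmark] = ours - baseline
    relative_gain[benchmark] = 100.0 * (ours - baseline) / baseline

for benchmark in results:
    print(
        f"{benchmark}: +{absolute_gain[benchmark]:.2f} points, "
        f"+{relative_gain[benchmark]:.1f}% relative"
    )
```

Each per-benchmark relative gain is the absolute gain divided by the baseline score, so the same number of points yields a larger relative gain on a benchmark with a lower baseline.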

Experiment Figures

Radar chart comparing AOrchestra vs. OpenHands, AutoGen, and ReAct across three benchmarks (Terminal, GAIA, SWE-Bench).

Main Takeaways

Consistent performance gains across diverse environments (Web, Code, Terminal) validate the framework-agnostic nature of the 4-tuple abstraction
The Orchestrator is learnable: SFT significantly boosts task decomposition capabilities (+11.5 percentage points on GAIA)
Dynamic model routing achieves a better Pareto frontier, reducing costs significantly (-18.5%) while maintaining or improving performance
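
To make the Pareto-frontier claim concrete, the sketch below filters a set of (cost, Pass@1) configurations down to the non-dominated ones. The function name pareto_frontier is hypothetical, and the four data points are made-up toy values, not results from the paper.

```python
from typing import List, Tuple

Config = Tuple[str, float, float]  # (name, average cost in USD, Pass@1)


def pareto_frontier(points: List[Config]) -> List[str]:
    """Return names of configs that no other config beats on both cost and score."""
    frontier = []
    for name, cost, score in points:
        dominated = any(
            # Another config is at least as good on both axes and strictly better on one.
            (c <= cost and s >= score) and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Toy numbers for illustration only.
toy_points = [
    ("A", 1.0, 40.0),
    ("B", 0.8, 42.0),   # cheaper and better than A
    ("C", 2.0, 50.0),
    ("D", 2.5, 49.0),   # pricier and worse than C
]
frontier = pareto_frontier(toy_points)
```

In this toy set, A and D are dominated, so only B and C remain. A routing policy that lowers cost while keeping or raising Pass@1 moves a system toward this frontier.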

📚 Prerequisite Knowledge

Prerequisites

Agentic workflows (Orchestrator-Workers)
Tool use in LLMs (Function calling)
Context management in long-horizon tasks
Supervised Fine-Tuning (SFT)

Key Terms

4-tuple abstraction: The unified representation of an agent as <Instruction, Context, Tools, Model>, used to instantiate sub-agents dynamically

Orchestrator: The central agent that plans subtasks and creates sub-agents but never directly interacts with the environment's primary tools

SFT: Supervised Fine-Tuning—training a model on labeled examples (expert trajectories) to improve specific behaviors

Pareto-efficient: A state where performance cannot be improved without increasing cost (or vice versa); AOrchestra aims for this balance in model routing

In-context learning: Optimizing model behavior by modifying the prompt/instruction based on past experiences without changing model weights

System tools: Tools available only to the Orchestrator (Delegate, Finish), distinct from environment tools available to sub-agents