Evaluation Setup
Agentic task solving across coding, digital world, and shell environments
Benchmarks:
- GAIA: General AI Assistants (digital world tasks)
- SWE-Bench Verified: software engineering (coding)
- Terminal-Bench 2.0: Bash/shell environment tasks
Metrics:
- Pass@1 (Success Rate)
- Cost (USD)
- Statistical methodology: Not explicitly reported in the paper
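As a rough illustration of the two metrics above, the sketch below scores a run in which each task is attempted once. The paper summary does not describe its scoring harness, so the `TaskResult` record, its field names, and the toy values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not a structure taken from the paper."""
    task_id: str
    solved: bool      # True if the single attempt passed the benchmark's checker
    cost_usd: float   # spend attributed to that attempt, in USD


def pass_at_1(results: list[TaskResult]) -> float:
    """Success rate in percent when every task gets exactly one attempt."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.solved for r in results) / len(results)


def total_cost(results: list[TaskResult]) -> float:
    """Total spend in USD across all attempts."""
    return sum(r.cost_usd for r in results)


# Toy values, invented for this example only.
toy_run = [
    TaskResult("t1", True, 0.12),
    TaskResult("t2", False, 0.30),
    TaskResult("t3", True, 0.08),
    TaskResult("t4", False, 0.10),
]

if __name__ == "__main__":
    print(f"Pass@1: {pass_at_1(toy_run):.2f}")
    print(f"Cost (USD): {total_cost(toy_run):.2f}")
```

With one attempt per task, Pass@1 reduces to the plain fraction of solved tasks, which is why the summary lists it alongside "Success Rate".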
Key Results
Main training-free results comparing AOrchestra against baselines using Gemini-3-Flash.
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GAIA | Pass@1 | 39.88 | 46.34 | +6.46 |
| SWE-Bench Verified | Pass@1 | 43.60 | 50.40 | +6.80 |
| Terminal-Bench 2.0 | Pass@1 | 39.60 | 46.20 | +6.60 |

Learnable Orchestrator results showing improvements from Supervised Fine-Tuning (SFT) and In-Context Learning (ICL).
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GAIA (SFT) | Pass@1 | 46.34 | 57.85 | +11.51 |
| GAIA (ICL) | Pass@1 | 46.34 | 49.37 | +3.03 |
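The Δ values in the results above are absolute differences in Pass@1 points, not relative percentages. The short check below recomputes them from the reported Baseline and This Paper values and also derives the relative gain for comparison. The row labels and function names are illustrative choices, not identifiers from the paper.

```python
# (label, baseline Pass@1, reported Pass@1), copied from the results in order.
rows = [
    ("GAIA, training-free", 39.88, 46.34),
    ("SWE-Bench Verified, training-free", 43.60, 50.40),
    ("Terminal-Bench 2.0, training-free", 39.60, 46.20),
    ("GAIA, learnable orchestrator (first row)", 46.34, 57.85),
    ("GAIA, learnable orchestrator (second row)", 46.34, 49.37),
]


def absolute_gain(baseline: float, new: float) -> float:
    """Difference in Pass@1 points, rounded as in the table."""
    return round(new - baseline, 2)


def relative_gain(baseline: float, new: float) -> float:
    """Improvement as a percentage of the baseline score."""
    return round(100.0 * (new - baseline) / baseline, 1)


if __name__ == "__main__":
    for label, baseline, new in rows:
        print(f"{label}: {absolute_gain(baseline, new):+.2f} points, "
              f"{relative_gain(baseline, new):+.1f}% relative")
```

Keeping the two notions separate matters when reading the takeaways: a gain of 11.51 points from a 46.34 baseline is a much larger relative improvement than the point figure suggests.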
Main Takeaways
- Pass@1 improves by 6.46 to 6.80 points on all three benchmarks, which span diverse environments (Web, Code, Terminal); this supports the framework-agnostic nature of the 4-tuple abstraction
- The Orchestrator is learnable: SFT significantly boosts task decomposition capabilities (+11.51 Pass@1 points on GAIA, from 46.34 to 57.85)
- Dynamic model routing reaches a better cost-performance Pareto frontier, cutting cost by 18.5% while maintaining or improving performance
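To make the Pareto-frontier claim concrete, here is a minimal sketch of how a cost-versus-accuracy frontier can be extracted from a set of configurations. The configuration names and all numbers are hypothetical placeholders, not results from the paper, and the function is a generic dominance filter rather than the authors' analysis code.

```python
def pareto_frontier(points):
    """Return the (name, cost, score) tuples that no other point dominates.

    A point is dominated if another point is at least as cheap and scores
    at least as high, with one of the two strictly better.
    """
    frontier = []
    # Sort by cost ascending; on ties, put the higher score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        # Keep a point only if it beats the best score among cheaper points.
        if not frontier or score > frontier[-1][2]:
            frontier.append((name, cost, score))
    return frontier


# Hypothetical (name, cost in USD, Pass@1) values for illustration only.
configs = [
    ("A", 1.00, 40.0),
    ("B", 0.80, 42.0),
    ("C", 1.50, 41.0),
    ("D", 2.00, 50.0),
]

if __name__ == "__main__":
    for name, cost, score in pareto_frontier(configs):
        print(f"{name}: cost={cost:.2f} USD, Pass@1={score:.1f}")
```

In this toy set, B dominates A and C because it is both cheaper and more accurate, while D stays on the frontier because no cheaper configuration matches its score. A routing policy has a better frontier when its points push this curve toward lower cost at equal or higher Pass@1.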