Evaluation Setup
Multi-step agent tasks involving tool use, web browsing, and research
Benchmarks:
- TAU-Bench (Tool-use agents; Retail and Airline domains)
- BrowseComp-Plus (Deep-research / Web browsing)
- WebArena (Web agents)
Metrics:
- Success Rate (SR)
- Reasoning Token Usage (Cost)
- Pass Rate
- Statistical methodology: Not explicitly reported in the paper
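As a rough illustration of how the success-rate and token-cost metrics relate, the sketch below computes both from per-episode logs. The `Episode` record, its field names, and the helper functions are assumptions made for this example; the paper's actual evaluation code and logging format are not described here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-task log record (field names are assumptions)."""
    success: bool          # did the agent complete the task?
    reasoning_tokens: int  # reasoning tokens spent over all steps


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def token_reduction_pct(baseline: Sequence[Episode],
                        routed: Sequence[Episode]) -> float:
    """Percentage reduction in total reasoning tokens relative to baseline."""
    base = sum(e.reasoning_tokens for e in baseline)
    new = sum(e.reasoning_tokens for e in routed)
    if base == 0:
        raise ValueError("baseline used no reasoning tokens")
    return 100.0 * (base - new) / base


# Toy data, not results from the paper.
baseline_runs = [Episode(True, 1000), Episode(False, 1000)]
routed_runs = [Episode(True, 400), Episode(True, 600)]
```

Under this definition a positive value means the routed agent used fewer reasoning tokens than the baseline, so a reduction figure is only meaningful when read alongside the success rate of the same runs.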
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| TAU-Bench | Reasoning Token Reduction (%) | 0.0 | 52.7 | -52.7 |
| gpt-oss-20b Agent (Intro Analysis) | Success Rate Drop | 0.0 | -20.0 | -20.0 |
Main Takeaways
- Ares successfully decouples reasoning effort from model selection, enabling significant token savings (up to 52.7%) without the latency overhead of model switching.
- The 'verify-then-label' data synthesis pipeline is crucial for identifying the true minimal effort required, as it isolates step-wise difficulty from trajectory error propagation.
- RL (GRPO) further optimizes the router beyond SFT by explicitly balancing the reward signals of task success and token cost.
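
The last two takeaways can be pictured with a small sketch: label each step with the cheapest effort level that still passes verification, then score router rollouts with a reward that trades task success against token cost. Everything specific here is an assumption for illustration: the three effort levels, the `verify` callback, the reward form `success - lam * cost`, and the `lam` and `budget` parameters are not taken from the paper. Only the group-relative normalization of rewards reflects how GRPO computes advantages in general.

```python
from statistics import mean, pstdev
from typing import Callable, List, Optional, Sequence

# Assumed effort levels, ordered from cheapest to most expensive.
EFFORT_LEVELS = ("low", "medium", "high")


def minimal_effort(verify: Callable[[str], bool],
                   levels: Sequence[str] = EFFORT_LEVELS) -> Optional[str]:
    """Return the cheapest level whose output passes `verify`, else None.

    `verify` is a hypothetical callback that re-runs a single step at the
    given effort level and checks it against an already verified reference,
    so that the label reflects the step's own difficulty.
    """
    for level in levels:
        if verify(level):
            return level
    return None


def reward(success: bool, tokens: int, budget: int, lam: float = 0.5) -> float:
    """Illustrative reward: task success minus a scaled token-cost penalty."""
    cost = min(tokens / budget, 1.0)  # normalized cost, capped at 1
    return (1.0 if success else 0.0) - lam * cost


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```

In this sketch, a rollout that succeeds with fewer tokens receives a higher reward than one that succeeds at greater cost, which is the kind of pressure that would push a router toward the minimal sufficient effort. The weight `lam` controls how much success the router is allowed to give up for token savings.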