Scaling Small Agents Through Strategy Auctions

📝 Paper Summary

Multi-agent coordination · Agentic workflow optimization · Model routing

SALE is a marketplace-inspired framework where heterogeneous agents bid with strategic plans to win tasks based on cost-value scoring, refining their bids over time using auction memory to improve small-agent performance.

Core Problem

Small agents perform well on simple tasks but degrade on complex long-horizon ones, while always using large agents is cost-inefficient; existing predictive routers struggle with agentic workflows.

Why it matters:

Applying large models to every task is prohibitively expensive for long-horizon agentic workflows involving thousands of tokens.
Simple routing based on task descriptions fails because short prompts don't capture the complexity of the required reasoning trajectory.
Static routers do not allow smaller agents to improve or adapt to the workload distribution over time.

Concrete Example: On a complex coding task taking humans ~1 hour, the smallest agent (4B) achieves only ~17% of the largest agent's success rate. However, a router that just picks the largest model wastes money on the ~92% of simple tasks that the 4B model could solve perfectly well.

Key Novelty

Strategy Auctions for Workload Efficiency (SALE)

Agents 'bid' for tasks by generating short strategic plans rather than full solutions; these plans are scored for cost (length) and value (entropy + peer/self-review).
Uses an auction mechanism where smaller agents can 'upskill' by retrieving past successful strategies from a shared memory and refining their bids before the final winner is chosen.
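
The cost-value scoring of bids described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Bid` fields, the equal weights, the use of whitespace word count as plan length, and the per-token price multiplier are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str              # agent label, e.g. "4B"
    plan: str               # short strategic plan submitted as the bid
    entropy: float          # assumed pre-computed entropy signal for the plan
    review: float           # assumed peer/self-review score in [0, 1]
    price_per_token: float  # hypothetical price used to turn length into cost


def cost(bid: Bid) -> float:
    # Cost proxy: plan length (whitespace-separated words) times a hypothetical price.
    return len(bid.plan.split()) * bid.price_per_token


def value(bid: Bid, w_entropy: float = 0.5, w_review: float = 0.5) -> float:
    # Value proxy: weighted sum of the entropy and review signals (weights are assumed).
    return w_entropy * bid.entropy + w_review * bid.review


def pick_winner(bids: list[Bid]) -> Bid:
    # The winning bid minimizes cost-minus-value.
    return min(bids, key=lambda b: cost(b) - value(b))


bids = [
    Bid("4B", "search then verify answer", entropy=0.6, review=0.7, price_per_token=0.01),
    Bid("32B", "search widely compare sources then verify answer twice",
        entropy=0.7, review=0.8, price_per_token=0.05),
]
winner = pick_winner(bids)
```

In this toy setting the larger agent's bid has slightly higher value, but its higher cost lets the smaller agent's bid win.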

Evaluation Highlights

Reduces reliance on the largest agent by 53% and overall cost by 35% across deep search and coding tasks compared to using the largest agent alone.
Consistently improves upon the largest agent's pass@1 accuracy (+3.5% on deep search, +2.7% on coding) despite lower costs.
Outperforms established predictive routers (Willingness-to-Pay, CARROT) which either fail to reduce costs significantly or degrade performance on complex tasks.

Breakthrough Assessment

8/10

Strong conceptual novelty in applying auction theory to agent routing. Demonstrates that small agents can be 'scaled up' at test time via strategy refinement, shifting the Pareto frontier beyond any single model.

⚙️ Technical Details

Problem Definition

Setting: Task allocation in a heterogeneous multi-agent system to minimize cost-minus-value

Inputs: Natural language task t and a set of available agents A = {a_1, ..., a_n}

Outputs: Selection of a winning agent a_i and execution of their strategy s_i to produce a final trace

Pipeline Flow

Strategy Bidding (Agents generate plans) → Cost & Value Assignment (Scoring) → Provisional Winner Selection → Strategy Refinement (Memory Retrieval & Re-bidding) → Final Selection → Execution
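
The pipeline above can be read as the following control flow. This is a sketch under stated assumptions: `propose`, `score`, `refine`, and `execute` are hypothetical callables standing in for the bidding agents, jury, memory-backed refinement, and executor, and letting every agent other than the provisional winner re-bid is an assumption (the paper may restrict refinement to cheaper agents).

```python
def run_auction(task, agents, propose, score, refine, execute):
    """Run one auction round; all callables are hypothetical stand-ins."""
    # 1. Strategy bidding: each agent proposes a plan without executing it.
    bids = {a: propose(a, task) for a in agents}
    # 2. Cost & value assignment: score(a, plan) returns cost-minus-value (lower is better).
    scores = {a: score(a, bids[a]) for a in agents}
    # 3. Provisional winner selection.
    provisional = min(scores, key=scores.get)
    # 4. Strategy refinement: other agents re-bid and keep the new plan only if it scores better.
    for a in agents:
        if a != provisional:
            new_plan = refine(a, task, bids[a])
            new_score = score(a, new_plan)
            if new_score < scores[a]:
                bids[a], scores[a] = new_plan, new_score
    # 5. Final selection, then 6. execution of the winning strategy.
    winner = min(scores, key=scores.get)
    return winner, execute(winner, task, bids[winner])


# Toy usage with stub callables (all values are made up for illustration).
plans = {"4B": "short plan", "32B": "long detailed plan"}
base_score = {"4B": 0.5, "32B": 0.2}

def propose(agent, task):
    return plans[agent]

def score(agent, plan):
    # Pretend a refined plan lowers cost-minus-value by 0.4.
    return base_score[agent] - (0.4 if plan.endswith("(refined)") else 0.0)

def refine(agent, task, plan):
    return plan + " (refined)"

def execute(agent, task, plan):
    return f"{agent} ran: {plan}"

winner, trace = run_auction("toy task", ["4B", "32B"], propose, score, refine, execute)
```

Here the 32B agent wins provisionally, but the 4B agent's refined bid overtakes it before final selection.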

System Modules

Bidding Agents

Generate initial strategic plans (s_{t,i}) for the task without executing full traces

Model or implementation: Qwen3 family (4B, 8B, 14B, 32B)

Jury

Score the quality of proposed strategies

Model or implementation: Ensemble of all agents A

Auction Memory

Store past winning/losing strategies to help smaller agents refine bids

Model or implementation: Vector database (implied)

Executor

Execute the final selected agent with its winning strategy

Model or implementation: Selected Qwen3 agent

Novel Architectural Elements

Auction-based routing mechanism where 'bids' are natural language strategies rather than confidence scores
Feedback loop where auction outcomes (wins/losses) are stored in memory to refine future bids of cheaper agents at test time
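
One way to picture the auction memory behind this feedback loop is the sketch below. It is an illustration only: the `AuctionMemory` class, its fields, and the word-overlap (Jaccard) similarity are assumptions, with the similarity function standing in for whatever embedding-based retrieval the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class AuctionMemory:
    entries: list = field(default_factory=list)

    def add(self, task, agent, strategy, won):
        # Record the outcome of one bid so later auctions can reuse it.
        self.entries.append(
            {"task": task, "agent": agent, "strategy": strategy, "won": won}
        )

    @staticmethod
    def _similarity(a, b):
        # Jaccard overlap of lowercase word sets (a stand-in for embedding similarity).
        sa, sb = set(a.lower().split()), set(b.lower().split())
        union = sa | sb
        return len(sa & sb) / len(union) if union else 0.0

    def retrieve(self, task, k=1, winners_only=True):
        # Return the k stored entries whose task text is most similar to the new task.
        pool = [e for e in self.entries if e["won"] or not winners_only]
        ranked = sorted(pool, key=lambda e: self._similarity(task, e["task"]), reverse=True)
        return ranked[:k]


memory = AuctionMemory()
memory.add("fix off-by-one bug in loop", "32B", "reproduce, add test, patch", won=True)
memory.add("find capital of France", "4B", "single search", won=True)
memory.add("fix bug in parser loop", "4B", "guess a patch", won=False)

hints = memory.retrieve("fix bug in parser loop")
```

A smaller agent could prepend the retrieved strategies to its prompt when re-bidding; how the paper injects them is not specified in this summary.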

Modeling

Base Model: Qwen3 (4B, 8B, 14B, 32B variants)

Training Method: Test-time optimization via scoring weights

Objective Functions:

Purpose: Minimize the worst-case cost-minus-value over a training set of tasks.

Formally: min_{w,x,Q} Q s.t. z_t <= Q for every training task t, where z_t is the cost-minus-value of the strategy chosen for task t.
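
The constrained form is the standard epigraph rewriting of a min-max problem: bounding every z_t by an auxiliary variable Q and minimizing Q is equivalent to minimizing the largest z_t. The notation below is a sketch; the symbol T for the training set is an assumption, and this summary does not define w and x (reading them as scoring weights and selection variables is a guess).

```latex
\min_{w,\,x}\ \max_{t \in T}\ z_t(w, x)
\quad\Longleftrightarrow\quad
\min_{w,\,x,\,Q}\ Q
\quad \text{s.t.} \quad z_t(w, x) \le Q \quad \forall\, t \in T
```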

Key Hyperparameters:

inference_decoding: Greedy
input_to_output_token_ratio_assumption: 4:1
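
One plausible reading of the 4:1 input-to-output token ratio is that it is used to blend separate input and output prices into a single price per million tokens. The sketch below shows that arithmetic; the function name and the example prices are hypothetical, and the paper may apply the ratio differently.

```python
def blended_price(price_in, price_out, ratio_in=4, ratio_out=1):
    """Blend input/output prices ($ per million tokens) using a token ratio."""
    total = ratio_in + ratio_out
    return (ratio_in * price_in + ratio_out * price_out) / total


# Hypothetical prices: $0.10 per million input tokens, $0.50 per million output tokens.
example = blended_price(0.10, 0.50)
```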

Compute: Negligible overhead (hundreds of tokens) compared to full trace execution (thousands/millions of tokens)

Comparison to Prior Work

vs. WTP/CARROT/TO-Router: SALE routes based on generated strategic plans rather than just task descriptions, capturing reasoning complexity better.
vs. FrugalGPT: SALE selects the agent *before* full execution (predictive) based on strategy bids, avoiding the cost of running multiple full trajectories.
vs. AutoGen [not cited in paper]: SALE focuses on economic/auction-based selection logic rather than conversational turn-taking patterns.

Limitations

Relies on the assumption that plan quality correlates strongly with execution quality.
Requires a diverse pool of agents to function effectively (studied here with 4 sizes of Qwen3).
Performance gains depend on the existence of complementary failure modes between agents.
Auction overhead, while small relative to execution, adds latency compared to a single-model call.

Reproducibility

Benchmark dataset (HST-Bench) and method details provided. Models are open-weights (Qwen3). Exact code URL not provided in paper text. Paper relies on Agent Research Environment (ARE) framework.

📊 Experiments & Results

Evaluation Setup

Agentic workflows in Deep Search and Coding environments

Benchmarks:

HST-Bench (Deep Search): Complex QA/reasoning (SimpleQA, PopQA, HotpotQA, GAIA, Humanity's Last Exam) [New]
HST-Bench (Coding): Code generation (MBPP, LeetCode, custom) [New]

Metrics:

Pass@1
Price per million tokens ($/Mt)
Trace length
Statistical methodology: Averages reported over 5 independent random permutations of the test set, because results depend on the order in which tasks populate the memory
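
Averaging over random task orders can be sketched as follows. This is an illustration only: `evaluate` is a hypothetical callable that runs the whole system on an ordered task list and returns a metric such as pass@1, and the seed handling is an assumption.

```python
import random
from statistics import mean


def average_over_permutations(tasks, evaluate, n_perm=5, seed=0):
    """Average a metric over n_perm independent random orderings of the tasks."""
    rng = random.Random(seed)
    per_run = []
    for _ in range(n_perm):
        order = list(tasks)   # copy so the caller's list is not mutated
        rng.shuffle(order)
        per_run.append(evaluate(order))
    return mean(per_run), per_run


# Toy order-dependent metric: 1.0 if task "t1" happens to come first, else 0.0.
tasks = ["t1", "t2", "t3"]
avg, per_run = average_over_permutations(tasks, lambda order: float(order[0] == "t1"))
```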

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Deep Search results: SALE improves accuracy while reducing costs compared to the best single agent (32B).
Deep Search (All)	Pass@1	63.8	67.3	+3.5
Deep Search (All)	$/Mt	0.36	0.21	-0.15
Coding results: SALE achieves higher accuracy at lower cost than the best single agent.
Coding (All)	Pass@1	58.4	61.1	+2.7
Coding (All)	$/Mt	0.36	0.27	-0.09
Comparison against other routers shows SALE is the only one to simultaneously improve accuracy and reduce cost.
Deep Search (All)	Pass@1	63.0	67.3	+4.3
Coding (All)	Pass@1	50.4	61.1	+10.7
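
As a quick arithmetic check, the Δ column for the single-agent comparison can be recomputed from the Baseline and This Paper columns, together with the relative change. The helper name is hypothetical; the numbers are copied from the table above.

```python
rows = [
    ("Deep Search (All)", "Pass@1", 63.8, 67.3),
    ("Deep Search (All)", "$/Mt", 0.36, 0.21),
    ("Coding (All)", "Pass@1", 58.4, 61.1),
    ("Coding (All)", "$/Mt", 0.36, 0.27),
]


def deltas(baseline, ours):
    # Absolute difference (as in the Δ column) and relative change in percent.
    absolute = round(ours - baseline, 2)
    relative_pct = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative_pct


summary = {(bench, metric): deltas(base, ours) for bench, metric, base, ours in rows}
```

The relative cost changes here are per domain; the 35% figure in Evaluation Highlights is reported across both domains, so the two need not coincide.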

Main Takeaways

Small agents' performance degrades severely as task complexity (human solution time) increases, dropping to ~17-25% relative to large agents on hardest tasks.
SALE extends the Pareto frontier, achieving better accuracy/cost trade-offs than any single model or existing router.
The memory mechanism allows the smallest agent (4B) to increase its workload share over time (e.g., from 1.4% to 5.3% in coding), effectively 'scaling up' through experience.
Qualitative analysis shows failures are complementary: large agents sometimes over-complicate simple tasks, while small agents (via refined plans) stick to reliable tool use.

📚 Prerequisite Knowledge

Prerequisites

LLM-based agents (reasoning, tool use)
Auction theory / Mechanism design concepts
Model routing / cascading

Key Terms

pass@1: A metric measuring the percentage of tasks where the model's first generated solution is correct

strategy: A high-level plan or outline generated by an agent before attempting to solve a task, used here as a 'bid'

Shapley value: A concept from cooperative game theory used to attribute the marginal contribution of each agent to the total system performance

entropy: A measure of randomness/information content; here used as a heuristic for strategy quality (higher entropy in reasoning often correlates with better information)

greedy decoding: A decoding strategy where the model always selects the highest-probability next token

Pareto frontier: The set of optimal trade-offs where no metric (e.g., cost) can be improved without sacrificing another (e.g., accuracy)

Qwen3: A family of open-weight language models ranging from smaller (4B) to larger (32B) parameter counts used as the agent backbone

HST-Bench: Human Solution Time Benchmark, a dataset proposed in this paper using human solution time as a proxy for task complexity
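
To make the Pareto frontier term concrete, the sketch below filters a set of (cost, accuracy) points down to the non-dominated ones, where lower cost and higher accuracy are both better. The two Deep Search points reuse numbers from the results table; the third point and the function name are hypothetical.

```python
def pareto_front(points):
    """points: dict mapping a system name to (cost, accuracy)."""
    front = []
    for name, (c, a) in points.items():
        # A point is dominated if another is at least as good on both axes
        # and strictly better on at least one.
        dominated = any(
            (c2 <= c and a2 >= a) and (c2 < c or a2 > a)
            for other, (c2, a2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


points = {
    "32B only": (0.36, 63.8),           # Deep Search baseline from the table
    "SALE": (0.21, 67.3),               # Deep Search result from the table
    "hypothetical small": (0.05, 40.0)  # made-up point for illustration
}
front = pareto_front(points)
```

In this toy set the 32B-only point is dominated by SALE (cheaper and more accurate), while the cheap hypothetical point stays on the frontier because nothing beats it on cost.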