pass@1: A metric measuring the percentage of tasks where the model's first generated solution is correct
strategy: A high-level plan or outline generated by an agent before attempting to solve a task, used here as a 'bid'
Shapley value: A concept from cooperative game theory, used here to attribute to each agent its share of total system performance based on its average marginal contribution
entropy: A measure of randomness or information content; used here as a heuristic for strategy quality, on the premise that higher-entropy reasoning often carries more information
greedy decoding: A decoding strategy where the model always selects the highest-probability next token
Pareto frontier: The set of optimal trade-offs where no metric (e.g., cost) can be improved without sacrificing another (e.g., accuracy)
Qwen3: A family of open-weight language models, used here as the agent backbone at sizes ranging from 4B to 32B parameters
HST-Bench: Human Solution Time Benchmark, a dataset proposed in this paper that uses human solution time as a proxy for task complexity
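
Several of the quantitative terms above (pass@1, entropy, greedy decoding, Pareto frontier, Shapley value) can be made concrete with a short Python sketch. It uses only the generic textbook definitions and is not the paper's implementation. All function names, the toy inputs, and the two-agent value table are hypothetical and chosen purely for illustration.

```python
import math
from itertools import permutations


def pass_at_1(first_attempt_correct):
    """Fraction of tasks whose first generated solution is correct."""
    return sum(first_attempt_correct) / len(first_attempt_correct)


def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def greedy_token(probs):
    """One greedy decoding step: index of the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)


def pareto_frontier(points):
    """Keep the (cost, accuracy) points that no other point dominates.

    A point dominates another if it is no worse on both metrics
    (lower-or-equal cost, higher-or-equal accuracy) and strictly
    better on at least one.
    """
    frontier = []
    for cost, acc in points:
        dominated = any(
            (c2 <= cost and a2 >= acc) and (c2 < cost or a2 > acc)
            for c2, a2 in points
        )
        if not dominated:
            frontier.append((cost, acc))
    return sorted(frontier)


def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of agents to the performance of that coalition.
    Enumerating every ordering is only feasible for a handful of agents.
    """
    totals = {agent: 0.0 for agent in agents}
    orderings = list(permutations(agents))
    for order in orderings:
        coalition = frozenset()
        for agent in order:
            extended = coalition | {agent}
            totals[agent] += value(extended) - value(coalition)
            coalition = extended
    return {agent: total / len(orderings) for agent, total in totals.items()}


# Hypothetical toy data, not results from the paper.
toy_table = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.4,
    frozenset({"b"}): 0.2,
    frozenset({"a", "b"}): 0.8,
}
toy_shapley = shapley_values(["a", "b"], lambda s: toy_table[s])
toy_frontier = pareto_frontier([(1.0, 0.60), (2.0, 0.70), (3.0, 0.65), (4.0, 0.80)])
```

In the toy example the point `(3.0, 0.65)` drops out of the frontier because `(2.0, 0.70)` is both cheaper and more accurate, and the two Shapley values sum to the full-coalition score of 0.8, which is the efficiency property that makes the Shapley value suitable for attribution.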