TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

📝 Paper Summary

Test-time scaling, Tool-augmented reasoning

TUMIX improves reasoning accuracy by running a diverse ensemble of agents—combining text, code, and search capabilities—that iteratively share and refine answers, managed by an adaptive termination policy.

Core Problem

Single-agent LLMs struggle to balance text reasoning, coding, and search within a single context, often underusing tools or failing to identify the best solving strategy for ambiguous questions.

Why it matters:

Current Code Interpreter implementations often fail to invoke code when needed, relying too heavily on weaker textual reasoning
Questions rarely provide explicit cues on whether search, code, or pure reasoning is optimal, making single-strategy agents brittle
Existing test-time scaling methods (like Mixture-of-Agents) focus on scaling LLM count but neglect the diversity of tool-use strategies

Concrete Example: In complex benchmarks like HLE, a model might attempt to solve a math problem via text and fail due to calculation errors. A standard tool-use agent might stick to one strategy. TUMIX runs both text and code agents in parallel; the text agent's failure is corrected when it sees the code agent's successful execution in the next refinement round.

Key Novelty

Tool-Use Mixture (TUMIX) with Heterogeneous Agents

Runs 15+ diverse agents in parallel (some pure text, some using Python, some using Google Search/LLM-search) rather than cloning a single 'best' agent
Implements a 'message passing' refinement where agents update their answers based on the history of *all* other agents' outputs
Uses an LLM-based judge to dynamically stop refinement when confidence is high, saving compute compared to fixed-round methods

Architecture

The TUMIX framework showing parallel agents, iterative refinement loops, and aggregation.

Evaluation Highlights

+3.55% average accuracy improvement over best-performing test-time scaling baselines (Self-MoA, SciMaster) on Gemini-2.5 models
Reduces inference costs to ~49% of fixed-round baselines while maintaining accuracy via adaptive early termination
Raises Gemini-2.5-Pro accuracy on Humanity's Last Exam (HLE) from 21.6% to 34.1% (+12.5 percentage points) with additional test-time scaling

Breakthrough Assessment

8/10

Strong empirical results on very hard benchmarks (HLE, GPQA Diamond). Effectively combines tool use with the 'Mixture of Agents' scaling philosophy, addressing the tool-selection bottleneck by simply running all strategies and aggregating.

⚙️ Technical Details

Problem Definition

Setting: Sequential decision-making under a compute budget with diverse and correlated experts

Inputs: Natural language question q

Outputs: Final consensus answer a*

Pipeline Flow

Agent Pool Initialization (15 diverse strategies)
Parallel Generation (Round 1)
Iterative Refinement (Rounds 2-N)
Adaptive Termination Check
Final Answer Selection
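
The five steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the agent signature, the stop-rule signature, the `max_rounds` default, and the toy agents are all hypothetical, and real agents would call an LLM with tools instead of returning fixed strings.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent signature: (question, prior_answers) -> answer string.
Agent = Callable[[str, List[str]], str]
# Hypothetical stop rule: (round_index, current_answers) -> True to stop.
StopRule = Callable[[int, List[str]], bool]


def run_rounds(question: str, agents: List[Agent], should_stop: StopRule,
               max_rounds: int = 5) -> str:
    """Parallel generation, refinement rounds, then a majority vote."""
    answers: List[str] = []  # empty in round 1, so agents answer from scratch
    for round_index in range(1, max_rounds + 1):
        # Every agent sees the full set of answers from the previous round.
        answers = [agent(question, answers) for agent in agents]
        if should_stop(round_index, answers):
            break
    # Final answer selection: the most common answer across the pool.
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins for a text agent, a code agent, and an agent that updates
# its answer after reading the others (all hypothetical).
def text_agent(_question: str, _prior: List[str]) -> str:
    return "41"  # simulates a calculation slip in pure text reasoning


def code_agent(_question: str, _prior: List[str]) -> str:
    return str(6 * 7)  # simulates executing code to get the exact value


def refining_agent(_question: str, prior: List[str]) -> str:
    # Adopts the executed result once it is visible in the shared answers.
    return "42" if "42" in prior else "41"


def majority_after_round_two(round_index: int, answers: List[str]) -> bool:
    top_count = Counter(answers).most_common(1)[0][1]
    return round_index >= 2 and top_count > len(answers) / 2


final = run_rounds("What is 6 * 7?",
                   [text_agent, code_agent, refining_agent],
                   majority_after_round_two)
print(final)  # prints "42"
```

In round 1 the wrong answer holds the majority; in round 2 the refining agent switches after seeing the code agent's result, which mirrors the correction pattern described in the concrete example.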

System Modules

Agent Pool

Generate diverse candidate solutions using distinct strategies

Model or implementation: Gemini-2.5-Pro / Gemini-2.5-Flash

Refinement Aggregator

Distribute previous answers to all agents for context

Model or implementation: Deterministic routing
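
One way to picture this routing step is a function that concatenates every previous-round answer into the next prompt. The prompt wording and the function name `build_refinement_prompt` are assumptions for illustration; the summary does not give the actual template.

```python
from typing import List


def build_refinement_prompt(question: str, prior_answers: List[str]) -> str:
    """Build the prompt an agent receives in a given round (hypothetical wording)."""
    if not prior_answers:
        # Round 1: no shared context yet.
        return f"Question: {question}\nAnswer the question."
    # Later rounds: every agent receives all answers from the previous round.
    shared = "\n".join(
        f"Agent {i + 1}: {answer}" for i, answer in enumerate(prior_answers)
    )
    return (
        f"Question: {question}\n"
        f"Answers from the previous round:\n{shared}\n"
        "Review these answers, then give your own refined answer."
    )


print(build_refinement_prompt("What is 6 * 7?", ["41", "42"]))
```

Because the routing is deterministic, every agent sees the same shared context; only the agent's own strategy (text, code, or search) differs.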

Termination Judge

Decide whether to stop refinement loops to save cost

Model or implementation: LLM-as-Judge (Gemini-2.5-Pro)
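
A hedged sketch of how such a stop decision could be wired up: the judge is passed in as a plain callable so that any LLM client can be substituted. The `min_rounds` guard, the STOP/CONTINUE protocol, the prompt text, and `toy_judge` are all assumptions made for this example, not details taken from the paper.

```python
from typing import Callable, List


def should_stop(round_index: int, answers: List[str],
                judge: Callable[[str], str], min_rounds: int = 2) -> bool:
    """Ask a judge whether refinement can stop (hypothetical protocol)."""
    if round_index < min_rounds:
        return False  # always allow at least `min_rounds` rounds
    listing = "\n".join(f"- {answer}" for answer in answers)
    prompt = (
        "The agents currently answer:\n"
        f"{listing}\n"
        "Reply STOP if the answers are settled, otherwise CONTINUE."
    )
    return judge(prompt).strip().upper().startswith("STOP")


def toy_judge(prompt: str) -> str:
    # Stand-in for an LLM call: stop only when all listed answers agree.
    listed = [line for line in prompt.splitlines() if line.startswith("- ")]
    return "STOP" if len(set(listed)) == 1 else "CONTINUE"


print(should_stop(1, ["42", "42"], toy_judge))  # False: below min_rounds
print(should_stop(2, ["42", "42"], toy_judge))  # True: answers agree
print(should_stop(3, ["41", "42"], toy_judge))  # False: still disagreeing
```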

Final Selector

Select the final answer from the pool

Model or implementation: Majority Vote / Consistency Check
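
A minimal majority-vote sketch, assuming answers are short strings that can be compared after trimming whitespace and lowercasing. That normalization rule is an assumption; real answers (for example, equivalent math expressions) may need a stronger equivalence check.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most frequent normalized answer, or None if there is none."""
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    # Counter.most_common breaks ties in favor of the answer seen first.
    return Counter(normalized).most_common(1)[0][0]


print(majority_vote([" 42", "42 ", "41"]))  # prints "42"
```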

Novel Architectural Elements

Integration of tool-specific agents (Search vs Code vs Text) into a Mixture-of-Agents style communication loop
Self-optimizing agent design where the LLM generates new agent prompts to maximize ensemble diversity

Modeling

Base Model: Gemini-2.5-Pro and Gemini-2.5-Flash

Compute: Inference-only method. Adaptive termination reduces cost to ~49% of fixed-schedule baselines.

Comparison to Prior Work

vs. MoA: TUMIX uses a single LLM with *diverse tool strategies* rather than multiple different LLMs.
vs. Self-MoA: TUMIX shows that a diverse set of tool-using agents outperforms repeated sampling of a single 'best' agent.
vs. SciMaster: TUMIX employs a broader range of distinct tool agents and a fully connected message-passing refinement graph.

Limitations

Diversity collapse: Agents tend to converge to a single answer (sometimes wrong) after multiple rounds.
Cost: Despite optimization, running 15 parallel agents is computationally expensive compared to standard inference.
Tool dependency: Performance gains rely heavily on the quality of underlying tools (Search API, Code Execution environment).
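
Diversity collapse can be tracked with a simple statistic over each round's answers. The sketch below uses Shannon entropy of the answer distribution; this metric and the sample data are illustrative choices, not something the summary says the authors used.

```python
import math
from collections import Counter
from typing import List


def answer_entropy(answers: List[str]) -> float:
    """Shannon entropy (in bits) of the answers given in one round."""
    total = len(answers)
    counts = Counter(answers)
    # log2(total / count) is non-negative, so a unanimous round yields 0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Made-up answers from four agents over three rounds.
rounds = [
    ["41", "42", "43", "42"],
    ["41", "42", "42", "42"],
    ["42", "42", "42", "42"],
]
for index, answers in enumerate(rounds, start=1):
    print(index, round(answer_entropy(answers), 3))
```

Entropy falling to zero only shows that the agents agree. It says nothing about whether the agreed answer is correct, which is why collapse onto a wrong answer is listed as a limitation.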

📊 Experiments & Results

Evaluation Setup

Reasoning benchmarks requiring text, code, and search capabilities.

Benchmarks:

Humanity's Last Exam (HLE) (Multi-subject academic reasoning)
GPQA Diamond (Expert-level science QA)
AIME 2024 & 2025 (High school competition math)

Metrics:

Accuracy (Success Rate)
Coverage (Probability that at least one agent is correct)
Statistical methodology: Results averaged over three independent runs.
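
The two metrics can be computed as follows. The data layout (one list of agent answers per question) and exact string matching against a gold answer are assumptions made for this sketch; the sample values are invented.

```python
from typing import List


def coverage(agent_answers: List[List[str]], gold: List[str]) -> float:
    """Fraction of questions where at least one agent is correct."""
    hits = sum(
        any(answer == g for answer in answers)
        for answers, g in zip(agent_answers, gold)
    )
    return hits / len(gold)


def mean_agent_accuracy(agent_answers: List[List[str]], gold: List[str]) -> float:
    """Fraction of individual agent answers that are correct."""
    correct = sum(
        answer == g
        for answers, g in zip(agent_answers, gold)
        for answer in answers
    )
    total = sum(len(answers) for answers in agent_answers)
    return correct / total


# Three questions, three agents each (made-up data).
agent_answers = [["A", "B", "A"], ["C", "C", "C"], ["D", "B", "B"]]
gold = ["A", "D", "B"]
print(coverage(agent_answers, gold))             # 2 of 3 questions covered
print(mean_agent_accuracy(agent_answers, gold))  # 4 of 9 answers correct
```

Coverage is an upper bound on what any selection rule can achieve from the pool, so the gap between coverage and final accuracy indicates how much is lost at the selection step.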

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
TUMIX demonstrates significant scaling gains over the base model without test-time scaling.
HLE, GPQA, AIME (Average)	Accuracy	Not reported as a single aggregate number	Not reported as a single aggregate number	+7.8%
HLE, GPQA, AIME (Average)	Accuracy	Not reported as a single aggregate number	Not reported as a single aggregate number	+17.4%
TUMIX outperforms state-of-the-art test-time scaling baselines under equal compute budgets.
HLE, GPQA, AIME (Average)	Accuracy	Not reported as a single aggregate number	Not reported as a single aggregate number	+3.55%
Deep scaling on HLE shows TUMIX surpasses strong baselines including 'Deep Research' variants.
HLE	Accuracy	21.6%	34.1%	+12.5%
HLE	Accuracy	Not explicitly reported as single number	Not explicitly reported as single number	+1.2%

Experiment Figures

Evolution of coverage, individual accuracy, and average score over refinement rounds.

Dynamics of answer correctness on HLE over rounds.

Main Takeaways

Diversity beats repetition: A diverse group of tool-using agents consistently outperforms repeated sampling of a single 'best' agent.
Iterative refinement drives convergence: Accuracy improves in early rounds as agents share insights, but diversity collapses in later rounds, necessitating adaptive termination.
Cost efficiency: Adaptive termination based on LLM confidence maintains accuracy while cutting inference cost by ~51% (to ~49% of the fixed-round baseline).

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Model (LLM) inference and sampling
Familiarity with tool-use in LLMs (Code Interpreter, Search)
Basic knowledge of ensemble methods and majority voting

Key Terms

HLE: Humanity's Last Exam—a challenging multi-subject benchmark designed to test the limits of modern LLMs

GPQA: Graduate-Level Google-Proof Q&A—a dataset of difficult science questions requiring expert knowledge

AIME: American Invitational Mathematics Examination—a high-difficulty high school math competition

Test-time scaling: Improving model performance during inference (not training) by using more compute, such as generating multiple samples or iterative refinement

Code Interpreter: A tool allowing the LLM to write and execute Python code to solve problems

Mixture-of-Agents (MoA): A method where multiple LLM agents generate responses and then iteratively refine them by reading each other's outputs