RouteMoA: Dynamic Routing without Pre-Inference Boosts Efficient Mixture-of-Agents

📝 Paper Summary

Multi-Agent Collaboration Efficiency Optimization for LLMs

RouteMoA reduces Mixture-of-Agents computational costs by using a lightweight scorer to pre-filter models without inference, followed by a mixture of judges for refinement.

Core Problem

Existing Mixture-of-Agents (MoA) methods are computationally expensive because they require inference from all models before filtering or aggregating outputs.

Why it matters:

Standard MoA scales poorly: executing all models at every layer multiplies cost and latency, making large model pools infeasible
Sparse MoA (SMoA) filters responses, but only after every model has already run inference, so it saves no compute on the generation step itself

Concrete Example: In a math query, standard MoA might invoke a biology model (Bio-Medical-Llama) and a math model (Qwen-Math). The biology model wastes compute generating a poor answer, which is then discarded. RouteMoA predicts the biology model is unsuitable purely from the query and never invokes it.

Key Novelty

Dynamic Routing without Pre-Inference (RouteMoA)

Introduces a lightweight scorer (SLM) that predicts model performance based on the query alone, filtering out weak models before they run
Refines these scores using a 'mixture of judges' that incorporates self-assessment (model confidence) and cross-assessment (peer review) from previous layers without extra inference

Architecture

The RouteMoA workflow across multiple layers, showing the interaction between the Scorer, Model Ranking, and Mixture of Judges.

Evaluation Highlights

Reduces inference cost by 89.8% and latency by 63.6% compared to standard MoA on a large-scale 15-model pool
Achieves 78.6% average accuracy on 30 datasets, outperforming standard MoA (71.3%) and Sparse MoA (69.7%)
Scorer achieves 97.9% Top-3 Hit Rate, effectively identifying high-potential models without running them
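The Top-3 Hit Rate above can be illustrated with a small sketch. The definition used here is an assumption, since this summary does not spell it out: a query counts as a hit when at least one of the scorer's k highest-scored models is a model that actually answers the query correctly. The function name, model names, and scores are all hypothetical.

```python
def top_k_hit_rate(predicted_scores, correct_models, k=3):
    """Fraction of queries whose top-k predicted models include a correct model.

    predicted_scores: one dict per query, mapping model name -> predicted score.
    correct_models:   one set per query, holding the models that answered correctly.
    NOTE: this hit definition is an assumption, not taken from the paper.
    """
    if not predicted_scores:
        return 0.0
    hits = 0
    for scores, correct in zip(predicted_scores, correct_models):
        # Rank models by predicted score, highest first, and keep the top k.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        if any(model in correct for model in top_k):
            hits += 1
    return hits / len(predicted_scores)


# Toy data (made up): two queries, four hypothetical models.
predicted = [
    {"math": 0.9, "bio": 0.1, "code": 0.4, "general": 0.6},
    {"math": 0.2, "bio": 0.3, "code": 0.8, "general": 0.7},
]
correct = [{"math"}, {"math"}]
rate = top_k_hit_rate(predicted, correct, k=3)
print(rate)
```

In this toy case the first query is a hit and the second is a miss, because "math" is ranked last for the second query.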

Breakthrough Assessment

8/10

Significantly improves the practicality of MoA by addressing its primary bottleneck (cost/latency) while maintaining or improving accuracy. The 'no pre-inference' routing is a crucial efficiency step.

⚙️ Technical Details

Problem Definition

Setting: Layer-wise selection of a subset of LLMs from a pool P to generate responses for query x, maximizing quality while minimizing cost

Inputs: User query x and a pool P of N heterogeneous LLMs

Outputs: Final aggregated response generated by a synthesizer model in the last layer

Pipeline Flow

Score Acquisition (SLM Scorer prediction based on query)
Model Ranking & Selection (Top-k selection based on score/cost/latency)
Generation (Selected models generate responses)
Mixture of Judges (Refinement for next layer using self/cross-assessment)
Aggregation (Final layer synthesis)
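The five steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `route_moa`, `scorer`, `refine`, and `synthesizer` are hypothetical names, the models are stub callables, and selection here uses score only (the actual selector also considers cost and latency).

```python
from typing import Callable, Dict, List

# Hypothetical model interface: (query, previous-layer responses) -> response.
Model = Callable[[str, List[str]], str]


def route_moa(query, models: Dict[str, Model], scorer, refine, synthesizer,
              num_layers=2, k=2):
    scores = scorer(query)        # 1. score acquisition; no LLM is invoked here
    responses: List[str] = []
    calls: List[str] = []         # record of which models were actually run
    for _ in range(num_layers):
        # 2. ranking and selection (score only in this sketch)
        selected = sorted(scores, key=scores.get, reverse=True)[:k]
        # 3. generation: only the selected models run
        outputs = {name: models[name](query, responses) for name in selected}
        calls.extend(selected)
        # 4. refinement of scores for the next layer from this layer's outputs
        scores = refine(scores, outputs)
        responses = list(outputs.values())
    # 5. aggregation by a synthesizer
    return synthesizer(query, responses), calls


# Stub components (all made up for illustration).
models = {
    "math": lambda q, prev: "math answer",
    "bio": lambda q, prev: "bio answer",
    "general": lambda q, prev: "general answer",
}
scorer = lambda q: {"math": 0.9, "bio": 0.1, "general": 0.5}
refine = lambda scores, outputs: scores  # placeholder: keeps the prior scores
synthesizer = lambda q, responses: " | ".join(responses)

final, calls = route_moa("What is 2 + 2?", models, scorer, refine, synthesizer)
print(final, calls)
```

The low-scored "bio" stub is never called, which is the source of the compute savings relative to running every model at every layer.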

System Modules

Scorer (Routing)

Predict coarse-grained performance scores for all candidate models based on query

Model or implementation: mDeBERTaV3-base (86M params) with projection head
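A minimal sketch of embedding-based scoring follows. It assumes the score for each model is the cosine similarity between the query embedding and that model's projected embedding; the choice of cosine similarity is an assumption, and the 3-dimensional toy vectors stand in for the 768-dimensional embeddings that the encoder and projection head would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_models(query_embedding, model_embeddings):
    """Score every candidate model from the query alone; no LLM is run."""
    return {name: cosine(query_embedding, emb)
            for name, emb in model_embeddings.items()}


# Toy embeddings (made up). In practice the query vector would come from the
# encoder and the model vectors from the learned, projected LLM embeddings.
query_embedding = [1.0, 0.0, 0.0]
model_embeddings = {
    "math": [0.9, 0.1, 0.0],
    "bio": [0.0, 1.0, 0.0],
}
scores = score_models(query_embedding, model_embeddings)
print(scores)
```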

Model Selector (Routing)

Select top-k models based on scores, cost, and latency

Model or implementation: Rule-based ranking algorithm
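One possible rule-based ranking is sketched below. The exact rule is not given in this summary, so the ordering used here (highest score first, then lower cost, then lower latency as tie-breakers) is an assumption, as are the `Candidate` fields, units, and example values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    score: float    # predicted performance score
    cost: float     # assumed unit: USD per call
    latency: float  # assumed unit: seconds per call


def select_top_k(candidates, k):
    """Rank by score (descending), breaking ties by cost, then latency."""
    ranked = sorted(candidates, key=lambda c: (-c.score, c.cost, c.latency))
    return [c.name for c in ranked[:k]]


# Hypothetical pool: two models tie on score, so the cheaper one ranks first.
pool = [
    Candidate("large-math", score=0.9, cost=0.8, latency=4.0),
    Candidate("small-math", score=0.9, cost=0.1, latency=1.0),
    Candidate("bio", score=0.2, cost=0.3, latency=2.0),
]
selected = select_top_k(pool, k=2)
print(selected)
```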

Mixture of Judges

Update model scores for subsequent layers using outputs from previous layers

Model or implementation: Weighted combination of Scorer, Self-Assessment, and Cross-Assessment
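The weighted combination can be sketched as follows. The weights, the function name, and the fallback for models that did not run in the previous layer (they keep their scorer prior, since no self- or cross-assessment exists for them) are assumptions made for illustration; the paper's actual weighting may differ.

```python
def mixture_of_judges(prior, self_scores, cross_scores,
                      w_prior=0.4, w_self=0.2, w_cross=0.4):
    """Combine scorer prior with self- and cross-assessment from the last layer.

    prior:        model -> query-based scorer score (available for all models)
    self_scores:  model -> self-assessment (only for models that ran)
    cross_scores: model -> peer assessment (only for models that ran)
    The weights are hypothetical placeholders.
    """
    refined = {}
    for name, p in prior.items():
        if name in self_scores and name in cross_scores:
            refined[name] = (w_prior * p
                             + w_self * self_scores[name]
                             + w_cross * cross_scores[name])
        else:
            # Model was not run in the previous layer: keep the prior score.
            refined[name] = p
    return refined


# Toy scores (made up): "a" and "b" ran in the previous layer, "c" did not.
prior = {"a": 0.5, "b": 0.8, "c": 0.3}
self_scores = {"a": 1.0, "b": 0.5}
cross_scores = {"a": 1.0, "b": 0.0}
refined = mixture_of_judges(prior, self_scores, cross_scores)
print(refined)
```

In this toy case model "b" starts with the highest prior but drops below "a" once its peers rate its output poorly, which is how output-based signals can correct an initial routing error.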

Novel Architectural Elements

Prediction-based routing (Scorer) that operates purely on query embeddings without requiring any LLM inference first
Mixture of Judges mechanism that combines prior (query-based) and posterior (output-based) signals to refine routing dynamically across layers

Modeling

Base Model: mDeBERTaV3-base for the Scorer; various LLMs (Qwen, Llama, Mistral) for the agent pool

Training Method: Dual Contrastive Learning for the Scorer

Objective Functions:

Purpose: Ensure embeddings of suitable models are closer to query embedding.

Formally: Sample-LLM contrastive loss maximizing similarity between query and top-K capable models vs bottom-K models.

Purpose: Ensure semantically similar queries have closer embeddings.

Formally: Sample-sample contrastive loss clustering similar queries.
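For orientation, the two losses can be written in a generic InfoNCE-style form. This is one common way to express such contrastive objectives, not necessarily the paper's exact formulation. All symbols are assumptions introduced here: q is the query embedding, e_m the projected embedding of model m, T^+ and T^- the top-K and bottom-K model sets for the query, q^+ a query from the same cluster, N(q) a set of queries from other clusters, sim a similarity function such as cosine, and tau a temperature.

```latex
\[
\mathcal{L}_{\text{sample-LLM}}
  = -\frac{1}{|\mathcal{T}^{+}|} \sum_{m^{+} \in \mathcal{T}^{+}}
    \log \frac{\exp\big(\operatorname{sim}(q, e_{m^{+}})/\tau\big)}
              {\exp\big(\operatorname{sim}(q, e_{m^{+}})/\tau\big)
               + \sum_{m^{-} \in \mathcal{T}^{-}}
                 \exp\big(\operatorname{sim}(q, e_{m^{-}})/\tau\big)}
\]

\[
\mathcal{L}_{\text{sample-sample}}
  = -\log \frac{\exp\big(\operatorname{sim}(q, q^{+})/\tau\big)}
               {\exp\big(\operatorname{sim}(q, q^{+})/\tau\big)
                + \sum_{q^{-} \in \mathcal{N}(q)}
                  \exp\big(\operatorname{sim}(q, q^{-})/\tau\big)}
\]
```

Minimizing the first term pulls the query toward models that handle it well and away from those that do not; minimizing the second pulls together queries from the same cluster.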

Adaptation: Projection of LLM embeddings to 768-dim space

Trainable Parameters: Scorer embeddings and projection head

Training Data:

Queries collected from math, reasoning, coding, biomedical datasets
Ground-truth answers used to label model performance for training data

Key Hyperparameters:

alpha: 0.2
lambda: 0.5
learning_rate: 5e-5
weight_decay: 0.01
batch_size: 64
clusters: 66

Compute: Experiments run on 80GB GPUs

Comparison to Prior Work

vs. MoA: RouteMoA selects a subset of models per layer instead of using all, reducing compute
vs. Sparse MoA: RouteMoA filters models *before* inference using a lightweight scorer, whereas SMoA generates first then filters
vs. RouteLLM: RouteMoA incorporates posterior knowledge (outputs) via mixture of judges, not just prior query routing

Limitations

Requires training a domain-specific scorer; performance depends on the quality/diversity of scorer training data
Cross-assessment adds some latency (though less than full inference)
Performance gain depends on the heterogeneity of the model pool (benefits most when models have distinct strengths)

Reproducibility

The scorer architecture (mDeBERTaV3-base) and loss functions are detailed, and hyperparameters (alpha, lambda, learning rate) are provided. Code availability is not explicitly stated in the paper text or abstract.

📊 Experiments & Results

Evaluation Setup

Multi-agent generation across varying tasks and model pool sizes

Benchmarks:

MATH-500 (Math Reasoning)
ARC-Challenge (Reasoning)
MBPP (Code Generation)
MMLU-bio (Biomedical Knowledge)
AGIEval-Gaokao (Out-of-Distribution / General Exams)

Metrics:

Accuracy
Cost (USD)
Latency (seconds)

Statistical methodology: Paired t-test reported for the small-scale pool comparison against SMoA (p < 0.05)

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Large-scale experiments (15 models) demonstrate massive efficiency gains and higher accuracy compared to baselines that struggle to scale.
Average across 30 datasets	Accuracy	71.3	78.6	+7.3
Average across 30 datasets	Cost	Not reported as an absolute figure	Not reported as an absolute figure	-89.8% (relative)
Small-scale experiments (5 models) confirm RouteMoA is efficient even with fewer options.
Average across 5 datasets	Cost (USD)	36.03	6.71	-29.32
Average across 5 datasets	Accuracy	81.9	83.1	+1.2
Ablation studies show the necessity of the mixture-of-judges components.
Average across 5 datasets	Accuracy	82.6	83.1	+0.5
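The small-scale cost row reports an absolute difference in USD, while the large-scale row reports a relative reduction. To compare the two, the absolute figures from the table can be converted to a percentage; the helper name below is hypothetical.

```python
def relative_reduction(baseline, new):
    """Percentage reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline * 100


# Small-scale pool (5 models), cost in USD, taken from the table above.
small_pool_reduction = relative_reduction(36.03, 6.71)
print(f"{small_pool_reduction:.1f}%")  # 81.4%
```

The small-scale pool thus sees roughly an 81% cost reduction, compared with the 89.8% reported for the 15-model pool.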

Experiment Figures

Case study on RACE-high dataset showing how scores evolve across layers.

Main Takeaways

Dynamic routing based on query embeddings allows massive scaling of agent pools (up to 15 models tested) by avoiding the O(N) inference cost of standard MoA.
The 'Mixture of Judges' approach effectively corrects initial scorer errors by using model outputs (posterior info) to refine routing in later layers.
RouteMoA generalizes well to out-of-distribution tasks (AGIEval-Gaokao), outperforming SMoA while reducing cost and latency.
Cost reductions are dramatic (80-90%), making large-scale ensemble methods practical for real-world deployment.

📚 Prerequisite Knowledge

Prerequisites

Mixture-of-Agents (MoA) architecture
Embedding-based retrieval / routing
Contrastive learning

Key Terms

Mixture-of-Agents (MoA): A layered architecture where multiple LLMs generate responses, which are then aggregated and refined by subsequent layers of LLMs

Pre-inference: Executing a model to get its output before deciding whether to use it; RouteMoA avoids this to save cost

SLM: Small Language Model—used here as a lightweight scorer (86M parameters) to predict LLM performance

Self-assessment: A model evaluating its own confidence or output quality

Cross-assessment: One model evaluating the quality of another model's output

mDeBERTaV3-base: A small, pre-trained language model used to encode queries into embeddings for the scorer

Prior knowledge: Information available before model execution (e.g., query content)

Posterior knowledge: Information available after model execution (e.g., generated response, confidence score)