ProxRouter: Proximity-Weighted LLM Query Routing for Improved Robustness to Outliers

📝 Paper Summary

LLM Routing, Model Selection, Efficient Inference

ProxRouter improves nonparametric query routing by using exponentially tilted, proximity-weighted aggregation of training examples to better estimate model performance for outlier queries without needing retraining.

Core Problem

Existing nonparametric routers (clustering or nearest-neighbor based) struggle to generalize to outlier queries—those unseen or rare in training data—leading to poor accuracy and cost estimates.

Why it matters:

Using frontier models for every query is prohibitively expensive, but cheaper models fail on hard queries; efficient routing is essential for cost-effective AI platforms.
Training sets for routers are costly to maintain and cannot cover all emerging use cases (e.g., new programming languages), causing performance drops when user queries drift from training distributions.

Concrete Example: A router trained on general reasoning tasks might see a new query type, like a code snippet in a rare language. A standard clustering router assigns it to a generic 'math' cluster and mispredicts its difficulty. ProxRouter weights closer neighbors more heavily, correctly routing it to a code-specialized model.

Key Novelty

Proximity-Weighted Aggregation with Bias-Variance Control

Generalizes standard clustering (K-Means) and nearest-neighbor (kNN) routers into a unified probabilistic framework where estimates are weighted averages of reference points.
Applies an 'exponential tilt' to these weights: reference points (clusters or neighbors) closer to the test query get exponentially higher influence, reducing bias for outliers while maintaining stability for inliers.
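
The contrast between hard assignment and proximity-weighted aggregation can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes the tilt takes the form exp(-distance / tau) with tau acting as a temperature, and the function name tilted_weights, the toy accuracies, and the toy distances are all hypothetical.

```python
import numpy as np

def tilted_weights(distances, tau, base_weights=None):
    """Assumed tilt: base_weights * exp(-distance / tau), normalized to sum to 1."""
    d = np.asarray(distances, dtype=float)
    base = np.ones_like(d) if base_weights is None else np.asarray(base_weights, dtype=float)
    # Shift by the minimum distance before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    w = base * np.exp(-(d - d.min()) / tau)
    return w / w.sum()

# Toy reference set: per-cluster accuracy of one candidate model (hypothetical values).
cluster_acc = np.array([0.90, 0.40, 0.60])
# An outlier query that is almost equidistant from clusters 0 and 1.
dist_to_query = np.array([1.0, 1.1, 3.0])

# Hard assignment trusts only the single nearest cluster.
hard_estimate = cluster_acc[np.argmin(dist_to_query)]

# Proximity weighting blends clusters 0 and 1 and nearly ignores cluster 2.
weights = tilted_weights(dist_to_query, tau=0.5)
soft_estimate = float(weights @ cluster_acc)

print("weights:", np.round(weights, 3))
print("hard:", hard_estimate, "soft:", round(soft_estimate, 3))
```

In this toy case the hard estimate takes the accuracy of cluster 0 alone, while the weighted estimate lands between the accuracies of the two nearby clusters, which is the behavior the summary attributes to outlier queries.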

Architecture

Conceptual illustration of Proximity-Weighted Aggregation vs. Hard Assignment

Evaluation Highlights

+8.1 percentage-point improvement in Area Under the Curve (AUC) for the nearest-neighbor setting (38.5% to 46.6%) on outlier math tasks (GSM8k, SVAMP).
Outperforms standard K-Means routing on Leave-Task-Out benchmarks (Hellaswag, MedQA), bringing performance closer to an 'AllSee' upper bound that trains on those tasks.
Achieves higher accuracy at lower costs than baselines by effectively identifying when to use fine-tuned specialized models instead of generic large models.

Breakthrough Assessment

7/10

Provides a mathematically grounded unification of nonparametric routers and significantly improves robustness to outliers without retraining. A solid, practical contribution to efficient inference systems.

⚙️ Technical Details

Problem Definition

Setting: Select a model m from pool M to maximize the objective U(m)(x) = acc(m)(x) - λ * cost(m)(x), where λ ≥ 0 is the penalty that trades estimated accuracy against estimated cost

Inputs: Test query x represented as a fixed-dimensional encoding

Outputs: Selected model m* that maximizes the estimated objective
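
As a small illustration of the selection rule above, the sketch below picks the model with the highest estimated accuracy minus lambda times estimated cost. The model names, accuracy values, and cost units are hypothetical; only the form of the objective comes from the summary.

```python
def select_model(est_acc, est_cost, lam):
    """Return (model_name, utility) maximizing est_acc[m] - lam * est_cost[m]."""
    utilities = {m: est_acc[m] - lam * est_cost[m] for m in est_acc}
    best = max(utilities, key=utilities.get)
    return best, utilities[best]

# Hypothetical per-query estimates for a two-model pool.
est_acc = {"small-7b": 0.62, "large-70b": 0.81}
est_cost = {"small-7b": 1.0, "large-70b": 10.0}

# A larger lambda penalizes cost more heavily and favors the cheaper model.
cheap_choice, _ = select_model(est_acc, est_cost, lam=0.05)
# A smaller lambda lets the accuracy gap dominate.
accurate_choice, _ = select_model(est_acc, est_cost, lam=0.01)

print(cheap_choice, accurate_choice)
```

Sweeping lambda over a range of values is one way to trace out an accuracy-cost tradeoff curve for a router.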

Pipeline Flow

Query Encoding
Reference Retrieval
Performance Estimation (ProxRouter Aggregation)
Model Selection

System Modules

Query Encoder

Convert input text query into a fixed-dimensional vector

Model or implementation: MPNet-base

Reference Retriever

Identify relevant training data points (clusters or neighbors)

Model or implementation: K-Means or kNN search

Estimator

Estimate accuracy and cost for every model in the pool using proximity-weighted aggregation

Model or implementation: ProxRouter Algorithm

Selector

Select the optimal model based on estimated objectives

Model or implementation: Argmax
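
The four modules can be strung together roughly as follows for the nearest-neighbor variant. This is a sketch under stated assumptions rather than the paper's code: the query is assumed to be already encoded (for example by MPNet-base), Euclidean distance and the exp(-distance / tau) weighting are assumptions, and the function name route and all toy data are hypothetical.

```python
import numpy as np

def route(query_vec, ref_vecs, ref_acc, ref_cost, k, tau, lam):
    """Route one encoded query and return the index of the selected model.

    ref_vecs: (n, d) embeddings of reference (training) queries
    ref_acc:  (n, n_models) observed accuracy of each model on each reference
    ref_cost: (n, n_models) observed cost of each model on each reference
    """
    # Reference retrieval: the k nearest references by Euclidean distance.
    dists = np.linalg.norm(ref_vecs - query_vec, axis=1)
    nn = np.argsort(dists)[:k]
    # Performance estimation: proximity-weighted average over the neighbors.
    w = np.exp(-(dists[nn] - dists[nn].min()) / tau)
    w /= w.sum()
    est_acc = w @ ref_acc[nn]
    est_cost = w @ ref_cost[nn]
    # Model selection: argmax of estimated accuracy minus lam * estimated cost.
    return int(np.argmax(est_acc - lam * est_cost))

# Toy reference set: model 0 is cheap and only succeeds near the origin,
# model 1 is expensive and succeeds everywhere.
ref_vecs = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
ref_acc = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
ref_cost = np.array([[1.0, 10.0]] * 4)

near = route(np.array([0.2, 0.1]), ref_vecs, ref_acc, ref_cost, k=2, tau=1.0, lam=0.05)
far = route(np.array([5.0, 5.5]), ref_vecs, ref_acc, ref_cost, k=2, tau=1.0, lam=0.05)
print(near, far)
```

The query near the origin goes to the cheap model, while the query in the region where the cheap model fails goes to the expensive one.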

Novel Architectural Elements

Proximity-weighted aggregation mechanism that dynamically adjusts weights based on query-reference distance during inference
Unified mathematical framework encompassing both K-Means and kNN routers as special cases (tau=0 and tau=infinity)
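
One way to write the unified estimator, consistent with the two limits named above, is given below. The exact parametrization in the paper may differ; the symbols R(x) for the retrieved reference set, \pi_r for base weights, d for the distance, and a_r^{(m)} for the recorded accuracy of model m at reference r are assumptions introduced here for illustration.

```latex
\hat{a}^{(m)}(x) = \sum_{r \in R(x)} w_r(x)\, a_r^{(m)},
\qquad
w_r(x) = \frac{\pi_r \exp\!\big(-d(x, r)/\tau\big)}
              {\sum_{r' \in R(x)} \pi_{r'} \exp\!\big(-d(x, r')/\tau\big)}

% Limiting cases under this assumed form:
% \tau \to 0:      w_r(x) \to 1 for r = \arg\min_{r'} d(x, r')   (hard assignment to the nearest cluster)
% \tau \to \infty: w_r(x) \to \pi_r / \sum_{r'} \pi_{r'}         (uniform averaging over k neighbors when all \pi_r are equal)
```

Under this form, the cost estimate follows by replacing a_r^{(m)} with the recorded cost of model m at reference r.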

Modeling

Base Model: MPNet-base (for encoding)

Comparison to Prior Work

vs. Parametric Routers: ProxRouter is training-free and adapts to new data without retraining
vs. KMeans: ProxRouter uses soft, distance-weighted assignment instead of hard assignment to the nearest cluster
vs. kNN: ProxRouter weights closer neighbors more heavily via exponential tilting rather than averaging them uniformly
vs. RouterBench [not cited in paper]: Focuses on outlier robustness specifically rather than just general routing performance

Limitations

Relies on the quality of the embedding space; if the encoder (MPNet) fails to capture task relevance, routing will fail.
Inference overhead scales with the size of the reference set for kNN (though clustering mitigates this).
Requires hyperparameter tuning of tau (temperature) on held-out data.

Reproducibility

The paper does not explicitly state whether code is available. The method relies on standard components (MPNet, K-Means, kNN) and the aggregation formula is fully specified. The evaluation uses 14 open-source LLMs and 10 public datasets (MMLU, GSM8k, etc.).

📊 Experiments & Results

Evaluation Setup

Routing queries from 10 diverse datasets to a pool of 14 LLMs (7B to 70B parameters)

Benchmarks:

MMLU, ARC-C, Hellaswag, PIQA, Winogrande, MedQA, GSM8k, SVAMP, LogiQA, BoolQ (Reasoning, Knowledge, Arithmetic)

Metrics:

Area Under the Curve (AUC) of Accuracy-Cost plot
Accuracy at fixed cost
Cost at fixed accuracy
Statistical methodology: Not explicitly reported in the paper
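
The AUC metric can be approximated from a handful of (cost, accuracy) operating points, for instance one per value of the cost penalty. The sketch below uses the trapezoid rule and divides by the cost span so that the result is on the same scale as accuracy. That normalization is an assumption, since the summary does not specify how the paper normalizes the area, and the function name accuracy_cost_auc and the sample points are hypothetical.

```python
import numpy as np

def accuracy_cost_auc(costs, accuracies):
    """Area under the accuracy-vs-cost curve (trapezoid rule), divided by the cost span."""
    c = np.asarray(costs, dtype=float)
    a = np.asarray(accuracies, dtype=float)
    # Sort operating points by cost so the curve is traversed left to right.
    order = np.argsort(c)
    c, a = c[order], a[order]
    span = c[-1] - c[0]
    if span <= 0:
        raise ValueError("need at least two operating points with distinct costs")
    area = np.sum((c[1:] - c[:-1]) * (a[1:] + a[:-1]) / 2.0)
    return float(area / span)

# Hypothetical operating points for one router.
auc = accuracy_cost_auc([1.0, 2.0, 4.0], [0.5, 0.7, 0.8])
print(round(auc, 3))
```

With this normalization, a router whose accuracy is constant across all costs scores exactly that accuracy.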

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Math Tasks (GSM8k, SVAMP)	AUC (Normalized Area Under Cost-Accuracy Curve)	38.5	46.6	+8.1
Hellaswag, MedQA (Outliers)	AUC	See qualitative description	See qualitative description	Positive improvement

Notes:

Math Tasks (GSM8k, SVAMP): ProxRouter (kNN-Prox) significantly improves routing performance on outlier arithmetic tasks compared to a standard kNN baseline.
Hellaswag, MedQA (Outliers): ProxRouter (KKM-Prox) improves upon standard K-Means routing in Leave-Task-Out scenarios where specific tasks are entirely missing from training clusters.

Experiment Figures

Accuracy-Cost tradeoff curves for Leave-Task-Out scenarios (Hellaswag/MedQA and LogiQA/BBH/CSQA)

Impact of the temperature parameter (tau) on Bias and Variance

Main Takeaways

ProxRouter consistently improves outlier robustness across both clustering (KKMeans) and nearest-neighbor (kNN) backbones.
The method nearly matches the performance of an 'AllSee' oracle (trained on the test distribution) for outlier tasks, without actually seeing those tasks during training.
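Exponential tilting balances bias and variance: a low tau concentrates weight on the nearest references (low bias, high variance, approaching nearest-reference assignment), while a high tau spreads weight evenly across references (high bias, low variance, approaching uniform averaging).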
Inlier performance is preserved; the method does not degrade performance on familiar tasks while improving on novel ones.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and inference costs
Sentence embeddings / encoders
Clustering (K-Means) and Nearest Neighbors (kNN)
Bias-Variance tradeoff in estimation

Key Terms


Nonparametric router: A routing system that estimates performance based on similarity to training examples rather than a trained neural network

Parametric router: A routing system that uses a trained neural network (e.g., MLP) to predict model performance

Exponential tilt: A reweighting technique that multiplies a base distribution by an exponential function of a feature (here, proximity) to shift probability mass

AUC: Area Under the Curve—here referring to the area under the accuracy-cost tradeoff curve

Inlier: A query that comes from a task or distribution well-represented in the training set

Outlier: A query from a task or distribution not present or poorly represented in the training set

Lagrangian relaxation: A method to convert a constrained optimization problem (max accuracy subject to cost) into an unconstrained one with a penalty parameter lambda

MPNet: A sentence embedding model used to convert text queries into vector representations