LLM: Large Language Model—a deep learning model that can recognize, summarize, translate, predict, and generate text
Multi-label classifier: A classification model that can predict multiple correct labels (in this case, multiple suitable LLMs) for a single input instance
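A multi-label router can be sketched as an independent sigmoid gate per candidate LLM, so several labels may fire for one input (unlike softmax, which picks a single winner). The logit values below are made up for illustration:

```python
import math

def multilabel_predict(logits, threshold=0.5):
    """Map per-LLM logits to a set of selected model indices.
    Each label passes through its own sigmoid, so zero, one,
    or many LLMs can be marked suitable for the same input."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical router logits for four candidate LLMs on one question:
selected = multilabel_predict([2.0, -1.0, 0.3, -3.0])
print(selected)  # indices whose sigmoid probability clears the threshold
```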
Majority Voting: An ensemble method where the final answer is determined by the most frequent response among the selected models
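As a minimal sketch, majority voting over the selected models' responses reduces to a frequency count (ties here break by first occurrence, one of several possible tie-breaking rules):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the selected models.
    Counter.most_common preserves insertion order on ties, so the
    earliest-seen answer wins a tie."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical responses from three routed models to one question:
print(majority_vote(["42", "42", "41"]))  # the most frequent answer wins
```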
RoBERTa: A robustly optimized BERT pretraining approach; a transformer-based model used here as the lightweight router backbone
Inference latency: The time taken by a model to process an input and generate an output
Oracle: A theoretical upper-bound performance metric—the accuracy achieved if the system always selected a subset of models containing the correct answer, whenever at least one candidate model answers correctly
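The oracle bound can be computed by counting a question as correct whenever any model in the chosen subset produced the gold answer. The example answers and gold labels below are hypothetical:

```python
def oracle_accuracy(subset_answers, gold):
    """Upper-bound accuracy: a question counts as solved if ANY model
    in its (perfectly chosen) subset matches the gold answer."""
    hits = sum(
        any(ans == g for ans in answers)
        for answers, g in zip(subset_answers, gold)
    )
    return hits / len(gold)

# Hypothetical per-question answers from the selected subsets vs. gold labels:
preds = [["A", "B"], ["C"], ["B", "D"]]
gold = ["B", "C", "A"]
print(oracle_accuracy(preds, gold))  # 2 of 3 subsets contain the gold answer
```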
GSM8K: Grade School Math 8K—a benchmark dataset of 8.5K high-quality, linguistically diverse grade school math word problems
MMLU: Massive Multitask Language Understanding—a benchmark measuring knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings