LLM: Large Language Model—a deep learning model that can recognize, summarize, translate, predict, and generate text
Multi-label classifier: A classification model that can predict multiple correct labels (in this case, multiple suitable LLMs) for a single input instance
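A multi-label router can be sketched as an independent sigmoid gate per candidate LLM, so several labels may fire for one input (unlike softmax, which picks a single winner). The logit values below are made up for illustration:

```python
import math

def multilabel_predict(logits, threshold=0.5):
    """Map per-LLM logits to a set of selected model indices.
    Each label passes through its own sigmoid, so zero, one,
    or many LLMs can be marked suitable for the same input."""
    probs = [1 / (1 + math.exp(-z)) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical router logits for four candidate LLMs on one question:
selected = multilabel_predict([2.0, -1.0, 0.3, -3.0])
print(selected)  # indices whose sigmoid probability clears the threshold
```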
Majority Voting: An ensemble method where the final answer is determined by the most frequent response among the selected models
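As a minimal sketch, majority voting over the selected models' responses reduces to a frequency count (ties here break by first occurrence, one of several possible tie-breaking rules):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the selected models.
    Counter.most_common preserves insertion order on ties, so the
    earliest-seen answer wins a tie."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical responses from three routed models to one question:
print(majority_vote(["42", "42", "41"]))  # the most frequent answer wins
```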
RoBERTa: A robustly optimized BERT pretraining approach; a transformer-based model used here as the lightweight router backbone
Inference latency: The time taken by a model to process an input and generate an output
Oracle: A theoretical upper-bound performance metric—the accuracy achieved if the system always selected a subset of models containing the correct answer, whenever at least one candidate model answers correctly
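The oracle bound can be computed by counting a question as correct whenever any model in the chosen subset produced the gold answer. The example answers and gold labels below are hypothetical:

```python
def oracle_accuracy(subset_answers, gold):
    """Upper-bound accuracy: a question counts as solved if ANY model
    in its (perfectly chosen) subset matches the gold answer."""
    hits = sum(
        any(ans == g for ans in answers)
        for answers, g in zip(subset_answers, gold)
    )
    return hits / len(gold)

# Hypothetical per-question answers from the selected subsets vs. gold labels:
preds = [["A", "B"], ["C"], ["B", "D"]]
gold = ["B", "C", "A"]
print(oracle_accuracy(preds, gold))  # 2 of 3 subsets contain the gold answer
```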
GSM8K: Grade School Math 8K—a benchmark dataset of 8.5K high-quality, linguistically diverse grade school math word problems
MMLU: Massive Multitask Language Understanding—a benchmark measuring knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings