HELM: Holistic Evaluation of Language Models—a framework for evaluating LMs across a wide range of tasks using standardized metrics
DSPy: A framework for programming with language models that separates program flow from prompt optimization, allowing automatic refinement of instructions and demonstrations
CoT: Chain-of-Thought—a prompting technique where the model is instructed to generate intermediate reasoning steps before the final answer
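A minimal sketch of zero-shot CoT prompting; `build_cot_prompt` is an illustrative name, not part of any library:

```python
# Hypothetical helper: append a reasoning trigger so the model emits
# intermediate steps before its final answer (zero-shot CoT).
def build_cot_prompt(question: str) -> str:
    return (
        f"Q: {question}\n"
        "A: Let's think step by step."
    )

prompt = build_cot_prompt("If a train travels 60 km in 1.5 hours, what is its average speed?")
print(prompt)
```

Few-shot CoT variants instead prepend worked examples whose answers include the reasoning steps.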
BFRS: Bootstrap Few-Shot with Random Search—an optimization algorithm that bootstraps candidate few-shot demonstrations, then repeatedly samples random subsets of them and keeps the subset that scores highest on a validation metric
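The random-search half of BFRS can be sketched as follows; the function and the toy scoring metric are illustrative stand-ins, not the actual DSPy implementation:

```python
import random

def random_search_demos(demos, score_fn, k=2, trials=20, seed=0):
    """Sample random k-demo subsets and keep the highest-scoring one."""
    rng = random.Random(seed)
    best_set, best_score = None, float("-inf")
    for _ in range(trials):
        candidate = rng.sample(demos, k)  # random subset of demonstrations
        s = score_fn(candidate)           # validation score of this subset
        if s > best_score:
            best_set, best_score = candidate, s
    return best_set, best_score

# Toy stand-in metric: total text length (a real run would score task accuracy).
demos = ["2+2=4", "3*3=9", "10-7=3", "12/4=3"]
best, score = random_search_demos(demos, lambda ds: sum(len(d) for d in ds))
```

In practice `score_fn` evaluates the program on a validation set with each candidate demonstration set inserted into the prompt.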
MIPROv2: Multi-prompt Instruction Proposal Optimizer v2—a Bayesian optimizer that jointly searches for the best instructions and few-shot examples using a proposer LM
Performance Ceiling: The maximum achievable performance of a model on a task, approximated here by optimizing the prompt
Bootstrapping: The process of using a model to generate its own training examples (demonstrations) by filtering for correct outputs on a training set
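A minimal sketch of the bootstrapping filter; `model` here is a stub standing in for an LM call, and all names are illustrative:

```python
def bootstrap_demos(examples, model, metric):
    """Keep (input, output) pairs where the model's own output passes the metric."""
    demos = []
    for x, gold in examples:
        pred = model(x)          # run the model on a training input
        if metric(pred, gold):   # keep only outputs judged correct
            demos.append((x, pred))
    return demos

# Toy stand-ins: an "LM" that uppercases, and exact match as the metric.
examples = [("ab", "AB"), ("cd", "XY")]
demos = bootstrap_demos(examples, str.upper, lambda p, g: p == g)
# demos keeps only the example the model got right: [("ab", "AB")]
```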
TV distance: Total Variation distance—a measure of the difference between two probability distributions; for discrete distributions it equals half the sum of the absolute differences in probability assigned to each outcome
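For discrete distributions the quantity is straightforward to compute:

```python
def tv_distance(p, q):
    """Total Variation distance between two discrete distributions,
    given as equal-length sequences of probabilities."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Identical distributions have distance 0; disjoint ones have distance 1.
print(tv_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(tv_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```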
Decision Margin: The gap in probability mass between the top predicted class and the second-best class
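The margin can be read directly off the model's class-probability vector; this helper is an illustrative sketch:

```python
def decision_margin(probs):
    """Gap between the top class probability and the runner-up."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]

print(decision_margin([0.75, 0.25]))       # 0.5
```

A large margin indicates a confident prediction; a margin near zero means the top two classes are nearly tied.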