PPO: Proximal Policy Optimization—an RL algorithm that updates a policy in stable steps to maximize a reward signal
MRR: Mean Reciprocal Rank—an evaluation metric for systems that return a ranked list of candidate answers per query; it averages, over all queries, the reciprocal of the rank at which the first correct answer appears
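The metric above is straightforward to compute; a minimal sketch (function name and toy data are illustrative, not from the source):

```python
def mean_reciprocal_rank(ranked_lists, correct_answers):
    """Average of 1/rank of the first correct item, over all queries.

    Queries whose ranking never contains the correct answer contribute 0.
    """
    total = 0.0
    for ranking, correct in zip(ranked_lists, correct_answers):
        for rank, candidate in enumerate(ranking, start=1):
            if candidate == correct:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Toy example: correct answer "a" ranked 1st, 2nd, and 3rd across three queries
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
answers = ["a", "a", "a"]
print(mean_reciprocal_rank(rankings, answers))  # (1 + 1/2 + 1/3) / 3 ≈ 0.611
```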
Passage Ranking: The task of ordering text passages based on their relevance to a specific query
LLM Routing: Selecting the best Large Language Model to handle a specific user query based on performance and cost trade-offs
Iterative Decoding: Generating a result step-by-step where the output of one step modifies the input for the next, here used to eliminate candidates one by one
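The elimination pattern described above can be sketched generically: each step scores the remaining candidates and drops the worst, so the output of one step becomes the input of the next. This is a hypothetical illustration of the pattern, not the source's actual decoding procedure; `score_fn` stands in for whatever scoring model is used:

```python
def iterative_elimination(candidates, score_fn):
    """Repeatedly drop the lowest-scoring candidate until one remains."""
    remaining = list(candidates)
    while len(remaining) > 1:
        worst = min(remaining, key=score_fn)
        remaining.remove(worst)  # this step's output is the next step's input
    return remaining[0]

# Toy example with identity scoring: the largest value survives
print(iterative_elimination([3, 9, 1, 7], score_fn=lambda x: x))  # 9
```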
SOTA: State-of-the-Art—the current best performance achievable by existing methods
GSM8K: Grade School Math 8K—a dataset of grade school math word problems used to benchmark reasoning
FSDP: Fully Sharded Data Parallelism—a technique to train large models by distributing parameters, gradients, and optimizer states across GPUs
Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer