GRPO: Group Relative Policy Optimization, an RL algorithm that estimates advantages by normalizing each completion's reward against the other completions sampled for the same prompt, removing the need for a learned value function (critic)
Importance Sampling: A technique to estimate properties of a distribution (target) while sampling from a different distribution (proposal) by weighting samples by the ratio of their probabilities
Dense Prompt Packing: A system optimization that packs multiple short, pruned sequences into a single long buffer to maximize GPU compute utilization and avoid padding
Estimation Bias: The systematic error introduced when the expected value of a gradient estimator differs from the true gradient of the objective function
SXM: NVIDIA's socketed module form factor for data-center GPUs; SXM GPUs connect to each other over NVLink, giving higher inter-GPU bandwidth than PCIe cards
MoE: Mixture-of-Experts, a model architecture in which only a subset of sub-networks (experts) is activated for each input, so computation is sparse relative to the total parameter count
Pass@1: The percentage of problems where the model generates the correct answer in a single attempt
vLLM: A high-throughput library for LLM inference and serving
Forward-pass cost: The computational expense of generating text (rollouts) before the backpropagation (training) step
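
The GRPO, Importance Sampling, and Pass@1 entries above can be made concrete with a minimal Python sketch. The function names are hypothetical, and two details are assumptions rather than part of the definitions: the use of the population standard deviation and the small eps added to avoid division by zero. Implementations differ on both.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for ONE group of completions of the same prompt.

    Each reward is centered on the group mean and scaled by the group
    standard deviation (population std and eps are assumptions here).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def importance_weights(target_logprobs, proposal_logprobs):
    """Per-sample ratio p_target(x) / p_proposal(x), computed from log-probs."""
    return [math.exp(t - p) for t, p in zip(target_logprobs, proposal_logprobs)]


def pass_at_1(correct_flags):
    """Percentage of problems solved by a single attempt per problem."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


# Four completions for one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Samples drawn from a proposal distribution, reweighted toward the target.
weights = importance_weights([-1.0, -2.0], [-1.0, -1.5])

score = pass_at_1([True, False, True, True])
```

If every completion in a group receives the same reward, the centered rewards are all zero, so the group contributes no learning signal under this normalization.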