GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of outputs for the same prompt to reduce variance, avoiding the need for a separate value network.
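The group-relative normalization can be sketched in a few lines. This is an illustrative helper (the function name and the choice of population standard deviation are assumptions, not taken from the glossary): each sampled output's advantage is its reward standardized against the other outputs for the same prompt, so no learned value network is required.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one prompt's group of sampled outputs.

    Advantage = (reward - group mean) / group std. A degenerate group
    (all rewards equal) yields zero advantages for every sample.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```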
SFT: Supervised Fine-Tuning—training a model on labeled examples (prompt-response pairs) to instill desired behaviors before RL.
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer.
pass@K: A metric measuring the probability that at least one of K generated solutions is correct.
temperature-adjusted entropy: A metric used to monitor the randomness of the model's policy during RL, calculated as entropy divided by the sampling temperature.
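A minimal sketch of this calculation, under the assumption that the entropy is Shannon entropy of the temperature-scaled sampling distribution (the glossary only says "entropy divided by the sampling temperature", so that detail is an assumption):

```python
import math

def temperature_adjusted_entropy(logits, temperature):
    """Entropy of the temperature-scaled softmax over `logits`,
    divided by the sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / temperature
```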
overlong filtering: A strategy during RL where samples that fail to produce a final answer within the token budget are masked out (ignored) rather than penalized.
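The masking step might look like the following sketch, where the sample fields (`num_tokens`, `answer`, `loss_mask`) are hypothetical names chosen for illustration: truncated samples with no final answer get a zero loss mask instead of a negative reward.

```python
def filter_overlong(samples, token_budget):
    """Mask out samples that hit the token budget without producing
    a final answer, so they are ignored rather than penalized."""
    for s in samples:
        truncated_without_answer = (
            s["num_tokens"] >= token_budget and s["answer"] is None
        )
        # 0.0 excludes the sample from the RL loss; 1.0 keeps it.
        s["loss_mask"] = 0.0 if truncated_without_answer else 1.0
    return samples
```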
exposure bias: The discrepancy between training (where the model conditions on ground-truth tokens) and inference (where it conditions on its own predictions); the mismatch compounds as generated sequences grow longer.
on-policy: RL training where the data used for updates is generated by the current version of the policy being optimized.
rope_theta: A parameter in RoPE (Rotary Positional Embeddings) that controls the wavelength of position encodings; increasing it allows models to handle longer context windows.
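The wavelength effect can be made concrete with a short sketch (the helper name is an assumption; the formula is the standard RoPE parameterization, where frequency i uses theta_i = rope_theta^(-2i/d) and hence wavelength 2*pi*rope_theta^(2i/d)):

```python
import math

def rope_wavelengths(head_dim, rope_theta):
    """Wavelengths (in token positions) of each RoPE frequency pair.

    Raising rope_theta stretches the slowest-rotating dimensions,
    which is why it helps models handle longer context windows.
    """
    return [
        2 * math.pi * rope_theta ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]
```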
DeepSeek-R1: A frontier reasoning model used here as a teacher to generate synthetic SFT data.
distillation: Training a smaller student model to mimic the outputs of a larger teacher model.