PPO: Proximal Policy Optimization—a reward-based RL method in which an explicit reward model is first learned from preference data, and the policy is then optimized against that reward model using clipped actor-critic updates.
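The core of PPO is its clipped surrogate objective, which limits how far each update can move the policy from the one that collected the data. A minimal per-sample sketch (function name and default `clip_eps=0.2` are illustrative, not from the source):

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    # Probability ratio between the current policy and the behavior policy.
    ratio = math.exp(logp_new - logp_old)
    # Clipped surrogate: take the pessimistic (minimum) of the two terms,
    # so the update gets no benefit from moving the ratio outside the clip range.
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    # Negated because optimizers minimize; PPO maximizes the surrogate.
    return -min(unclipped, clipped)
```

In practice this loss is averaged over a batch and combined with a value-function loss and an entropy bonus.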
DPO: Direct Preference Optimization—a reward-free method that optimizes the policy directly on preference data by deriving a closed-form solution for the optimal policy.
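DPO's closed-form derivation reduces alignment to a logistic loss on the policy's log-probability margins over chosen vs. rejected responses, measured relative to a frozen reference model. A minimal sketch of the per-pair loss (names and `beta=0.1` are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Implicit reward margin between the chosen and rejected responses,
    # each measured relative to the frozen reference policy.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: a logistic loss on the preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

No reward model and no RL rollout loop are needed; gradients flow directly through the policy's log-probabilities on the preference pairs.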
OOD: Out-of-Distribution—data samples (prompts or responses) that differ significantly from the training distribution.
SFT: Supervised Fine-Tuning—the initial phase of training an LLM on high-quality demonstration data before alignment.
EMA: Exponential Moving Average—a technique used here to update the reference model slowly, stabilizing training.
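The EMA reference update interpolates each reference parameter toward the current policy parameter with a decay rate close to 1, so the reference model trails the policy slowly. A minimal sketch over flat parameter lists (the `decay=0.999` value is an illustrative assumption):

```python
def ema_update(ref_params, policy_params, decay=0.999):
    # Move each reference-model parameter a small step toward the
    # corresponding policy parameter; decay near 1 means slow tracking.
    return [decay * r + (1.0 - decay) * p
            for r, p in zip(ref_params, policy_params)]
```

In a real training loop this runs in-place over the reference model's tensors after each optimizer step.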
advantage normalization: Rescaling the advantage estimates in PPO to have zero mean and unit variance, stabilizing the policy gradient updates.
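The zero-mean, unit-variance rescaling described above can be sketched over a batch of scalar advantage estimates (the small `eps` guards against division by zero and is a common convention, not specified in the source):

```python
def normalize_advantages(advantages, eps=1e-8):
    # Rescale a batch of advantage estimates to zero mean and unit variance,
    # which keeps the scale of policy-gradient updates consistent across batches.
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = var ** 0.5
    return [(a - mean) / (std + eps) for a in advantages]
```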
CodeContest: A challenging competitive programming dataset used for benchmarking code generation capabilities.
KL divergence: Kullback–Leibler divergence—an asymmetric measure of how one probability distribution diverges from another, used to penalize the aligned model for deviating too far from the base reference model.
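In RLHF-style training the KL regularizer is often estimated per token from the two models' log-probabilities on the sampled sequence. A minimal Monte Carlo sketch, assuming the tokens were sampled from the policy (function name is illustrative):

```python
def kl_penalty(policy_logps, ref_logps):
    # Monte Carlo estimate of KL(policy || reference) over one sampled
    # sequence: the mean of log pi(token) - log pi_ref(token) per token.
    diffs = [lp - lr for lp, lr in zip(policy_logps, ref_logps)]
    return sum(diffs) / len(diffs)
```

This estimate is zero when the two models agree on every sampled token, and it grows as the policy assigns its samples much higher probability than the reference does.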