Data Flywheel: A self-reinforcing loop where a model generates data that trains a better model, which then generates even better data
Elo rankings: An ordering of models by Elo rating, a score computed from win/loss results in head-to-head battles that quantifies relative skill; a larger rating gap implies a higher expected win rate for the stronger model
SFT: Supervised Fine-Tuning, training a model on prompt-response pairs so that it imitates high-quality reference answers
DPO: Direct Preference Optimization, an algorithm that aligns a model to pairwise preferences (response A preferred over response B) without training a separate reward model
PPO: Proximal Policy Optimization, a reinforcement learning algorithm that updates a policy with a clipped objective; in this setting the reward signal comes from a reward model
WizardArena: The paper's proposed offline test set and evaluation pipeline that uses an AI judge to predict Elo rankings
Judge Model: A powerful LLM (here, Llama-3-70B-Chat) used to evaluate responses and declare a winner, simulating human judgment
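
To make the Elo entry above concrete, here is a minimal sketch of the classic sequential Elo update applied to judged battles. It assumes the textbook formulation (logistic expected score on a 400-point scale, K-factor of 32, starting rating of 1000); the paper's own pipeline may compute ratings differently, and the model names, constants, and function names below are hypothetical.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def rank_models(battles, initial=1000.0, k=32.0):
    """Replay (model_a, model_b, score_a) battles and sort models by rating."""
    ratings = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judge verdicts: (model_a, model_b, score for model_a).
    battles = [
        ("model_x", "model_y", 1.0),
        ("model_y", "model_z", 0.5),
        ("model_x", "model_z", 1.0),
    ]
    for name, rating in rank_models(battles):
        print(f"{name}: {rating:.1f}")
```

Each tuple in `battles` stands in for one verdict from the judge model. Because this update is sequential, the final ratings depend on battle order; shuffling and averaging over several replays is one common way to reduce that sensitivity.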