DPO: Direct Preference Optimization—a method to align models to preferences by optimizing the likelihood of chosen responses over rejected ones without a separate reward model
Self-Rewarding: A paradigm where the model acts as both the generator of responses and the judge (evaluator) to create its own training data
SFT: Supervised Fine-Tuning—the initial training phase where a model learns to follow instructions from labeled examples
LLM-as-a-Judge: Using a Large Language Model to evaluate and score the quality of text, often replacing human annotation
Gradient Collapse: A phenomenon where the training signal (gradient) approaches zero because the model assigns similar likelihoods to the chosen and rejected samples, leaving no preference margin to learn from
Anchored Rejection: The strategy of drawing negative samples from a fixed initial model throughout training, so that the 'rejected' baseline does not improve alongside the policy and erode the preference margin
Future-Guided Chosen: The strategy of generating positive samples with a temporary model trained one step ahead, which provides a stronger target for the current model
AlpacaEval: A benchmark for evaluating instruction-following models using an LLM-based automatic evaluator
Arena-Hard: A challenging benchmark derived from Chatbot Arena data to evaluate models on complex queries
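
The DPO and gradient-collapse entries above can be made concrete with a small sketch. The snippet below is an illustrative, simplified per-pair DPO loss (names like `dpo_loss` are ours, not from any particular codebase): the loss depends only on the log-probability margin between chosen and rejected responses relative to a reference model, so when the policy rates the two responses as equally likely, the margin vanishes and the loss plateaus at log 2, illustrating the stalled learning signal described under Gradient Collapse.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Illustrative scalar version; real implementations operate on
    batched sequence log-probabilities.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log sigmoid(logits); fine numerically for moderate logits.
    return math.log1p(math.exp(-logits))

# A clear preference margin yields a loss below log 2 ...
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
# ... while chosen ~ rejected gives logits ~ 0 and loss ~ log 2,
# where the preference signal (and hence the gradient) has collapsed.
print(dpo_loss(-3.0, -3.0, -2.0, -2.0))
```

Note that collapse here shows up as the loss flattening near log 2 ≈ 0.693: when chosen and rejected samples look alike to the model, each pair contributes almost no usable training signal, which is exactly what anchored rejection and future-guided chosen sampling are meant to prevent.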