_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
DPO: Direct Preference Optimization—an algorithm that aligns language models with human preferences by optimizing the policy directly on preference pairs, without training an explicit reward model or running a reinforcement learning loop
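The per-example DPO loss can be sketched with scalar log-probabilities. This is a minimal illustration, not the source's implementation; the function name, the inputs (summed token log-probs of the winning/losing responses under the policy and a frozen reference model), and the beta value are illustrative:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy prefers the winner more strongly than the reference does,
# so the margin is positive and the loss drops below log(2).
loss = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
```

When the policy matches the reference exactly, the margin is zero and the loss sits at log(2); training pushes it lower by widening the winner-vs-loser gap relative to the reference.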
RLHF: Reinforcement Learning from Human Feedback—a technique to align models using human preference data, typically involving a reward model and PPO
PPO: Proximal Policy Optimization—a standard reinforcement learning algorithm used in RLHF to update the model policy
Importance Sampling: A statistical technique used to estimate properties of a target distribution using samples from a different distribution by reweighting them
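A minimal self-normalized importance-sampling sketch: estimate the mean of a target Gaussian using samples drawn from a different (proposal) Gaussian, reweighted by the density ratio. The distributions and sample count are illustrative choices, not from the source:

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Target p = N(2, 1); proposal q = N(0, 2). True E_p[x] = 2.
samples = [random.gauss(0.0, 2.0) for _ in range(200_000)]
weights = [normal_pdf(x, 2.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in samples]

# Self-normalized estimate: sum(w * x) / sum(w) -> approximately 2.
estimate = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
```

The wider proposal covers the target's support, which keeps the weights bounded; a proposal with thinner tails than the target would make the estimate high-variance or biased.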
Contrastive LLMs: A pair of language models where one is biased towards generating preferred (winning) responses and the other towards non-preferred (losing) responses
Forward/Backward DPO: A method to create contrastive models: Forward trains on preferences in their original order (winning response y_w preferred over losing response y_l), while Backward trains on swapped preferences (y_l > y_w) to create a deliberately 'bad' model
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
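For discrete distributions, KL divergence reduces to a short sum; a minimal sketch with made-up example distributions:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
d = kl(p, q)  # positive whenever p != q; zero iff p == q
```

Note the asymmetry: kl(p, q) generally differs from kl(q, p), which is why the reference distribution matters in a KL-regularized objective.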
Bradley-Terry model: A statistical model that predicts the probability of one item being preferred over another based on their latent reward scores
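The Bradley-Terry preference probability can be written as a sigmoid of the reward difference; a minimal sketch with illustrative reward values:

```python
import math

def bt_prob(reward_i, reward_j):
    """Bradley-Terry: P(i preferred over j) = sigmoid(r_i - r_j)."""
    return 1.0 / (1.0 + math.exp(-(reward_i - reward_j)))

p_win = bt_prob(1.5, 0.5)  # higher-reward item wins with probability sigmoid(1)
```

Only the reward *difference* matters, so rewards are identifiable up to an additive constant; equal rewards give a 50/50 preference.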
Partition function: A normalization factor in probability distributions, often denoted as Z(x)
IPO: Identity Preference Optimization—a DPO variant that replaces DPO's log-sigmoid loss with a squared (regression) loss, acting as regularization against overfitting to the preference data
KTO: Kahneman-Tversky Optimization—a preference optimization method based on prospect theory that learns from unpaired binary (desirable/undesirable) labels on individual outputs rather than paired preferences
ROUGE: Recall-Oriented Understudy for Gisting Evaluation—a set of metrics used to evaluate automatic summarization and machine translation