_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
DPO: Direct Preference Optimization—a method that aligns language models to preferences by optimizing directly on preference pairs, without training a separate reward model or running an RL loop
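As a minimal sketch of the per-pair DPO loss (function and variable names here are illustrative, not from the source): the loss is the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability margins on a chosen/rejected pair.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).

    beta controls how strongly the policy may deviate from the reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically plain sigmoid; fine for illustration-scale margins.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns the chosen response a larger log-probability gain over the reference than the rejected one, the margin is positive and the loss falls toward zero; a zero margin gives the uninformative loss log 2.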
RLHF: Reinforcement Learning from Human Feedback—a framework involving training a reward model and then optimizing a policy using RL (e.g., PPO)
PPO: Proximal Policy Optimization—a standard RL algorithm used in RLHF; it stabilizes training with a clipped surrogate objective that bounds how far each update can move the policy
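The clipped surrogate objective mentioned above can be sketched for a single sample (names are illustrative): the probability ratio between the new and old policy is clipped to [1-eps, 1+eps], and the pessimistic minimum of the clipped and unclipped terms is taken.

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio: pi_new(a|s) / pi_old(a|s)
    advantage: estimated advantage A(s, a)
    Returns min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Taking the minimum removes any incentive to push the ratio beyond the clip range: a large ratio cannot inflate the objective for positive advantages, and a small ratio cannot dodge the penalty for negative ones.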
Lagrange multiplier: A variable (λ) used in constrained optimization to weigh the constraint violation (cost) against the objective (reward)
Primal-Dual: An optimization approach that jointly updates the policy (primal variable, ascending on the Lagrangian) and the Lagrange multiplier (dual variable, ascending on the constraint violation)
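To make the primal-dual idea concrete, here is a toy sketch on a scalar problem (the problem and all names are illustrative assumptions, not from the source): maximize r(x) = -(x-2)^2 subject to the cost constraint x - 1 <= 0. The primal step ascends the Lagrangian L = r(x) - lam * c(x); the dual step ascends lam by the constraint violation, projected to lam >= 0.

```python
def primal_dual(steps=5000, lr_x=0.01, lr_lam=0.01):
    """Toy primal-dual loop: maximize -(x-2)^2 subject to x - 1 <= 0.

    The constrained optimum is x = 1, with multiplier lam = 2
    (where the reward gradient -2(x-2) balances lam).
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal ascent on L(x, lam) = -(x-2)^2 - lam * (x - 1).
        grad_x = -2.0 * (x - 2.0) - lam
        x += lr_x * grad_x
        # Dual ascent on lam by the violation c(x) = x - 1, kept nonnegative.
        lam = max(0.0, lam + lr_lam * (x - 1.0))
    return x, lam
```

The same pattern underlies PPO-Lagrangian-style alignment: the policy plays the role of x, the safety cost model defines c, and lam automatically grows when the constraint is violated and shrinks toward zero when it is slack.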
SFT: Supervised Fine-Tuning—the initial training phase in which an LLM (Large Language Model) is fine-tuned on high-quality instruction data
Bradley-Terry model: A statistical model predicting the probability that one item is preferred over another based on their latent scores
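The Bradley-Terry preference probability can be sketched in one line (names are illustrative): with latent scores s_a and s_b, the probability that a is preferred over b is the sigmoid of the score difference.

```python
import math

def bt_prob(score_a, score_b):
    """Bradley-Terry: P(a preferred over b) = sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))
```

Equal scores give a 50/50 preference, and the probability approaches 1 as the score gap grows; this is the parameterization reward models (and, implicitly, DPO) fit to human preference pairs.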
Safe RLHF: A baseline framework that trains distinct reward and cost models and aligns LLMs with PPO-Lagrangian (PPO whose objective subtracts a Lagrange-multiplier-weighted cost term)
C-DPO: Constrained DPO—a baseline method that modifies DPO for safety constraints, often by reordering preferences