_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
CITA: Contrastive Instruction-Tuned Alignment—the proposed training method that conditions preference optimization on alignment instructions.
DPO: Direct Preference Optimization—a method that aligns a language model to preference data by directly optimizing a contrastive loss over chosen/rejected response pairs, without training a separate reward model.
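A minimal sketch of the DPO loss for a single preference pair, using plain Python (function and argument names are illustrative, not from the source): the loss is the negative log-sigmoid of a scaled margin between the policy's and the reference model's log-probability gaps on the chosen (w) versus rejected (l) response.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).

    The margin compares how much the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference (zero margin), the loss is log 2; it falls as the policy assigns more probability to the chosen response.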
PPO: Proximal Policy Optimization—a reinforcement-learning algorithm that stabilizes policy-gradient updates by clipping the probability ratio between the new and old policies (the clipped surrogate objective); the standard optimizer in RLHF pipelines.
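The clipped surrogate objective can be sketched per-sample as follows (a simplified scalar illustration; names are assumptions, not the source's code):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage A.
    Returns min(ratio * A, clip(ratio, 1-eps, 1+eps) * A), which caps
    how much a single update can move the policy.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum makes the objective pessimistic: large ratio moves stop earning extra reward for positive advantages and are penalized fully for negative ones.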
GRPO: Group Relative Policy Optimization—an RL method that scores each sampled response relative to the mean reward of its sampled group, removing the need for a learned value function.
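The group-relative reward idea reduces to normalizing rewards within a group of responses sampled for the same prompt; a minimal sketch (helper name is illustrative):

```python
def group_relative_advantages(rewards):
    """Advantage of each response relative to its sampled group:
    A_i = (r_i - mean) / std, with std guarded against zero."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = max(var ** 0.5, 1e-8)  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]
```

Responses above the group mean get positive advantages and are reinforced; those below are suppressed, all without training a critic.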
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution diverges from a second, reference distribution; in alignment training it is commonly used as a penalty keeping the fine-tuned policy close to the reference model.
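For discrete distributions, KL(P ‖ Q) = Σ p(x) log(p(x)/q(x)); a minimal sketch:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions given as probability lists.

    Terms with p(x) = 0 contribute zero by convention (0 * log 0 = 0).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Note the asymmetry: KL(P ‖ Q) generally differs from KL(Q ‖ P), and it is zero exactly when the two distributions coincide.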
SFT: Supervised Fine-Tuning—training a model on a dataset of high-quality instruction-response pairs.
ECLIPTICA: The proposed benchmark containing 3,000 test cases where the prompt is held fixed and alignment instructions vary.
AQI: Alignment Quality Index—a metric measuring the intrinsic alignment signal of a model.
RLHF: Reinforcement Learning from Human Feedback—a standard pipeline for aligning LLMs using human preferences.