VLM: Vision-Language Model, a model that takes images and text as input and generates text conditioned on both.
RM: Reward Model, a model trained to predict human preference scores for generated outputs.
MPO: Mixed Preference Optimization, a training method introduced with the InternVL models that optimizes on preference pairs using a combination of a preference loss, a quality loss, and a generation loss.
RLHF: Reinforcement Learning from Human Feedback, a technique that fine-tunes a model with reinforcement learning against a reward signal derived from human preference judgments.
DPO: Direct Preference Optimization, an algorithm that aligns language models by training directly on preference pairs, without an explicit reward model.
ViT: Vision Transformer, a model architecture that applies the Transformer mechanism directly to sequences of image patches.
Qwen2.5-VL: An open-source Vision-Language Model family developed by the Qwen team at Alibaba Cloud.
InternVL: A series of open-source Vision-Language Models.
Deepseek R1: A large language model from DeepSeek, trained with reinforcement learning to produce step-by-step reasoning.