VLLM: Vision-Language Large Model—a model capable of processing and generating both image and text data
VIT: Visual Instruction Tuning—the process of fine-tuning VLLMs on instruction-following data to align them with user intent
SFT: Supervised Fine-Tuning—training a model on labeled examples (here, image-text pairs)
Conditional Affirmation Shift: The log-ratio of the probability that the model outputs 'Yes' (i.e., judges the answer valid) when the question is present to the same probability when the question is absent
Conditional Rejection Shift: The log-ratio of the probability that the model outputs 'No' (i.e., judges the answer invalid) when the question is present to the same probability when the question is absent
Linguistic Shortcut: The behavior in which a model answers a question from text patterns or priors (e.g., 'Is the sky blue?' -> 'Yes') rather than from visual evidence
Semantic Conflict: A mismatch between the image, question, and answer (e.g., hallucinations or irrelevant responses)
COINCIDE: A clustering-based data selection method that groups samples based on joint representations from multiple layers
XMAS: A data selection method that clusters samples based on cross-modal attention trajectories
Zero-shot evaluator: A pre-trained model used to judge sample quality without any training specific to the evaluation task
Vision-Flan: A widely used visual instruction tuning dataset containing diverse tasks
The Cauldron: A highly heterogeneous visual instruction tuning dataset compiled from various sources
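
The two shift definitions above (Conditional Affirmation Shift and Conditional Rejection Shift) share the same log-ratio form, which can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `conditional_shift`, the use of the natural logarithm, and the example probabilities are all assumptions introduced here, and obtaining the 'Yes'/'No' probabilities from an actual model is left out.

```python
import math


def conditional_shift(p_with_question: float, p_without_question: float) -> float:
    """Log-ratio of an answer token's probability with vs. without the question.

    Hypothetical helper: the natural log is an assumption; the source only
    says "logarithmic ratio".
    """
    if p_with_question <= 0 or p_without_question <= 0:
        raise ValueError("probabilities must be strictly positive")
    return math.log(p_with_question / p_without_question)


# Made-up probabilities, for illustration only.
# Affirmation shift: P('Yes') with the question present vs. absent.
affirmation_shift = conditional_shift(0.8, 0.2)
# Rejection shift: P('No') with the question present vs. absent.
rejection_shift = conditional_shift(0.1, 0.4)
```

A positive value means that including the question makes the corresponding output more likely; a negative value means it makes that output less likely; zero means the question has no effect on that probability.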