_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
LMM: Large Multimodal Model—a model that processes, and in some cases generates, multiple modalities such as text and images
LLM: Large Language Model—models trained on vast text data to generate human-like text
OCR: Optical Character Recognition—capability to detect and read text within images
VL: Vision-Language—tasks or models involving both visual and textual information
LLM-based evaluator: Using a strong LLM (like GPT-4) to judge the quality of model outputs instead of fixed rule-based metrics
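A minimal sketch of the LLM-as-judge idea, assuming the OpenAI Python client; the rubric wording and 1-10 scale are illustrative placeholders, not the exact grading prompt used by any particular benchmark.

```python
# Minimal LLM-as-judge sketch (assumes `pip install openai` and OPENAI_API_KEY set).
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, reference: str, candidate: str) -> str:
    """Ask a strong LLM to rate a model's answer against a reference answer."""
    prompt = (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Assistant answer: {candidate}\n"
        "Rate the assistant's answer from 1 to 10 and briefly explain why."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content
```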
ViT: Vision Transformer—an image-encoding architecture that splits an image into fixed-size patches and processes the resulting patch sequence with a Transformer
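A minimal PyTorch sketch of the ViT input pipeline: the image is cut into non-overlapping patches, each patch is flattened and linearly projected, and the patch embeddings form the token sequence the Transformer layers consume. The sizes (224x224 image, 16x16 patches, 768-dim embeddings) follow common ViT-Base conventions and are assumptions for illustration.

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)            # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# Unfold the image into non-overlapping 16x16 patches: 14 * 14 = 196 patches.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)  # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)                    # (1, 196, 768)

# Linear projection of flattened patches -> token embeddings for the Transformer.
proj = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = proj(patches)                          # (1, 196, 768)
print(tokens.shape)
```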
CLIP: Contrastive Language-Image Pre-training—a model trained to match images with their corresponding text captions
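A minimal sketch of CLIP-style image-text matching, assuming the Hugging Face transformers library and the public "openai/clip-vit-base-patch32" checkpoint; the checkpoint name and image path are assumptions for illustration.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                      # hypothetical local image
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores; softmax gives the probability of each caption.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```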
Vicuna: An open-source chatbot trained by fine-tuning LLaMA on user-shared conversations
LLaMA: Large Language Model Meta AI—a foundational large language model released by Meta
One-shot: Providing the model with a single worked example of a task in the prompt to guide its response (sketch after the next entry)
Few-shot: Providing the model with a small number of worked examples of a task in the prompt to guide its response
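A minimal sketch of one-shot / few-shot prompting: the prompt simply prepends k worked examples before the query so the model can imitate the pattern. The task, examples, and formatting below are hypothetical.

```python
def build_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """k = 1 examples gives a one-shot prompt, k > 1 few-shot, k = 0 zero-shot."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

few_shot = build_prompt(
    [("Translate 'chat' to English.", "cat"),
     ("Translate 'chien' to English.", "dog")],
    "Translate 'oiseau' to English.",
)
print(few_shot)
```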