Chain-of-Thought (CoT): A prompting or training method where the model generates intermediate reasoning steps before the final answer to improve performance on complex tasks
SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, task-specific labeled dataset
tIoU: Temporal Intersection over Union—a metric measuring the overlap between the predicted time segment and the ground truth time segment
sIoU: Spatial Intersection over Union—a metric measuring the overlap between the predicted bounding box and the ground truth bounding box
TVL: Temporal Video Localization—finding the start and end times of an event in a video
VC: Video Captioning—generating a natural language description of the video
SVG: Spatial Video Grounding—locating an object in a specific video frame using a bounding box
STVG: Spatio-Temporal Video Grounding—tracking an object across multiple frames in both space (bounding box) and time
SRR: Spatial Relationship Reference—identifying spatial relationships between objects (e.g., 'A is behind B')
TVR: Temporal Video Reference—describing events that happen within a specific time interval
MENTOR: A metric used in this paper to evaluate the quality and relevance of generated captions and textual descriptions
Curriculum Learning: A training strategy where the model is exposed to easier examples (shorter reasoning chains) before harder ones (longer, complex chains)
Qwen2.5-VL: A specific family of large vision-language models used as the base model in this paper
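
The tIoU and sIoU entries above both reduce to intersection divided by union. A minimal sketch follows, assuming segments are given as (start, end) pairs in seconds and boxes as (x1, y1, x2, y2) corner coordinates. The function names and input conventions are illustrative assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """tIoU of two time segments, each given as (start, end)."""
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def spatial_iou(pred, gt):
    """sIoU of two boxes, each given as (x1, y1, x2, y2) corners."""
    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))      # 2 s overlap over a 6 s union
    print(spatial_iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1 unit overlap over a 7 unit union
```

Both functions return 0.0 for disjoint or degenerate inputs, so a score can be computed even when a prediction misses the target entirely.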