CoT: Chain-of-Thought—a prompting technique where the model is asked to generate step-by-step reasoning before providing the final answer
Zero-shot: Evaluating a model on a task without including any worked examples of that task in the prompt
MedQA: A benchmark dataset consisting of multiple-choice questions from medical licensing exams (US, China, Taiwan)
USMLE: United States Medical Licensing Examination—a three-step examination for medical licensure in the U.S.
VQA-RAD: Visual Question Answering in Radiology—a dataset of clinical questions about radiology images
MedXpertQA: A challenging benchmark for expert-level medical knowledge and reasoning, containing both text-only and multimodal subsets
MMLU: Massive Multitask Language Understanding—a broad benchmark covering 57 subjects; this paper uses the Medical subset
Hallucination: The generation by an AI model of plausible-sounding but factually incorrect information
Multimodal: Involving multiple types of data inputs, such as text and images simultaneously