_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
RLVR: Reinforcement Learning from Verifiable Rewards—an RL approach where rewards are based on objective, checkable criteria (like correct formatting or correct final answer) rather than a learned reward model
GRPO: Group Relative Policy Optimization—an RL algorithm that updates the policy based on the relative performance of a group of outputs generated for the same input, reducing gradient variance
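The group-relative update at the heart of GRPO can be illustrated with a minimal sketch: rewards for a group of rollouts on the same prompt are normalized against the group's own mean and standard deviation to form advantages. This is an illustrative simplification (function name and epsilon are mine, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Turn a group of per-rollout rewards (same prompt) into advantages
    by normalizing against the group mean and std, so no separate value
    critic is needed. `eps` guards against zero std when all rewards tie."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts scoring above the group mean get positive advantages (their tokens are reinforced); below-mean rollouts get negative ones.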
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
RODS: Reasoning-Oriented Data Strategy—the paper's method of combining curated QA data with synthetic data generated from knowledge graphs to improve reasoning coverage
Knowledge Graph: A structured representation of knowledge where entities (nodes) are connected by relationships (edges), used here to generate synthetic medical questions
Distillation: Training a smaller student model to mimic the behavior or outputs of a larger, more capable teacher model
Pass@k: A metric measuring the probability that at least one of k independently sampled solutions is correct (not a ranking of "top" outputs—all k samples count equally)
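Pass@k is usually computed with the unbiased estimator from the HumanEval evaluation setup: generate n samples per problem, count the c correct ones, and estimate the chance that a random subset of k contains at least one correct sample. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations with c correct,
    is correct. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 generations of which 5 are correct, pass@1 is 0.5 while pass@10 is 1.0.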
SFT: Supervised Fine-Tuning—training a model on labeled examples to adapt it to a specific task
RLHF: Reinforcement Learning from Human Feedback—optimizing a model using rewards derived from human preferences
PPO: Proximal Policy Optimization—a standard RL algorithm; GRPO is a variant of this that avoids using a separate value function critic
Hard-sample mining: A strategy of identifying and prioritizing training examples where the model frequently fails
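One common way to operationalize hard-sample mining in an RLVR loop is to keep only questions whose empirical pass rate over n rollouts falls in a band: neither always solved (no learning signal) nor never solved (possibly unlearnable). A hypothetical sketch—the thresholds and function name are assumptions, not the paper's:

```python
def mine_hard_samples(questions, solve_counts, n_rollouts, lo=0.0, hi=0.5):
    """Keep questions the model solves rarely but not never.
    `solve_counts[i]` = number of correct rollouts (out of n_rollouts)
    for questions[i]; a pass rate in (lo, hi] marks it as 'hard'."""
    hard = []
    for q, c in zip(questions, solve_counts):
        rate = c / n_rollouts
        if lo < rate <= hi:
            hard.append(q)
    return hard
```

Under GRPO this band also matters mechanically: groups where every rollout succeeds (or every one fails) have zero reward variance and contribute no gradient.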
Rollouts: Complete trajectories or sequences generated by the model during the RL exploration phase