GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing rollouts within a group from the same prompt, avoiding a separate critic model
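DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization, a GRPO variant that improves training stability by decoupling the lower and upper clipping bounds and by filtering out prompts whose rollouts all receive the same reward
GAPO: Group Adaptive Policy Optimization, the proposed method, which estimates advantages with an adaptive statistic (the median of the densest interval of rollout rewards; see HDI)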
HDI: Highest-Density Interval—the narrowest interval containing a specified probability mass (e.g., 50% of points) in a distribution
SNR: Signal-to-Noise Ratio, used here to describe how strong the reward signal is relative to the variance across rollouts
Exact Match: A metric that checks whether the generated code is identical to the ground truth, reported as the fraction of examples that match exactly
Rollout: A complete sequence generated by the model (policy) given a prompt during RL training
Advantage: In RL, a value measuring how much better a specific action is compared to the average action in that state
Clipping ratio: The fraction of tokens (or samples) in a training batch whose policy probability ratio is clipped (limited to a fixed range) to prevent the model from changing too drastically in one step
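
To make the definitions of Advantage, GRPO, and HDI concrete, the sketch below computes a group-relative advantage for one prompt's rollouts and finds a 50% highest-density interval of their rewards. The function names are hypothetical. The last function, which centers rewards on the median of the HDI, is only an assumption about how a GAPO-style baseline could look based on the glossary wording; it is not taken from the method's actual formulation.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center by the group mean, scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def highest_density_interval(values, mass=0.5):
    """Return the narrowest run of sorted sample points holding `mass` of them."""
    xs = sorted(values)
    k = max(1, math.ceil(mass * len(xs)))  # number of points the window must hold
    # Slide a window of k consecutive points and keep the narrowest one.
    start = min(range(len(xs) - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start:start + k]


def hdi_median_advantages(rewards, mass=0.5):
    """Assumed GAPO-like variant: center rewards on the median of the HDI."""
    baseline = statistics.median(highest_density_interval(rewards, mass))
    return [r - baseline for r in rewards]


# Rewards for eight rollouts of one prompt, with a few large outliers.
rewards = [0, 5, 5, 5, 6, 20, 30, 40]
mean_based = group_relative_advantages(rewards)
hdi_based = hdi_median_advantages(rewards)
```

With outlier-heavy rewards like these, the group mean is pulled toward the outliers, while the HDI median stays near the dense cluster of typical rewards. That is the intuition behind using a robust baseline when the reward distribution is skewed.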