RLVR: Reinforcement Learning from Verifiable Rewards, i.e., training models with RL where the reward comes from an objective, programmatic check (e.g., running unit tests or comparing a final math answer against the reference)
Rubric R: A set of K distinct critic dimensions, each with a criterion description, score tiers, and weight, used to evaluate model outputs
Reward Hacking: A failure mode in which a model exploits loopholes in the reward function to earn high scores without actually solving the task (e.g., by giving sycophantic responses)
Seesaw Effect: The phenomenon in joint training where improving performance on one task type (e.g., creativity) degrades performance on another (e.g., instruction following)
Qwen3-30B-A3B: The base Large Language Model (LLM) used in this paper; a Mixture-of-Experts model from the Qwen3 series with roughly 30B total parameters, of which roughly 3B are active per token
DeepSeek-V3: A large-scale Mixture-of-Experts model (an architecture that routes each token through only a subset of its expert sub-networks) used as a strong baseline for comparison
MMLU: Massive Multitask Language Understanding, a multiple-choice benchmark measuring knowledge and problem solving across 57 subjects
AIME: American Invitational Mathematics Examination, a challenging high-school math competition whose problems are used as a benchmark for mathematical reasoning
IFEval: Instruction Following Evaluation, a benchmark measuring how well models follow instructions with automatically verifiable constraints (e.g., a minimum word count or a required output format)