_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
LLM: Large Language Model—a neural network trained on large text corpora to generate and understand natural language
RLHF: Reinforcement Learning from Human Feedback—a training method that fits a reward model to human preference data and uses it to fine-tune an LLM's outputs
Reasoning Models: LLMs specialized in complex multi-step tasks (e.g., math, coding), often trained with process supervision (feedback on intermediate reasoning steps rather than only the final answer)
Reference-based reward: A reward signal derived by comparing a generated answer against a known gold-standard answer (reference)
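As a minimal illustration (not taken from the source), a reference-based reward can be sketched as an exact-match check against the gold answer. The `normalize` helper and the string-matching criterion are assumptions; real systems may use more elaborate answer extraction or verifier models:

```python
def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase, trim, and drop punctuation.
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())

def reference_based_reward(generated: str, reference: str) -> float:
    # Reward 1.0 if the generated answer matches the gold reference, else 0.0.
    return 1.0 if normalize(generated) == normalize(reference) else 0.0
```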
Pairwise preference: Traditional reward modeling where the model ranks two responses (A > B) rather than scoring absolute correctness
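For contrast, a pairwise-preference reward model is commonly trained with a Bradley-Terry style logistic loss on the score gap between the chosen and rejected responses. This sketch assumes the scalar scores have already been computed by the reward model:

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry / logistic loss: -log sigmoid(score_chosen - score_rejected).
    # Minimizing it pushes the model to score the preferred response higher.
    gap = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))
```

Note the loss only depends on the *relative* gap between the two scores, which is why pairwise reward models rank responses rather than score absolute correctness.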
LLM-as-a-judge: Using a powerful LLM to evaluate the quality or correctness of another model's output
Meta-annotator: An experienced human annotator who resolves disagreements between first-pass annotators to ensure ground-truth quality