unfamiliar inputs: Queries about concepts or entities that are absent from, or poorly represented in, the model's pre-training data.
unfamiliarity score: A metric quantifying how unknown a query is to the model, typically measured by the pre-trained model's few-shot performance or likelihood on that query.
conservative reward models: Reward models trained to avoid overestimating the quality of responses to unfamiliar queries, often by treating prediction on unfamiliar data as a distinct optimization target.
SFT: Supervised Fine-Tuning, the further training of a pre-trained model on a smaller, task-specific labeled dataset.
RL: Reinforcement Learning, a training paradigm in which an agent learns to make decisions from the rewards or penalties its actions receive.
PPO: Proximal Policy Optimization, a policy-gradient reinforcement learning algorithm that limits how far each update can move the policy; used to update the language model policy.
MMLU: Massive Multitask Language Understanding, a multiple-choice benchmark that tests knowledge across 57 subjects.
TriviaQA: A reading comprehension dataset containing question-answer pairs with evidence documents.
reward model hallucinations: Instances where the reward model assigns a high score to a factually incorrect response generated by the LLM.
intelligent blind guess: The response distribution that minimizes aggregate loss over a set of unfamiliar examples without relying on specific input features (e.g., always guessing 'C' or always saying 'I don't know').
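
As an illustration of the "intelligent blind guess" entry, the sketch below assumes the aggregate loss is average cross-entropy over a set of multiple-choice answers. Under that assumption, the input-independent distribution that minimizes the loss is simply the empirical label frequency. The function names and the example labels are hypothetical and are not taken from the source.

```python
import math
from collections import Counter


def blind_guess(labels):
    """Return the input-independent distribution that minimizes average
    cross-entropy over `labels` (assumed loss): the empirical frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def avg_cross_entropy(dist, labels):
    """Average negative log-probability that `dist` assigns to each label."""
    return -sum(math.log(dist[y]) for y in labels) / len(labels)


# Hypothetical correct answers for a batch of unfamiliar multiple-choice queries.
labels = ["C", "C", "A", "B", "C", "D", "C", "B"]

guess = blind_guess(labels)
uniform = {option: 0.25 for option in "ABCD"}

# The blind guess ignores the question text entirely, yet it never does
# worse than a uniform guess on the same set under this loss.
print(guess)
print(avg_cross_entropy(guess, labels), avg_cross_entropy(uniform, labels))
```

If the loss were 0-1 error instead, the analogous blind guess would be to always output the most frequent label, which matches the "always guessing 'C'" example in the definition.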