KL-regularization: A technique that penalizes the policy for diverging from a reference policy (usually the pre-trained model) using Kullback-Leibler divergence.
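As a minimal numerical sketch of this penalty (distribution values, the reward vector, and the coefficient `beta` are illustrative, not from the source), the KL-regularized objective trades expected reward against divergence from the reference:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_regularized_objective(reward, policy, reference, beta=0.1):
    """Expected reward under `policy` minus beta times the KL penalty toward `reference`."""
    expected_reward = sum(pi * r for pi, r in zip(policy, reward))
    return expected_reward - beta * kl_divergence(policy, reference)

# Three candidate responses; the policy shifts mass toward high reward but pays a KL cost.
reward = [1.0, 0.2, 0.0]
reference = [1/3, 1/3, 1/3]
policy = [0.7, 0.2, 0.1]
print(kl_regularized_objective(reward, policy, reference))  # ≈ 0.7103
```

Raising `beta` pulls the optimizer back toward the reference; `beta = 0` recovers pure reward maximization.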
RLHF: Reinforcement Learning from Human Feedback—a method to align language models with human intent using preference or rating data.
Contextual Bandits: A simplified RL setting where the agent observes a context, takes an action, and receives a reward, but actions do not affect future contexts (single-step RL).
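The single-step structure can be sketched as a loop in which each round is independent (context names, the policy, and the reward rule below are hypothetical, for illustration only):

```python
import random

# Minimal contextual-bandit loop: the chosen action never influences the next context.
rng = random.Random(0)
contexts = ["short_prompt", "long_prompt"]

def reward_fn(context, action):
    # Hypothetical reward: "verbose" suits long prompts, "terse" suits short ones.
    return 1.0 if (context == "long_prompt") == (action == "verbose") else 0.0

def policy(context):
    return "verbose" if context == "long_prompt" else "terse"

total = 0.0
for _ in range(100):
    context = rng.choice(contexts)       # context drawn fresh each round
    action = policy(context)             # one action, e.g., one full LLM response
    total += reward_fn(context, action)  # reward observed; the "episode" ends here
print(total / 100)  # this policy matches the reward rule, so average reward is 1.0
```

This is why RLHF on single responses is often modeled as a contextual bandit rather than a multi-step MDP: the prompt plays the role of the context and the whole response is one action.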
Sample Complexity: The number of samples (interactions) required to learn a near-optimal policy, i.e., one within a target error margin ε of the optimal policy.
Covering Number: A measure of the complexity of a function class (here, the reward function class): the minimum number of ε-balls needed to cover the space. A larger covering number means a richer class that requires more samples to learn.
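A toy one-dimensional illustration (an interval, not the reward-function class from the summarized work): covering [0, L] with balls of radius ε, i.e., subintervals of length 2ε, takes ⌈L / 2ε⌉ balls, so the covering number grows as ε shrinks:

```python
import math

def interval_covering_number(length, eps):
    """Minimum number of eps-balls (subintervals of length 2*eps) covering [0, length]."""
    return math.ceil(length / (2 * eps))

print(interval_covering_number(1.0, 0.1))  # 5 balls of radius 0.1 cover [0, 1]
```

For function classes the same idea applies with a distance between functions (e.g., a sup-norm) in place of the distance on the line.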
Reference Policy: The initial policy (e.g., a supervised fine-tuned LLM) used as the anchor for regularization, keeping the learned policy from drifting too far and over-optimizing the reward.
Coverage Coefficient: A metric quantifying how well the reference policy covers the state-action space relative to an optimal policy; good coverage (a small coefficient) means the reference policy assigns non-negligible probability to the actions an optimal policy would take.
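One common formalization is the worst-case density ratio between the optimal policy and the reference (the exact definition in the summarized work may differ; the distributions below are illustrative):

```python
def coverage_coefficient(pi_star, pi_ref):
    """Worst-case ratio pi_star(a) / pi_ref(a) over actions with pi_star(a) > 0.
    Infinite when the reference assigns zero probability to an action pi_star uses."""
    ratios = []
    for p_star, p_ref in zip(pi_star, pi_ref):
        if p_star == 0:
            continue
        if p_ref == 0:
            return float("inf")
        ratios.append(p_star / p_ref)
    return max(ratios)

# A broad reference covers the optimal policy well (small coefficient) ...
print(coverage_coefficient([0.9, 0.1, 0.0], [0.4, 0.3, 0.3]))  # 0.9 / 0.4 = 2.25
# ... while a reference that ignores the optimal action has infinite (no) coverage.
print(coverage_coefficient([0.9, 0.1, 0.0], [0.0, 0.5, 0.5]))  # inf
```

Sample-complexity bounds of this kind typically scale with the coverage coefficient, which is why a well-covering reference policy makes learning easier.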