_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
_example: {'RAG': 'Retrieval-Augmented Generation—AI systems that answer questions by first searching for relevant documents', 'F1 score': 'A metric balancing precision (are answers correct?) and recall (are answers complete?)', 'PPO': 'Proximal Policy Optimization—a reinforcement learning algorithm that updates a policy in small, stable steps using a clipped objective', 'parameter sharing': 'Multiple agents use the same underlying model weights, reducing memory and enabling coordination', 'warm start': 'Pre-training each module on labeled examples before switching to reinforcement learning, so agents start from a competent baseline'}
RSP: Responsible Scaling Policy—a framework where AI labs commit to specific safety measures when their models reach certain capability thresholds
MLM: Masked Language Modeling—a training objective where the model predicts masked tokens in a sequence
Triton: A language and compiler for writing highly efficient custom GPU kernels
PyTorch: A popular open-source machine learning library for Python
scaffolding: The software infrastructure wrapping an LLM that allows it to execute code, read files, and interact with an environment
desiderata: Desired properties or criteria that the evaluation suite aims to satisfy (e.g., low floor, high ceiling)
baselining: Establishing a standard of performance (usually human expert performance) against which new models can be compared
continuous metric: A scoring system that rewards incremental improvements (e.g., 10% faster code) rather than just pass/fail
QA: Question Answering—a task where the model answers questions based on context
GPT-2: Generative Pre-trained Transformer 2—an earlier generation language model used here as a target for finetuning tasks
zombie processes: Processes that have finished executing but still have an entry in the process table because their parent process has not yet collected their exit status; they can waste system resources if they accumulate