_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
Hallucination: When a language model generates false information or answers not substantiated by evidence
Calibration: The alignment between a model's predicted confidence (e.g., 'I am 80% sure') and its actual accuracy (being correct 80% of the time)
Adversarial collection: Data collection method where annotators specifically create examples that cause a target model (here, GPT-4) to fail
F-score: In this paper, the harmonic mean of 'overall correct' percentage and 'correct given attempted' percentage
Frontier models: The most capable, state-of-the-art large language models currently available (e.g., GPT-4o, Claude 3.5 Sonnet)
Saturated benchmarks: Benchmarks where model performance has reached a ceiling (e.g., near human level), rendering them less useful for distinguishing between new, better models
AI trainers: Human annotators employed to create data and grade model outputs
Temperature 1: A sampling parameter for LLMs; setting it to 1 introduces randomness, allowing the model to generate different answers on repeated attempts
String match: Comparing two text strings character-by-character to see if they are identical