SOTA: State-of-the-Art; the current best-performing models or methods
LLM: Large Language Model; a deep learning model trained on vast amounts of text to generate human-like language
Jailbreak: A method to bypass an AI model's safety filters or ethical guidelines to generate prohibited content
Meta-evaluation: As used in this summary, the process of using one or more LLMs to evaluate the output quality of another LLM (the term more commonly refers to assessing the evaluation method itself)
Spearman correlation: A rank-based statistic (ρ), ranging from -1 to 1, measuring the strength and direction of the monotonic association between two variables
Cohen's Kappa: A statistic (κ) measuring agreement between two raters on categorical labels, corrected for the agreement expected by chance
GRUEN: A reference-less metric for evaluating the linguistic quality of generated text (Grammaticality, non-Redundancy, focUs, structure and coherENce)
Temperature: A decoding hyperparameter controlling the randomness of LLM predictions; higher values make output more diverse, lower values make it more deterministic
Top_p: Nucleus sampling; a decoding strategy that samples from the smallest set of most likely tokens whose cumulative probability reaches or exceeds p
Top_k: A decoding strategy that samples from the k most likely next tokens
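
The three decoding terms above (Temperature, Top_p, Top_k) can be illustrated with a minimal sketch. It is not taken from the summarized work: the function names, the four-token score list, and the choice to apply temperature by dividing raw scores before a softmax are assumptions made for illustration only.

```python
import math
import random


def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw next-token scores into probabilities.

    Assumption: temperature is applied by dividing the scores before the
    softmax, so values below 1 sharpen the distribution and values above 1
    flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalize their probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = order[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely token ids whose cumulative
    probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(distribution, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    token_ids = list(distribution)
    weights = [distribution[t] for t in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Hypothetical scores for a four-token vocabulary.
scores = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(scores, temperature=1.0)
next_token = sample(top_p_filter(probs, p=0.9))
```

In this sketch Top_k fixes the number of candidate tokens, while Top_p lets that number vary with how concentrated the distribution is; either filter can be combined with a temperature setting.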