RLVR: Reinforcement Learning with Verifiable Rewards, i.e., RL in which the reward comes from an objectively checkable correctness signal (e.g., the final answer to a math problem)
Signal-to-Noise Ratio (SNR): Defined in this paper as the ratio of accurate responses to hallucinated responses, counted only over instances where the model provides an answer (refusals are excluded)
Behavioral Calibration: A framework in which a model dynamically adjusts its refusal behavior based on a risk threshold t, answering only if its confidence p satisfies p >= t and refusing otherwise
Proper Scoring Rule: A scoring function whose expected reward is maximized by reporting the true probability; when it is maximized if and only if the predicted probability matches the true probability, the rule is called strictly proper
Brier Score: A proper scoring rule equal to the mean squared difference between predicted probabilities and actual binary outcomes (0 or 1); as a loss, lower is better
PPO: Proximal Policy Optimization, a policy-gradient reinforcement learning algorithm that limits the size of each policy update with a clipped objective; widely used for fine-tuning language models
Critic: In Actor-Critic RL, the network that estimates the value (expected future reward) of the current state
BeyondAIME: A challenging in-domain mathematical reasoning benchmark used to evaluate the model
SimpleQA: A cross-domain factual question answering benchmark used for zero-shot evaluation
CoT: Chain-of-Thought, a prompting technique in which the model generates intermediate reasoning steps before giving its final answer
Verbalized Confidence: A technique where the model explicitly outputs a scalar confidence score (e.g., '0.8') in text
Log-scale SNR gain: The improvement in the Signal-to-Noise Ratio relative to a baseline, expressed on a logarithmic scale; used to quantify how effectively hallucinations are reduced
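
The SNR, Behavioral Calibration, Brier Score, and Log-scale SNR gain entries above can be made concrete with a small sketch. This is an illustration only, not the paper's implementation: all function names are hypothetical, and the log base (10) and the "log of the SNR ratio" form of the gain are assumptions, since the glossary does not specify them.

```python
import math


def snr(num_correct: int, num_hallucinated: int) -> float:
    """Correct answers divided by hallucinated answers; refusals are not counted."""
    if num_hallucinated == 0:
        return math.inf  # no hallucinations among the answered questions
    return num_correct / num_hallucinated


def log_snr_gain(snr_model: float, snr_baseline: float) -> float:
    """ASSUMPTION: gain = log10(model SNR / baseline SNR); base 10 is illustrative."""
    return math.log10(snr_model / snr_baseline)


def should_answer(confidence: float, threshold: float) -> bool:
    """Behavioral calibration rule: answer only if confidence p >= risk threshold t."""
    return confidence >= threshold


def brier_score(probs, outcomes) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
```

Under these assumptions, raising the threshold t makes `should_answer` refuse more often, which can only help `snr` if the refused questions are mostly ones the model would have answered incorrectly; that is the case when its confidence is well calibrated, which a low `brier_score` indicates.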