_comment: REQUIRED: Define ALL technical terms, acronyms, and method names used ANYWHERE in the entire summary. After drafting the summary, perform a MANDATORY POST-DRAFT SCAN: check every section individually (Core.one_sentence_thesis, evaluation_highlights, core_problem, Technical_details, Experiments.key_results notes, Figures descriptions and key_insights). HIGH-VISIBILITY RULE: Terms appearing in one_sentence_thesis, evaluation_highlights, or figure key_insights MUST be defined—these are the first things readers see. COMMONLY MISSED: PPO, DPO, MARL, dense retrieval, silver labels, cosine schedule, clipped surrogate objective, Top-k, greedy decoding, beam search, logit, ViT, CLIP, Pareto improvement, BLEU, ROUGE, perplexity, attention heads, parameter sharing, warm start, convex combination, sawtooth profile, length-normalized attention ratio, NTP. If in doubt, define it.
SAC: Soft Actor-Critic—an off-policy actor-critic algorithm that optimizes a Maximum Entropy objective
MaxEnt RL: Maximum Entropy Reinforcement Learning—an RL paradigm maximizing both expected reward and policy entropy to encourage exploration
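The MaxEnt objective above can be illustrated with a toy worked example. This is a generic sketch of the standard entropy-regularized objective, not code from the summarized paper; the function names and numbers are invented for illustration.

```python
import math

# Toy illustration of the MaxEnt RL objective for a single state:
# J(pi) = E[r(s, a)] + alpha * H(pi(.|s)), where H is the policy entropy.

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maxent_objective(probs, rewards, alpha=0.2):
    """Expected reward plus an alpha-weighted entropy bonus."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return expected_reward + alpha * entropy(probs)

# A uniform policy trades some expected reward for an entropy bonus,
# which is what encourages exploration under this objective.
uniform = maxent_objective([0.5, 0.5], [1.0, 0.0])  # 0.5 + 0.2 * log 2
greedy = maxent_objective([1.0, 0.0], [1.0, 0.0])   # 1.0 + 0
```

With a larger alpha, the entropy bonus can outweigh the reward gap and the uniform policy becomes preferable, which is the exploration pressure MaxEnt RL provides.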
Diffusion Model: A generative model that synthesizes data by learning to reverse a stochastic process that gradually corrupts data with noise
Probability Flow ODE: An Ordinary Differential Equation that describes a deterministic process sharing the same marginal distributions as the stochastic diffusion process, allowing for exact likelihood computation
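The deterministic character of a probability flow ODE can be sketched with a fixed-step Euler integrator on a 1-D toy drift. This is purely illustrative of deterministic ODE integration (as opposed to stochastic sampling); the drift, function name, and step count are all invented and unrelated to the paper's actual ODE.

```python
# Toy Euler integration of a deterministic ODE dx/dt = drift(x, t),
# run "backwards" from t=1 to t=0 the way ODE-based samplers integrate
# from noise back toward data. All specifics here are hypothetical.
def euler_integrate(x, drift, t_start=1.0, t_end=0.0, steps=100):
    """Fixed-step Euler integration of dx/dt = drift(x, t)."""
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        x = x + dt * drift(x, t)
        t += dt
    return x
```

Because the trajectory is deterministic, the same start point always maps to the same end point, which is what makes exact likelihood computation possible along the flow.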
Q-weighted Noise Estimation: A proposed training objective for the policy network that weights the noise prediction loss by the Q-value to approximate the target MaxEnt policy
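The general idea of weighting a noise-prediction loss by Q-values can be sketched as follows. This is a minimal hypothetical sketch of the concept, not the paper's actual objective: the function name, the softmax weighting, and the temperature parameter are all assumptions made for illustration.

```python
import math

# Hypothetical sketch: each sample's denoising (noise-prediction) error is
# weighted by a softmax over Q-values, so high-Q actions dominate the loss
# and the diffusion policy is pushed toward them.
def q_weighted_noise_loss(pred_noise, true_noise, q_values, temperature=1.0):
    """Squared noise-prediction errors weighted by softmax(Q / temperature)."""
    exps = [math.exp(q / temperature) for q in q_values]
    z = sum(exps)
    weights = [e / z for e in exps]  # normalized Q-derived weights
    errors = [(p - t) ** 2 for p, t in zip(pred_noise, true_noise)]
    return sum(w * e for w, e in zip(weights, errors))
```

When all Q-values are equal the weights are uniform and this reduces to a mean squared error; raising the Q-value of a low-error sample shrinks the loss, which is the intended bias toward the target MaxEnt policy.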
Soft Bellman Error: The error between the current Q-value and a target Q-value that includes an entropy bonus term
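The entropy-augmented target behind the soft Bellman error can be written out concretely. This is a generic SAC-style sketch with invented function names and default hyperparameters, not the paper's implementation.

```python
# Generic soft Bellman backup: the target adds an entropy bonus
# -alpha * log pi(a'|s') to the next-state Q-value.
def soft_bellman_target(reward, next_q, next_log_prob, gamma=0.99, alpha=0.2, done=False):
    """Target y = r + gamma * (Q(s', a') - alpha * log pi(a'|s')) for non-terminal steps."""
    if done:
        return reward
    return reward + gamma * (next_q - alpha * next_log_prob)

def soft_bellman_error(current_q, reward, next_q, next_log_prob, **kw):
    """Squared difference between the current Q-value and the soft target."""
    return (current_q - soft_bellman_target(reward, next_q, next_log_prob, **kw)) ** 2
```

A low-probability next action (very negative log-probability) raises the target, which is exactly how the entropy bonus enters the critic's regression signal.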
Signal-to-Noise Ratio (SNR): A measure (the ratio of signal power to noise power) used in diffusion models to schedule the noise levels, parameterized by alpha_t, added at each timestep
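The relationship between the schedule coefficients and the SNR can be sketched with a common variance-preserving cosine schedule. This particular parameterization is a widely used convention, not necessarily the one in the summarized paper; the function names are invented.

```python
import math

# Illustrative variance-preserving cosine schedule: the signal scale alpha_t
# shrinks and the noise scale sigma_t grows as t goes from 0 to 1, so
# SNR(t) = alpha_t^2 / sigma_t^2 falls monotonically across timesteps.
def cosine_alpha_sigma(t):
    """Signal scale alpha_t and noise scale sigma_t for t in [0, 1]."""
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    return alpha, sigma

def snr(t):
    """Signal-to-noise ratio alpha_t^2 / sigma_t^2 at timestep t."""
    alpha, sigma = cosine_alpha_sigma(t)
    return alpha ** 2 / sigma ** 2
```

Here alpha_t^2 + sigma_t^2 = 1 at every t (the variance-preserving property), and scheduling the noise levels amounts to choosing how quickly the SNR decays.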