MLRM: Multi-modal Large Reasoning Model—an MLLM enhanced with explicit reasoning capabilities (often outputting 'thinking' steps)
MLLM: Multi-modal Large Language Model, a model that accepts both text and images as input and generates text in response (e.g., GPT-4V, LLaVA)
ASR: Attack Success Rate—the percentage of adversarial queries that successfully elicit unsafe or harmful responses
HR: Harmfulness Rating, a score (typically 1-5) that a judge model (like GPT-4) assigns to a model's output to quantify how toxic or dangerous that output is
Jailbreaking: The process of manipulating a model with specially crafted inputs (prompts or images) to bypass its safety filters
SFT: Supervised Fine-Tuning—training a model on labeled examples (here, reasoning chains) to instill specific behaviors
RL: Reinforcement Learning—training a model via rewards/penalties to optimize behavior
R1-style reasoning: Reasoning capabilities similar to DeepSeek-R1, characterized by generating long chains of thought before the final answer
Self-Correction: The ability of a model to realize during its reasoning process that it is generating unsafe content and refuse to output it in the final answer
Reasoning Tax: The degradation in safety alignment observed when a model is fine-tuned or trained to perform complex reasoning