MLLM: Multi-modal Large Language Model, an AI system that processes and reasons across multiple modalities such as text and images
Jailbreak Attack: Malicious inputs designed to bypass a model's safety alignment and induce harmful or prohibited outputs
ASR: Attack Success Rate, the percentage of adversarial attempts that induce a harmful response
White-box Attack: Attacks that use access to the model's internals (parameters, gradients, architecture) to optimize adversarial inputs
Black-box Attack: Attacks that interact with the model only through its inputs and outputs, without internal access, often relying on heuristics or transferability
Visual-Adv: A gradient-based white-box attack that optimizes image pixels to bypass safety filters
FigStep: A black-box attack embedding harmful text instructions as typographic visual content (text-in-image) to evade text-based filters
HILF: High-Impact, Low-Frequency events, meaning rare but catastrophic safety failures that aggregate metrics often miss
Intent Alignment: A metric measuring how well the model's response satisfies the user's request, regardless of safety
PixArt-XL-2-1024-MS: A specific text-to-image diffusion model used here to generate risk-related images