MLLM: Multimodal Large Language Model—AI models that can process and generate both text and images
CASPO: Consequence-Aware Safety Policy Optimization—the authors' proposed alignment framework that uses self-distillation and outcome rewards
DPO: Direct Preference Optimization—a method to align language models to preferences without a separate reward model
MDP: Markov Decision Process—a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker
causal blindness: A model's inability to foresee the physical or social consequences of an action within a specific visual context
latent hazard: A danger that is not explicitly stated in the text query but emerges from the interaction between the action and the environment (e.g., turning on a switch in a gas-filled room)
preference ceiling: The point at which static alignment data stops improving model performance because the model's intrinsic reasoning has surpassed the quality of the fixed labels
POS: Part-of-Speech—grammatical categories of words (noun, verb, etc.), used here to analyze token distribution shifts
CDA: Consequence-Driven Alignment—the proposed objective ensuring sequence generation is causally aligned to avoid hazardous environmental transitions
self-distillation: A training process in which a model learns to mimic the outputs of a stronger variant of itself (in this case, a version of the model conditioned on a safety constitution)
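
To make the DPO entry above concrete, here is a minimal sketch of the standard per-example DPO loss as it is usually stated in the general literature. It is not CASPO's objective. The function names, the argument names, and the default beta=0.1 are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and a frozen reference model.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; controls deviation from the reference
) -> float:
    """Standard DPO loss for one preference pair (illustrative sketch)."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# When the policy matches the reference, the loss is log(2).
print(dpo_loss(-1.5, -1.5, -1.5, -1.5))
# Shifting probability toward the chosen response lowers the loss.
print(dpo_loss(-1.0, -2.0, -1.5, -1.5, beta=1.0))
```

No separate reward model appears anywhere in the computation: the preference signal comes entirely from the log-probability ratios between the policy and the reference.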