MPO: Mixed Preference Optimization—a method combining preference loss, quality loss, and generation loss
DPO: Direct Preference Optimization—optimizing a policy to satisfy preferences without an explicit reward model
BCO: Binary Classifier Optimization—a quality loss that treats the policy as a binary classifier judging the absolute quality of each individual response (good or bad)
DropoutNTP: Dropout Next Token Prediction—a data construction method in which negative samples are generated by truncating a good response and having the model complete it without access to the image, which induces hallucination
SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs using standard next-token prediction
CoT: Chain-of-Thought—prompting the model to generate intermediate reasoning steps before the final answer
MMPR: MultiModal PReference dataset—the large-scale dataset constructed in this paper
KL divergence: Kullback-Leibler divergence—a measure of how one probability distribution differs from a second, reference probability distribution
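
How the three loss terms named in the MPO entry fit together can be illustrated with a small sketch. It is a minimal illustration, not the paper's implementation: it assumes the commonly used DPO and BCO formulations, takes summed sequence log-probabilities as plain floats, and uses hypothetical function names and hypothetical defaults for beta, delta, and the weights w_p, w_q, w_g.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_c, ref_c, pol_r, ref_r, beta=0.1):
    """Preference loss: favor the chosen response over the rejected one.

    pol_* / ref_* are summed log-probs of the chosen (c) and rejected (r)
    responses under the policy and the frozen reference model.
    """
    margin = beta * ((pol_c - ref_c) - (pol_r - ref_r))
    return -log_sigmoid(margin)


def bco_loss(pol_c, ref_c, pol_r, ref_r, beta=0.1, delta=0.0):
    """Quality loss: score each response on its own against a shift delta."""
    reward_c = beta * (pol_c - ref_c)
    reward_r = beta * (pol_r - ref_r)
    return -log_sigmoid(reward_c - delta) - log_sigmoid(-(reward_r - delta))


def sft_loss(pol_c, num_tokens):
    """Generation loss: length-normalized negative log-likelihood of the chosen response."""
    return -pol_c / num_tokens


def mixed_loss(pol_c, ref_c, pol_r, ref_r, num_tokens,
               w_p=1.0, w_q=1.0, w_g=1.0):
    """Weighted sum of the three terms (weights here are assumed, not from the paper)."""
    return (w_p * dpo_loss(pol_c, ref_c, pol_r, ref_r)
            + w_q * bco_loss(pol_c, ref_c, pol_r, ref_r)
            + w_g * sft_loss(pol_c, num_tokens))


if __name__ == "__main__":
    # Policy slightly prefers the chosen response relative to the reference.
    print(mixed_loss(pol_c=-5.0, ref_c=-6.0, pol_r=-6.0, ref_r=-6.0, num_tokens=3))
```

In this sketch, a policy identical to the reference gives a preference term of log 2, and the term shrinks as the policy raises the chosen response's log-probability relative to the rejected one.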