MLLM: Multi-modal Large Language Model—an LLM adapted to process non-text inputs like images alongside text
DPO: Direct Preference Optimization—an alignment method that optimizes the policy directly on preference pairs without training an explicit reward model
Visual Instruction Tuning: Fine-tuning an LLM on images paired with instructions and target responses so that it can follow instructions about visual content
Catastrophic Forgetting: The phenomenon where a model forgets previously learned information (e.g., text skills) when trained on new data (e.g., visual tasks)
SFT: Supervised Fine-Tuning—training a model to mimic reference answers
RLHF: Reinforcement Learning from Human Feedback—aligning models by training a reward model on human preferences and then optimizing the policy against it with reinforcement learning
SteerLM: A method that conditions model generation on specified attribute scores (e.g., helpfulness: 5) during training and inference
Modality Conflict: Interference between different data types (text vs. image) during training that harms performance in one or both domains
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers
Chain-of-Thought: Prompting the model to generate intermediate reasoning steps before the final answer