MLLM: Multimodal Large Language Model, an AI model that takes both images and text as input and typically generates text in response.
visual instruction tuning: Training MLLMs on pairs of images and corresponding instruction-response text to improve their ability to follow user commands.
post-training: Training phases (like instruction tuning) applied after the initial large-scale pre-training to refine model behavior.
CoT: Chain-of-Thought, a prompting strategy in which the model generates intermediate reasoning steps before giving the final answer.
two-stage training: A common MLLM training paradigm: Stage 1 aligns image-text features using captions; Stage 2 fine-tunes on instruction-response pairs.
seed data: A small, high-quality dataset used to fine-tune the synthesizer model so it learns the desired output format.
modality-balancing: A strategy during synthesizer training where some images are replaced with blank ones to force the model to rely on text captions, preventing over-reliance on visual features.
consistency-based filter: A quality control method where a model checks if two different generated responses (e.g., precise vs. informative) to the same prompt are logically consistent.
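
The last two entries, modality-balancing and the consistency-based filter, describe concrete data-processing steps, so a small sketch may help. The Python below is illustrative only, not the original authors' implementation. Every name in it (`TrainingExample`, `ResponsePair`, `modality_balance`, `consistency_filter`, `blank_ratio`, `BLANK_IMAGE`, `toy_judge`) is hypothetical. The substring-based `toy_judge` is a stand-in for the model-based consistency check described in the definition.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, List

# Hypothetical marker for a blank image; a real pipeline would substitute
# an actual blank image (file or tensor) here.
BLANK_IMAGE = "<blank>"


@dataclass(frozen=True)
class TrainingExample:
    image: str    # image path or identifier (assumed representation)
    caption: str  # text caption paired with the image
    target: str   # output the synthesizer should learn to produce


def modality_balance(
    examples: List[TrainingExample], blank_ratio: float, rng: random.Random
) -> List[TrainingExample]:
    """Replace each image with the blank marker with probability blank_ratio,
    so the affected examples can only be solved from the caption."""
    return [
        replace(ex, image=BLANK_IMAGE) if rng.random() < blank_ratio else ex
        for ex in examples
    ]


@dataclass(frozen=True)
class ResponsePair:
    prompt: str
    precise: str      # short, exact response
    informative: str  # longer, more detailed response


def consistency_filter(
    pairs: List[ResponsePair], judge: Callable[[str, str], bool]
) -> List[ResponsePair]:
    """Keep only the pairs whose two responses the judge deems consistent."""
    return [p for p in pairs if judge(p.precise, p.informative)]


def toy_judge(precise: str, informative: str) -> bool:
    # Stand-in for a model-based check (assumption): the short answer
    # must appear somewhere inside the long answer.
    return precise.lower() in informative.lower()
```

For usage, `modality_balance(examples, 0.3, random.Random(0))` would blank roughly 30% of the images while leaving the input list unchanged, and `consistency_filter(pairs, toy_judge)` would drop any pair whose precise answer does not appear in its informative answer. The value 0.3 is an arbitrary illustration, not a ratio taken from the source.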