MM-LLM: Multimodal Large Language Model—an LLM extended to process and/or generate non-text modalities
ImageBind: A unified encoder that maps six modalities (images, text, audio, depth, thermal, and IMU data) into a single shared embedding space
Diffusion Model: A class of generative models that create data (such as images or audio) by learning to reverse a gradual noise-adding process, starting from random noise
LoRA: Low-Rank Adaptation, a technique that fine-tunes large models by freezing the original weights and training only a small set of added low-rank matrices
MosIT: Modality-switching Instruction Tuning—a training phase introduced in this paper to teach the model to switch between generating different modalities based on context
Signal Tokens: Special tokens (e.g., [IMG], [AUD]) generated by the LLM to signal the decoder to start generating non-text content
Vicuna: An open-source text-based Large Language Model derived from LLaMA
Concept Tokens: Learnable tokens designed to aggregate grid-level features (like image patches) into semantic units closer to language tokens
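
To make the LoRA entry above concrete, here is a minimal sketch in plain Python of a linear layer with a frozen weight and a trainable low-rank update. The class name `LoRALinear`, the helper functions, and the `alpha` scaling parameter are illustrative assumptions, not code from the paper; the sketch only shows the general idea that the effective weight is W + (alpha / rank) * B A, with B initialized to zero so the layer starts out identical to the pretrained one.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight (d_out x d_in), never updated
        # A (rank x d_in) starts as small random values; B (d_out x rank) starts
        # at zero, so the update B @ A is zero before any training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def effective_weight(self):
        """Return W + scaling * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameter_count(self):
        """Count only the entries of A and B; W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For a weight of shape d_out x d_in, the added matrices hold rank * (d_in + d_out) values, which is far fewer than d_out * d_in when the rank is small relative to the layer dimensions.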