MM-LLMs: MultiModal Large Language Models—models that extend LLMs to support inputs or outputs of other modalities (images, audio) alongside text
PEFT: Parameter-Efficient Fine-Tuning—techniques like LoRA or prefix-tuning that train only a small number of parameters (often newly added ones) while keeping the rest of the model frozen, reducing computational cost
ICL: In-Context Learning—the ability of a model to perform tasks based on examples provided in the prompt without parameter updates
Modality Encoder: A component that encodes inputs from diverse modalities (e.g., images, audio) into feature representations
Input Projector: A module that aligns encoded features from other modalities with the text feature space of the LLM
Output Projector: A module that maps LLM signal tokens into features understandable by the Modality Generator
Modality Generator: A component (often a diffusion model) that synthesizes content in distinct modalities based on features from the Output Projector
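LDM: Latent Diffusion Model—a diffusion model that runs the denoising process in the compressed latent space of an autoencoder rather than on the raw data, used for synthesizing high-quality images or audio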
Q-Former: A transformer-based input projector that extracts relevant features from encoded inputs using learnable queries (used in BLIP-2)
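MM PT: MultiModal Pre-Training—the first training stage, which aligns features from other modalities with the LLM's text feature space, typically by training the Input and Output Projectors
MM IT: MultiModal Instruction-Tuning—the second training stage, which fine-tunes the model on instruction-formatted data to align it with human intent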
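
To make the Modality Encoder and Input Projector entries concrete, the following is a minimal, shape-level sketch of how encoded image features might be projected into the LLM's embedding space and joined with text embeddings. Everything here is an illustrative assumption: the dimensions `ENC_DIM` and `LLM_DIM`, the helper names `make_linear` and `project`, and the choice of a single linear layer as the projector (real systems may use an MLP, cross-attention, or a Q-Former instead). Random numbers stand in for real encoder outputs and embeddings.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random (in_dim x out_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features, weight):
    """Multiply an (n_tokens x in_dim) feature list by an (in_dim x out_dim) weight."""
    out_dim = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) for j in range(out_dim)]
        for row in features
    ]


rng = random.Random(0)

ENC_DIM = 8  # hypothetical Modality Encoder feature width
LLM_DIM = 4  # hypothetical LLM embedding width

# Stand-in for Modality Encoder output: 3 image patch features.
image_features = [[rng.random() for _ in range(ENC_DIM)] for _ in range(3)]
# Stand-in for the LLM's text embeddings of 5 prompt tokens.
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(5)]

# Input Projector (assumed here to be one linear layer): ENC_DIM -> LLM_DIM.
input_projector = make_linear(ENC_DIM, LLM_DIM, rng)
projected = project(image_features, input_projector)

# The LLM consumes projected modality tokens alongside the text tokens.
llm_input = projected + text_embeddings
print(len(llm_input), len(llm_input[0]))  # prints "8 4"
```

In this sketch only `input_projector` would be trained during MM PT, while the encoder and the LLM stay frozen; that split is a common setup rather than a requirement.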