MLLM: Multimodal Large Language Model—AI systems that can process and reason about both text and images.
CLIP: Contrastive Language-Image Pre-training—a model that learns joint representations for images and text, often used as the vision encoder in MLLMs.
Diffusion Model: A generative model that creates images by gradually denoising random noise, often conditioned on text or embeddings.
FID: Fréchet Inception Distance, a metric that assesses the quality of generated images by comparing the distribution of their Inception-network features with that of real images; lower is better.
SSIM: Structural Similarity Index Measure, a metric that scores the similarity of two images by comparing their luminance, contrast, and structure; identical images score 1.
OWLv2: An open-vocabulary object detector, used here to verify that the target object was not actually inserted into the generated image.
unCLIP: A variant of Stable Diffusion that conditions image generation on CLIP image embeddings rather than on text alone.
Mapper: A simple Multi-Layer Perceptron (MLP) trained to align CLIP embeddings with the MLLM's vision encoder space.
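
To make the SSIM entry above concrete, here is a minimal sketch of the SSIM formula computed over a single global window, using only the Python standard library. The function name `global_ssim` and its parameters are hypothetical, chosen for illustration. The constants k1 = 0.01 and k2 = 0.03 are the commonly used defaults. Practical implementations usually compute SSIM over local sliding windows and average the results, so this simplified version will not reproduce their scores exactly.

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM for two equal-length lists of pixel intensities.

    Illustrative simplification: statistics are taken over the whole image
    rather than over local windows.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")

    # Small constants that stabilize the ratio when means or variances are near zero.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x = fmean(x)
    mu_y = fmean(y)

    # Population variance and covariance (divide by N).
    var_x = fmean([(a - mu_x) ** 2 for a in x])
    var_y = fmean([(b - mu_y) ** 2 for b in y])
    cov_xy = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Passing the same flattened grayscale image as both arguments yields a score of 1 (up to floating-point rounding), while structurally dissimilar inputs score lower.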