FID: Fréchet Inception Distance, a metric for image generation quality that compares Inception feature statistics of generated and real images; lower scores indicate the generated distribution is closer to the real one
CM3: Causal Masked Multi-Modal model, an architecture that supports both standard left-to-right autoregressive generation (causal masking) and masked infilling
SFT: Supervised Fine-Tuning—training a pre-trained model on specific task instructions (e.g., 'Edit this image') to improve controllability
Autoregressive: A generation method where the model predicts the next token based on all previous tokens in the sequence
CLIP: Contrastive Language-Image Pre-training—a model used here to encode query images/text for retrieving relevant documents
Contrastive Decoding: A decoding strategy that favors tokens whose likelihood under a 'strong' model (conditioned on text) most exceeds their likelihood under a 'weak' model (unconditional), to improve quality
CFG: Classifier-Free Guidance—a sampling technique that pushes generation towards a conditional prompt and away from a generic/unconditional prompt
Zero-shot: Evaluating a model on a task it was not explicitly trained for (e.g., generating an image from a caption without seeing that specific pair)
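CIDEr: Consensus-based Image Description Evaluation, a metric for image captioning quality that scores a generated caption by its TF-IDF-weighted n-gram similarity to a set of human reference captions; higher is better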
Dense Retriever: A retrieval system that finds relevant documents using vector embeddings rather than keyword matching
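
As a minimal sketch of the CFG entry above: one common formulation blends the logits from a conditional pass and an unconditional pass before sampling. The names `cfg_logits` and `guidance_scale` and the toy logit values are illustrative assumptions, not taken from the source.

```python
import math

def cfg_logits(cond, uncond, guidance_scale):
    """Hypothetical helper: push logits toward the conditional prediction
    and away from the unconditional one. A scale of 1.0 returns the
    conditional logits unchanged; larger scales exaggerate the gap."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits over a 3-token vocabulary (made-up numbers).
cond = [2.0, 1.0, 0.0]     # pass conditioned on the text prompt
uncond = [1.0, 1.0, 1.0]   # pass with the prompt dropped

guided = cfg_logits(cond, uncond, guidance_scale=3.0)
probs = softmax(guided)
```

With a scale above 1.0, the token the prompt favors receives more probability mass than it would under the conditional logits alone. Contrastive decoding follows the same pattern of comparing a conditioned model against an unconditioned one.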