LLM: Large Language Model—a neural network trained on vast text to generate human-like language
MLLM: Multi-modal Large Language Model—an LLM capable of processing inputs like images in addition to text
ViT: Vision Transformer—a model that processes images as sequences of patches using attention mechanisms
CLIP: Contrastive Language-Image Pre-training—a model trained to match images with their text descriptions
DINOv2: A self-supervised vision transformer trained without labels to learn robust visual features
Q-Former: A module from BLIP-2 that bridges frozen image encoders and LLMs using learnable query vectors
SAM: Segment Anything Model—a foundation model for image segmentation that can cut out objects based on prompts
Stable Diffusion: A latent diffusion model that generates images from text descriptions
RefinedWeb: A large-scale dataset of high-quality web text used to maintain LLM language capabilities during training
LAION-400M: A dataset of roughly 400 million image-text pairs collected from the internet and filtered with CLIP
LAION-COCO: A dataset of web images paired with synthetic, COCO-style captions generated by a captioning model rather than written by humans
visual embeddings: Numerical vector representations of image content produced by an encoder
sub-images: Smaller cropped sections of a high-resolution image processed independently to preserve detail
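
As an illustration of the sub-images entry, the sketch below computes crop boxes that tile a high-resolution image into a grid. The function name `sub_image_boxes`, the tile size of 224, and the choice to clip edge tiles to the image bounds are assumptions made for the example, not details taken from any particular model.

```python
def sub_image_boxes(width: int, height: int, tile: int) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes that cover the image in a grid.

    Hypothetical helper: tiles do not overlap, and tiles on the right and
    bottom edges are clipped to the image bounds instead of being padded.
    """
    if tile <= 0:
        raise ValueError("tile must be positive")
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes


# A 700 x 500 image with 224-pixel tiles gives a 4 x 3 grid of sub-images.
boxes = sub_image_boxes(width=700, height=500, tile=224)
print(len(boxes))  # 12
```

Each box could then be cropped and passed to the image encoder on its own, so that fine detail is not lost when the full image is downscaled to the encoder's input size.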