VLM: Vision-Language Model—a model capable of processing and generating text based on both visual and textual inputs
fully autoregressive architecture: A VLM design where visual tokens are concatenated with the text embedding sequence, and a single language model predicts each next token from the entire combined sequence
cross-attention architecture: A VLM design where visual information is injected into the language model via interleaved cross-attention layers (text attends to image), rather than concatenation
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and trains small rank-decomposition matrices
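SigLIP: Sigmoid Loss for Language Image Pre-training, a variant of CLIP-style contrastive training that replaces the softmax loss with a pairwise sigmoid loss, so each image-text pair is scored independently without batch-wide normalization; its authors report better performance at smaller batch sizes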
Perceiver Resampler: A module that uses cross-attention with a fixed number of latent queries to pool a variable number of visual features into a fixed-length sequence
visual tokens: The vector representations of image patches or pooled image features that are processed by the language model
OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text
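
To make two of the definitions above more concrete, here is a minimal NumPy sketch of Perceiver-Resampler-style pooling and of a LoRA-adapted linear layer. It is a simplification under stated assumptions, not the implementation used by any particular model: the pooling is a single cross-attention step with one head and no learned projections, and the names `resample`, `lora_linear`, `latents`, `A`, `B`, and `alpha` are hypothetical, chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def resample(visual_features, latents):
    """Pool (n, d) visual features into (num_latents, d) with one cross-attention step.

    Assumption: single head, no query/key/value projections. A real resampler
    would add learned projections and typically stack several such layers.
    """
    d = latents.shape[-1]
    scores = latents @ visual_features.T / np.sqrt(d)  # (num_latents, n)
    weights = softmax(scores, axis=-1)                 # each latent attends over all features
    return weights @ visual_features                   # (num_latents, d)

def lora_linear(x, W, A, B, alpha):
    """Linear layer with a low-rank update: y = x W^T + (alpha / r) * x A^T B^T.

    W (d_out, d_in) stays frozen; only A (r, d_in) and B (d_out, r) would be trained.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d, num_latents = 16, 4
latents = rng.normal(size=(num_latents, d))  # stands in for learned latent queries

# The output length is fixed by the number of latents, whatever the input length.
for n in (9, 25):
    feats = rng.normal(size=(n, d))
    print(n, "features ->", resample(feats, latents).shape)

# With B initialised to zero, the adapted layer starts out identical to the frozen one.
x = rng.normal(size=(2, d))
W = rng.normal(size=(d, d))
A = rng.normal(size=(2, d))
B = np.zeros((d, 2))
print(np.allclose(lora_linear(x, W, A, B, alpha=4.0), x @ W.T))
```

The sketch shows the property each definition turns on: the resampler returns the same number of output vectors for 9 or 25 input features, and the LoRA layer adds only the small matrices `A` and `B` on top of an unchanged `W`.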