VLM: Vision-Language Model—a model that accepts images and text as input and generates text output
OCR: Optical Character Recognition—the conversion of images of typed, handwritten, or printed text into machine-encoded text
DocVQA: A benchmark dataset for Visual Question Answering on documents
cross-attention architecture: A VLM design in which cross-attention layers, interleaved with the layers of a frozen LLM, let text tokens attend to visual features (e.g., Flamingo)
self-attention architecture: A VLM design where visual features are treated as tokens, concatenated with text, and processed by the LLM's standard self-attention (e.g., LLaVA)
perceiver resampler: A module that compresses a variable number of visual features into a fixed, smaller number of visual tokens by having a fixed set of learned latent queries cross-attend to those features
SigLIP: A vision encoder trained with a pairwise sigmoid loss for image-text alignment, often used as a backbone in VLMs
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes pre-trained weights and injects trainable rank decomposition matrices
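
To make the LoRA entry concrete, here is a minimal sketch of a single adapted linear layer. It assumes the common formulation y = xWᵀ + (alpha / r) · xAᵀBᵀ, with B initialised to zero so the adapted layer starts out identical to the frozen one. The class name `LoRALinear`, the helpers `matmul` and `transpose`, and the `alpha` scaling parameter are illustrative choices rather than any particular library's API, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration: only A and B would receive gradient updates
    during fine-tuning; W is never modified.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weights, kept frozen
        # A (rank x d_in) gets a small random init; B (d_out x rank) starts
        # at zero, so B @ A = 0 and the layer initially matches the base model.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank  # assumed scaling convention

    def forward(self, x):
        """x is a batch of rows (batch x d_in); returns batch x d_out."""
        base = matmul(x, transpose(self.weight))
        low_rank = matmul(matmul(x, transpose(self.A)), transpose(self.B))
        return [
            [b + self.scale * l for b, l in zip(base_row, low_row)]
            for base_row, low_row in zip(base, low_rank)
        ]

    def trainable_params(self):
        """Number of parameters in A and B: rank * (d_in + d_out)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: a 2 x 3 frozen weight adapted with rank 1.
layer = LoRALinear([[1, 2, 3], [4, 5, 6]], rank=1)
initial_output = layer.forward([[1, 0, 1]])
```

In this toy case the saving is small (5 trainable parameters against 6 frozen ones), but the trainable count grows as r · (d_in + d_out) rather than d_in · d_out, which is where the parameter efficiency comes from at realistic layer sizes.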