PaLI-3: A smaller-scale (5B-parameter) vision-language model that pairs a SigLIP-pretrained ViT vision backbone with a UL2 language backbone
SigLIP: Sigmoid Loss for Language Image Pre-training, a contrastive objective that scores each image-text pair independently with a sigmoid (rather than a batch-wide softmax) and is used to train vision encoders
UL2: Unifying Language Learning Paradigms, a pre-training framework for language models whose objective mixes several denoising tasks; the name also refers to models trained with it
ChartQA: A benchmark dataset for question answering on charts, containing both human-written and machine-generated questions
Rationale: A step-by-step explanation or reasoning trace generated by a model to justify an answer
Program-of-Thought (PoT): A prompting technique where the model generates executable code (e.g., Python) that is run to compute the answer, rather than reasoning in text alone
Derendering: The task of translating a visual chart back into its underlying data table or code representation
Multi-task setup: Training a single model on multiple distinct tasks (e.g., answering questions and generating rationales) at the same time, with a task-specific prefix on each input indicating which task to perform
Vision Transformer (ViT): A model architecture that applies a Transformer directly to sequences of image patches
OCR: Optical Character Recognition, technology that converts images of text into machine-encoded text
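
To make the Program-of-Thought entry concrete, here is a minimal sketch of how a generated program might be executed to obtain an answer. Everything in it is an assumption for illustration: the question, the chart values, the `generated_program` string, the `run_program_of_thought` helper, and the convention that the result is stored in a variable named `ans` are all hypothetical and not taken from the source.

```python
# Hypothetical model output for the chart question
# "What is the difference between the highest and lowest value?"
# The values below are invented for illustration only.
generated_program = """
values = {"2019": 42.0, "2020": 35.5, "2021": 51.5}
ans = max(values.values()) - min(values.values())
"""


def run_program_of_thought(program: str):
    """Execute model-generated code and return the variable `ans`.

    Assumes (hypothetically) that the program stores its result in `ans`.
    Only a few arithmetic builtins are exposed. This limits what the
    program can call, but it is NOT a real security sandbox.
    """
    allowed_builtins = {
        "abs": abs, "len": len, "max": max,
        "min": min, "round": round, "sum": sum,
    }
    namespace = {"__builtins__": allowed_builtins}
    exec(program, namespace)
    return namespace.get("ans")


answer = run_program_of_thought(generated_program)
print(answer)
```

The arithmetic is delegated to the interpreter rather than to the model's text generation, which is the point of the technique. Untrusted generated code should be run in a properly isolated environment in practice.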