SigLIP: Sigmoid Loss for Language Image Pre-training—a vision encoder pre-trained contrastively on image-text pairs with a pairwise sigmoid loss, used to extract image features
VLM: Vision-Language Model—a model that processes both images and text to generate text outputs
FSDP: Fully Sharded Data Parallel—a memory-efficient training strategy that shards model parameters, gradients, and optimizer states across devices
OCR: Optical Character Recognition—converting images of text into machine-encoded text
RadGraph F1: A metric for evaluating radiology reports by comparing the overlap of clinical entities and relations in the generated vs. reference text
TEDS: Tree-Edit-Distance-based Similarity—a metric for evaluating table recognition by comparing the tree structure of the predicted and reference HTML outputs
Logits soft-capping: A technique that bounds the magnitude of logits (for example, attention logits), typically by passing them through a scaled tanh, cap · tanh(logits / cap), to improve training stability
SMILES: Simplified Molecular Input Line Entry System—a string notation for representing chemical structures
IoU: Intersection over Union—a metric measuring the overlap between a predicted bounding box and the ground truth box
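
The logits soft-capping entry above can be illustrated with a minimal sketch. It assumes the common scaled-tanh formulation; the function name `soft_cap` and the cap value of 50.0 are illustrative choices, not taken from this document.

```python
import math


def soft_cap(logits, cap):
    """Smoothly squash each logit into the interval [-cap, cap].

    Assumes the scaled-tanh form: cap * tanh(logit / cap).
    Small logits pass through almost unchanged; large ones saturate near +/-cap.
    """
    if cap <= 0:
        raise ValueError("cap must be positive")
    return [cap * math.tanh(x / cap) for x in logits]


# Illustrative values only: the cap of 50.0 is an assumption for this example.
raw_logits = [-1000.0, -1.0, 0.0, 1.0, 1000.0]
capped = soft_cap(raw_logits, cap=50.0)
print(capped)
```

Unlike hard clipping, the tanh form stays differentiable everywhere and preserves the ordering of the inputs, which is one reason it tends to be preferred for stabilizing training.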
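
As a worked example of the IoU definition above, here is a small sketch for axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are assumptions made for this illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes.

    Each box is assumed to be (x_min, y_min, x_max, y_max).
    Returns 0.0 when the boxes do not overlap or the union has zero area.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap rectangle, clamped at zero.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A score of 1.0 means the predicted box matches the ground truth exactly, and 0.0 means the boxes are disjoint.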