Scale-then-Compress: A design paradigm that first scales up the input (higher image resolution or more video frames) to capture detail, then reduces the visual token count via pooling or reshaping for efficiency
Dynamic-S2: An adaptive image processing method that splits images into tiles based on their native aspect ratio rather than resizing to a fixed square
Spatial-to-Channel (STC): A compression technique that folds each block of neighboring spatial tokens (e.g., 2x2) into the channel dimension, reducing sequence length by 4x while widening each remaining token by the same factor
DeltaLoss: A data pruning metric measuring the difference in loss between a large teacher model and a small student model to identify valuable training examples
SigLIP: Sigmoid Loss for Language Image Pre-training—a contrastive vision-language encoder used as the vision tower
Qwen2: A family of dense Large Language Models used as the text backbone
FP8: 8-bit Floating Point format—a reduced precision number format that accelerates matrix multiplications on modern GPUs
AWQ: Activation-aware Weight Quantization—a method for compressing LLM weights to low bit-widths (e.g., 4-bit) while preserving accuracy
ViT: Vision Transformer, the architecture of the visual encoder in the vision-language model (VLM)
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique
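
The Spatial-to-Channel entry above describes a reshape, which may be easier to follow as code. The following is a minimal pure-Python sketch, not the authors' implementation: the function name spatial_to_channel, the nested-list token layout, and the row-major ordering inside each block are all assumptions made for illustration.

```python
def spatial_to_channel(grid, block=2):
    """Fold each block x block patch of tokens into a single wider token.

    grid: H x W nested list; each entry is a list of C floats (one token).
    Returns an (H/block) x (W/block) grid whose tokens carry
    block * block * C channels. Layout and ordering are illustrative
    assumptions, not taken from the source.
    """
    height, width = len(grid), len(grid[0])
    if height % block or width % block:
        raise ValueError("grid sides must be divisible by block")
    out = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            merged = []
            # Concatenate the channels of the block's tokens in row-major order.
            for dy in range(block):
                for dx in range(block):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        out.append(row)
    return out


# Toy 4x4 grid of 3-channel tokens: 16 tokens in, 4 tokens out.
grid = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]
compressed = spatial_to_channel(grid)
num_in = len(grid) * len(grid[0])
num_out = len(compressed) * len(compressed[0])
```

With block=2 the sequence length drops from 16 to 4 tokens while each token grows from 3 to 12 channels, so no values are discarded by the reshape itself; any information loss would come from a later projection back to a narrower width.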