M-RoPE: Multimodal Rotary Positional Embedding—a technique that splits positional embeddings into time, height, and width components to represent 3D space-time coordinates
Naive Dynamic Resolution: A strategy that maps images to a variable number of visual tokens based on their native resolution and aspect ratio, rather than resizing to a fixed square
ViT: Vision Transformer. A neural network that splits an image into fixed-size patches and processes the resulting sequence of patch tokens with a Transformer
Pooling: Reducing the number of tokens by combining adjacent feature vectors (e.g., 2x2 pooling turns 4 tokens into 1)
C-Abstractor: A visual projector module used in previous Qwen models; replaced here by simple pooling and MLP
SFT: Supervised Fine-Tuning—training on instruction-response pairs
MathVista: A benchmark evaluating mathematical reasoning in visual contexts
DocVQA: Document Visual Question Answering—a benchmark for reading and understanding text in documents
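RoPE: Rotary Positional Embedding. A method that encodes token position by rotating the query and key vectors in the attention mechanism by position-dependent angles, so that attention scores depend on the relative distance between tokens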
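
Two of the entries above, RoPE and 2x2 pooling, can be illustrated with a minimal sketch. It is not taken from any model's implementation. The function names (rope_rotate, pool_2x2, dot), the base of 10000, the pairing of consecutive dimensions, and the use of averaging as the pooling operation are all assumptions made for the example.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by position * base**(-i/d).

    Assumed convention: consecutive-dimension pairing and base 10000.
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for one attention score."""
    return sum(x * y for x, y in zip(a, b))

def pool_2x2(grid):
    """Average each non-overlapping 2x2 block of feature vectors into one.

    Averaging is an assumption; the glossary only says adjacent vectors
    are combined.
    """
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    pooled = []
    for r in range(0, h, 2):
        row = []
        for c in range(0, w, 2):
            block = [grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([sum(vals) / 4.0 for vals in zip(*block)])
        pooled.append(row)
    return pooled

# Hypothetical query/key vectors and a 4x4 grid of 1-dimensional features.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))    # offset 2
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))  # offset 2 again
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled = pool_2x2(grid)  # 16 tokens in, 4 tokens out
```

The two scores agree up to floating-point error because rotating both vectors leaves their dot product dependent only on the position offset, which is the property the RoPE entry describes. The pooled grid has one token for every four input tokens, matching the 2x2 example in the pooling entry.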