CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer
Interleave Token: A special token introduced by this paper that triggers the selection of relevant visual tokens from the image encoder to be inserted into the text stream
SFT: Supervised Fine-Tuning—training a model on a labeled dataset to learn a specific behavior or format
GRPO: Group Relative Policy Optimization, a reinforcement learning algorithm that samples a group of outputs for the same prompt and scores each one relative to the others in its group, rather than relying on a separate learned value function
OCR: Optical Character Recognition, technology that converts text in scanned documents, PDF files, or photographs into editable, searchable text
Visual tokens: The embedding vectors produced by a vision encoder (such as a ViT), one vector per image patch
Grid-indexed images: Images overlaid with a grid where each cell has a unique index, used here to create ground-truth labels for visual token selection
RL: Reinforcement Learning—training models by rewarding desired behaviors (correct answers) and punishing undesired ones
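
To make the GRPO entry above concrete, here is a minimal sketch of the group-relative scoring step as it is commonly described: each sampled output's reward is normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration, not details taken from this paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to the other rewards in the same group.

    `rewards` holds one scalar reward per sampled output for a single prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact variant is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Outputs that beat the group average receive a positive score and the rest a negative one, so no separate value function is needed to provide a baseline. When every output in a group earns the same reward, all scores are zero and the group contributes no learning signal.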