Latent Embeddings: Continuous vector representations generated by the model's decoder layer, used as inputs for subsequent steps instead of discrete text tokens
VLPO: Visual-Latent Policy Optimization—an RL algorithm that enables policy gradient updates on continuous latent vectors by estimating their probability density
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of outputs for the same input, typically used for text
Observation Tokens: Text tokens in a reasoning chain that describe specific visual features or findings derived from the image
Auxiliary Images: Intermediate images (e.g., crops, grounding highlights) used in training data to guide the reasoning process
SFT: Supervised Fine-Tuning—training the model on labeled input-output pairs
NTP: Next-Token Prediction—the standard loss function for training language models
OOD: Out-of-Distribution—tasks or data types not seen during training
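
To make the GRPO entry above concrete, here is a minimal sketch of group-relative advantage normalization: the rewards for several outputs sampled from the same input are centered on the group mean and scaled by the group standard deviation. The function name `group_relative_advantages`, the `eps` stabilizer, and the use of the population standard deviation are illustrative assumptions, not details taken from the source. Implementations differ on these points.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of outputs sampled for the same input.

    Hypothetical helper for illustration only. `eps` avoids division by zero
    when every output in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four outputs for the same prompt, scored with a binary reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all outputs score the same carries no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

Outputs that beat the group average receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline.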