RLVR: Reinforcement Learning with Verifiable Rewards—optimizing models using ground-truth outcome correctness (e.g., math answers) rather than human preference labels
Perceptual Anchors: A minority subset of tokens (approx. 15%) in a generated sequence that exhibit high attention weights towards visual inputs, effectively 'grounding' the text in the image
Cross-modal Attention: The attention mechanism in Transformers where text tokens attend to image patch embeddings
METIS: A graph partitioning algorithm used here to cluster tokens based on the similarity of their attention patterns
GRPO: Group Relative Policy Optimization—an RL algorithm that estimates advantages by comparing a sample's reward to the average reward of a group of samples for the same input
Attention Sink: A phenomenon where attention heads disproportionately focus on specific tokens (like the first token) regardless of relevance; the method in this paper corrects for this bias
Connectivity Density: A metric defined in this paper quantifying the aggregate attention weight a generated text token places on visual patches
Advantage Modulation: The process of re-weighting the standard RL advantage signal (how much better an action was than the baseline) based on token-specific importance (here, visual grounding)
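
To show how GRPO, Connectivity Density, and Advantage Modulation fit together, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation. Dividing the group-relative advantage by the group standard deviation follows the common GRPO formulation, which the definition above does not spell out. The function names, the toy attention rows, the `alpha` parameter, and the weighting rule `1 + alpha * density / max_density` are all hypothetical, and attention-sink debiasing is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each sample's reward relative to its group.

    Assumption: normalise by the group standard deviation, as in the
    common GRPO formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def connectivity_density(attn_row, visual_idx):
    """Aggregate attention one generated token places on visual patches.

    attn_row: attention weights of a single text token over all positions.
    visual_idx: positions that hold image patch embeddings.
    Assumption: a plain sum; the paper's exact aggregation may differ.
    """
    return sum(attn_row[j] for j in visual_idx)


def modulate_advantage(advantage, densities, alpha=1.0):
    """Spread one sequence-level advantage over tokens (hypothetical rule).

    Tokens with higher connectivity density receive a larger share of the
    learning signal: weight = 1 + alpha * density / max_density.
    """
    max_d = max(densities)
    if max_d == 0:
        # No visual attention anywhere: fall back to the unmodulated signal.
        return [advantage] * len(densities)
    return [advantage * (1 + alpha * d / max_d) for d in densities]


# Four sampled answers to the same prompt; reward 1 = verified correct.
rewards = [1, 0, 0, 1]
advantages = group_relative_advantages(rewards)  # roughly [1, -1, -1, 1]

# Toy attention rows for two generated tokens; positions 0-1 are image patches.
attn = [
    [0.40, 0.30, 0.20, 0.10],  # strongly grounded token
    [0.05, 0.05, 0.60, 0.30],  # mostly attends to text
]
densities = [connectivity_density(row, visual_idx=[0, 1]) for row in attn]

token_advantages = modulate_advantage(advantages[0], densities, alpha=1.0)
print(densities, token_advantages)
```

In this toy case the first token puts about 0.7 of its attention on image patches and the second about 0.1, so the first token ends up with the larger per-token advantage. Setting `alpha=0` recovers the standard, unmodulated GRPO signal.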