
Credit Where It is Due: Cross-Modality Connectivity Drives Precise Reinforcement Learning for MLLM Reasoning

Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang
Alibaba Group Holding Limited, Shanghai Jiao Tong University, Shanghai University of Finance and Economics, Wuhan University
arXiv (2026)
MM RL Reasoning Benchmark

📝 Paper Summary

Multimodal Reinforcement Learning Visual Chain-of-Thought Reasoning
AT-RL improves multimodal reasoning by identifying 'perceptual anchor' tokens via cross-modal attention and concentrating reinforcement learning credit on them rather than broadcasting the signal uniformly across all tokens.
Core Problem
Standard Reinforcement Learning with Verifiable Rewards (RLVR) broadcasts credit uniformly across all tokens, failing to distinguish between tokens that actually interpret visual evidence and those that merely follow linguistic patterns.
Why it matters:
  • Multimodal models often generate fluent reasoning chains that are not grounded in the actual image, leading to hallucinations
  • Uniform credit assignment dilutes the learning signal, preventing the model from learning precisely which visual observations led to the correct or incorrect answer
  • Existing methods like GRPO or DAPO optimize the entire sequence indiscriminately, which is inefficient for correcting specific visual perception errors
Concrete Example: In a geometry problem asking 'Where does the line intersect?', a model might correctly define a midpoint formula (textual knowledge) but incorrectly identify point coordinates from the image (visual perception). Standard RL punishes the valid formula and the wrong coordinate equally. AT-RL focuses the penalty specifically on the coordinate tokens (anchors) that failed to align with the visual input.
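The contrast in this example can be sketched in a few lines of Python. This is only an illustration: the token list, the anchor flags, the advantage of -1.0, and the weights 1.5 and 0.5 are hypothetical values chosen for readability, not figures from the paper.

```python
def uniform_credit(tokens, advantage):
    """Broadcast the same sequence-level advantage to every token."""
    return [advantage for _ in tokens]


def anchor_weighted_credit(is_anchor, advantage, anchor_w=1.5, other_w=0.5):
    """Scale the advantage up on anchor tokens and down elsewhere.

    anchor_w and other_w are illustrative assumptions, not values from the paper.
    """
    return [advantage * (anchor_w if flag else other_w) for flag in is_anchor]


# Hypothetical tokenization of a wrong answer: the formula is valid,
# but the coordinate read from the image is incorrect.
tokens = ["midpoint", "=", "(x1+x2)/2", "A", "=", "(3,", "4)"]
is_anchor = [False, False, False, False, False, True, True]
advantage = -1.0  # negative because the final answer was wrong

uniform = uniform_credit(tokens, advantage)
focused = anchor_weighted_credit(is_anchor, advantage)

for tok, u, f in zip(tokens, uniform, focused):
    print(f"{tok:>10}  uniform={u:+.2f}  focused={f:+.2f}")
```

Under uniform credit the formula tokens and the coordinate tokens receive the same penalty; under the weighted variant the coordinate tokens carry most of it.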
Key Novelty
Anchor-Token Reinforcement Learning (AT-RL)
  • Identifies 'perceptual anchors' (top ~15% of tokens) that exhibit high cross-modal attention connectivity to image patches, acting as the bridge between vision and language
  • Uses graph-based partitioning (METIS) on the attention topology to group tokens into semantic clusters, calculating a 'perceptual load' weight for each cluster
  • Modulates the per-token advantage (the reward-derived credit) in the RL update step, assigning higher weight to anchor clusters so the model learns primarily from visually grounded tokens
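The three steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the cluster assignment is passed in directly instead of being computed with METIS, and the 'perceptual load' is assumed here to be the mean cross-modal attention mass of a cluster, normalized so that the average token weight is 1.

```python
def select_anchors(cross_modal_mass, top_frac=0.15):
    """Indices of the top `top_frac` tokens by attention mass on image patches."""
    n = len(cross_modal_mass)
    k = max(1, round(n * top_frac))
    ranked = sorted(range(n), key=lambda i: cross_modal_mass[i], reverse=True)
    return set(ranked[:k])


def cluster_weights(cross_modal_mass, cluster_ids):
    """Per-token weight from an assumed 'perceptual load' of its cluster.

    Load = mean attention mass in the cluster (an assumption), rescaled so
    the weights average to 1 over all tokens.
    """
    sums, counts = {}, {}
    for mass, cid in zip(cross_modal_mass, cluster_ids):
        sums[cid] = sums.get(cid, 0.0) + mass
        counts[cid] = counts.get(cid, 0) + 1
    total = sum(cross_modal_mass)
    if total == 0:
        return [1.0] * len(cross_modal_mass)  # no visual signal: fall back to uniform
    mean_mass = total / len(cross_modal_mass)
    return [(sums[cid] / counts[cid]) / mean_mass for cid in cluster_ids]


def modulate_advantages(advantage, weights):
    """Scale one sequence-level advantage into per-token advantages."""
    return [advantage * w for w in weights]


# Hypothetical attention mass for 20 response tokens (not real model output).
mass = [0.02, 0.01, 0.30, 0.25, 0.03, 0.02, 0.01, 0.01, 0.02, 0.03,
        0.40, 0.35, 0.02, 0.01, 0.02, 0.01, 0.02, 0.03, 0.02, 0.02]
# Hypothetical cluster ids standing in for a graph partition of the attention topology.
clusters = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0]

anchors = select_anchors(mass)
weights = cluster_weights(mass, clusters)
token_advantages = modulate_advantages(1.0, weights)
```

Normalizing the weights to average 1 is a design choice made for this sketch: it keeps the overall scale of the update comparable to uniform broadcasting while shifting credit toward the visually grounded clusters.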
Evaluation Highlights
  • Qwen2.5-VL-32B trained with AT-RL achieves 80.2% on MathVista, surpassing the significantly larger Qwen2.5-VL-72B-Instruct (77.8%)
  • Improves average performance of Qwen2.5-VL-7B by +8.24 percentage points across five math benchmarks (including MathVerse and WeMath) when combined with SAPO
  • Demonstrates robust generalization to video reasoning, improving 64-frame video accuracy on VSI-Bench by +11.8 points over the zero-shot baseline
Breakthrough Assessment
8/10
Offers a computationally efficient approach to the credit assignment problem in multimodal RL, grounded in the model's own cross-modal attention. A 32B model trained with AT-RL outperforming the 72B instruct model on MathVista (80.2% vs. 77.8%) is a notable validation of its efficiency.