
ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan Fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan
Tsinghua Shenzhen International Graduate School, Tsinghua University, Harbin Institute of Technology, Shenzhen, Chinese Academy of Sciences, Kuaishou Technology
arXiv (2026)
MM RL Reasoning Factuality

📝 Paper Summary

Reinforcement Learning for MLLMs, Reward Modeling / Verification
ContextRL enhances multimodal reinforcement learning by augmenting verifiers with full solution contexts to detect reasoning errors and feeding mistake reports back to the policy to recover correct responses.
Core Problem
Standard RLVR (reinforcement learning with verifiable rewards) frameworks suffer from two information bottlenecks: verifiers with limited context cannot reliably distinguish correct reasoning from hallucinations (identifiability), and policies often fail to sample any correct response for hard queries (reachability).
Why it matters:
  • Verifiers checking only final answers are susceptible to 'false positives' (right answer, wrong reasoning), leading to reward hacking where models learn invalid shortcuts
  • When policies fail to generate any correct response in a sampling group (all-negative), the learning signal collapses, preventing the model from acquiring new knowledge on hard tasks
Concrete Example: A verifier checking only the final answer might reward a math solution that arrives at the correct number via incorrect logic. Conversely, for a hard query, if the model samples 16 incorrect responses, it receives no positive signal to learn 'what to do,' only 'what not to do.'
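The signal collapse in an all-negative group can be illustrated with the group-normalized advantage used by GRPO-style methods, where each reward is standardized against its own sampling group. The sketch below is a minimal illustration, assuming binary rewards; the helper name `group_advantages` and the constant `eps` are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampling group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All 16 samples are wrong: every advantage is zero, so nothing is learned.
all_negative = group_advantages([0.0] * 16)

# A single correct sample in the group restores a non-zero signal.
mixed = group_advantages([0.0] * 15 + [1.0])
```

With 16 identical zero rewards, the mean and standard deviation are both zero, so every advantage is zero. Adding one correct response makes its advantage positive and the others negative, which is why recovering even one correct sample matters.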
Key Novelty
Context-Augmented Reinforcement Learning (ContextRL)
  • Augments the reward model with full reference solutions (reasoning + answer) rather than just the final answer, allowing it to generate specific 'mistake reports' for incorrect samples
  • Introduces a multi-turn sampling strategy where the policy receives these mistake reports for failed attempts, guiding it to generate correct 'recovery' responses that are then used for training
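The two ideas above can be sketched as a single sampling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `policy` and `verifier` are hypothetical callables standing in for the MLLM and the context-augmented reward model, `Verdict` and `sample_group` are illustrative names, and triggering the second turn only when every first-turn sample fails is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    correct: bool
    mistake_report: Optional[str] = None  # set only for incorrect samples


# Hypothetical signatures (assumptions, not the paper's API):
#   policy(query, feedback) -> response
#   verifier(response, reference_solution) -> Verdict
Policy = Callable[[str, Optional[str]], str]
Verifier = Callable[[str, str], Verdict]


def sample_group(query: str, reference_solution: str, policy: Policy,
                 verifier: Verifier, group_size: int = 4) -> List[Tuple[str, float]]:
    """Return (response, reward) pairs, adding a recovery response on failure."""
    scored: List[Tuple[str, float]] = []
    reports: List[str] = []
    # Turn 1: sample without feedback; the verifier sees the full reference.
    for _ in range(group_size):
        response = policy(query, None)
        verdict = verifier(response, reference_solution)
        scored.append((response, 1.0 if verdict.correct else 0.0))
        if verdict.mistake_report:
            reports.append(verdict.mistake_report)

    # Turn 2 (assumed trigger): the group is all-negative, so feed one
    # mistake report back and keep the retry only if it is verified correct.
    all_negative = all(reward == 0.0 for _, reward in scored)
    if all_negative and reports:
        recovery = policy(query, reports[0])
        if verifier(recovery, reference_solution).correct:
            scored.append((recovery, 1.0))
    return scored


# Toy stand-ins: the policy only succeeds once it has seen a mistake report.
def toy_policy(query: str, feedback: Optional[str]) -> str:
    return "x = 4" if feedback else "x = 5"


def toy_verifier(response: str, reference_solution: str) -> Verdict:
    if response in reference_solution:
        return Verdict(correct=True)
    return Verdict(correct=False, mistake_report="The final step divides 8 by 2 incorrectly.")


reference = "2x - 4 = 4, so 2x = 8, x = 4"
group = sample_group("Solve 2x - 4 = 4", reference, toy_policy, toy_verifier)
```

In this toy run all four first-turn samples fail, so the loop appends one verified recovery response, giving the group a positive example to train on.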
Evaluation Highlights
  • Enables Qwen3-VL-8B to achieve performance comparable to the significantly larger 32B model variant
  • Outperforms standard RLVR baselines (like GRPO) by a large margin across 11 perception and reasoning benchmarks
  • Successfully mitigates reward hacking by reducing false-positive samples that have correct answers but flawed reasoning
Breakthrough Assessment
8/10
Addresses two fundamental bottlenecks in RLVR (sparse rewards and reward hacking) with a theoretically grounded mechanism. The claim that an 8B model reaches performance comparable to the 32B variant is significant.