InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Yue Cao, Yangzhou Liu, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Han Lv, De-Hua Chen, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zhang, Wenqi Shao, Junjun He, et al.
Shanghai Artificial Intelligence Laboratory, Nanjing University, Tsinghua University, University of Science and Technology of China
arXiv.org (2025)
MM Pretraining RL Benchmark Agent

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Vision-Language Pre-training
InternVL3 replaces post-hoc adaptation with native multimodal pre-training, jointly optimizing vision and language parameters from the start to improve alignment and efficiency without complex bridging stages.
Core Problem
Most MLLMs use 'post-hoc' adaptation where a frozen text-only LLM is retrofitted with a vision encoder, creating modality alignment gaps and requiring complex, resource-intensive multi-stage fine-tuning.
Why it matters:
  • Existing pipelines often freeze parameters or require specialized auxiliary data to prevent degrading the LLM's core language skills
  • Bridging modality gaps after the fact is inefficient compared to learning joint representations from the beginning
  • Current approaches struggle with long multimodal contexts and complex reasoning due to rigid positional encodings and distribution shifts
Concrete Example: In a standard 'post-hoc' MLLM, the language model is pre-trained only on text; when adapted to vision, it often hallucinates or fails to ground visual details because the parameters weren't optimized for visual signals. InternVL3 trains on both simultaneously, so 'blue' is learned alongside pixels of blue objects.
Key Novelty
Native Multimodal Pre-training Paradigm
  • Jointly trains all model parameters (ViT, MLP, LLM) on interleaved text and multimodal data from the start, rather than adapting a pre-trained text model later
  • Uses Variable Visual Position Encoding (V2PE) to dynamically assign fractional position indices to visual tokens, allowing better handling of long contexts
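The fractional position indexing in V2PE can be illustrated with a minimal sketch. It assumes that each text token advances the position counter by 1 while each visual token advances it by a smaller step delta < 1; the function name v2pe_positions, the is_visual mask, and the default delta of 0.25 are hypothetical choices for illustration, not taken from the InternVL3 code.

```python
from typing import List


def v2pe_positions(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign position indices where visual tokens take fractional steps.

    Assumption (illustrative, not the authors' implementation): the first
    token sits at position 0; every later token adds 1.0 if it is a text
    token and `delta` (< 1) if it is a visual token.
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must be in (0, 1]")

    positions: List[float] = []
    pos = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            # Visual tokens consume less of the position range than text tokens.
            pos += delta if visual else 1.0
        positions.append(pos)
    return positions


# One text token, four visual tokens, then one text token.
mask = [False, True, True, True, True, False]
print(v2pe_positions(mask, delta=0.25))
```

With delta = 0.25, the four visual tokens together span a single unit of position, so a long run of image tokens uses far less of the model's positional range than it would under standard integer indexing. Setting delta = 1.0 recovers the conventional scheme.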
Evaluation Highlights
  • 72.2 score on the MMMU benchmark (InternVL3-78B), setting a new state-of-the-art for open-source MLLMs
  • Surpasses InternVL2.5 across reasoning, document understanding, and OCR tasks
  • Competitive with top proprietary models including GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro
Breakthrough Assessment
9/10
Significantly simplifies the MLLM training pipeline by showing that 'native' pre-training works at scale, achieving state-of-the-art open-source results and performance competitive with leading proprietary models.