InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Vision-Language Pre-training

InternVL3 replaces post-hoc adaptation with native multimodal pre-training, jointly optimizing vision and language parameters from the start to improve alignment and efficiency without complex bridging stages.

Core Problem

Most MLLMs use 'post-hoc' adaptation where a frozen text-only LLM is retrofitted with a vision encoder, creating modality alignment gaps and requiring complex, resource-intensive multi-stage fine-tuning.

Why it matters:

Existing pipelines often freeze parameters or require specialized auxiliary data to prevent degrading the LLM's core language skills
Bridging modality gaps after the fact is inefficient compared to learning joint representations from the beginning
Current approaches struggle with long multimodal contexts and complex reasoning due to rigid positional encodings and distribution shifts

Concrete Example: In a standard 'post-hoc' MLLM, the language model is pre-trained only on text; when adapted to vision, it often hallucinates or fails to ground visual details because the parameters weren't optimized for visual signals. InternVL3 trains on both simultaneously, so 'blue' is learned alongside pixels of blue objects.

Key Novelty

Native Multimodal Pre-training Paradigm

Jointly trains all model parameters (ViT, MLP, LLM) on interleaved text and multimodal data from the start, rather than adapting a pre-trained text model later
Uses Variable Visual Position Encoding (V2PE) to dynamically assign fractional position indices to visual tokens, allowing better handling of long contexts

Evaluation Highlights

72.2 score on the MMMU benchmark (InternVL3-78B), setting a new state-of-the-art for open-source MLLMs
Surpasses InternVL2.5 across reasoning, document understanding, and OCR tasks
Competitive with top proprietary models including GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro

Breakthrough Assessment

9/10

Significantly simplifies the MLLM training pipeline by showing that 'native' pre-training works at scale, achieving state-of-the-art open-source results and competitive performance against closed models.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Autoregressive Pre-training and Generation

Inputs: Interleaved sequences of text tokens and visual inputs (images/video)

Outputs: Next text token prediction (visual tokens serve as context)
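
Because visual tokens serve only as context, the training loss is computed on text positions alone. The following is a minimal sketch of that idea in NumPy, not the authors' implementation: the function name, the argument layout (logits already aligned with their targets), and the default uniform weighting are assumptions made for illustration.

```python
import numpy as np

def text_only_nll(logits, targets, is_text, weights=None):
    """Weighted negative log-likelihood over text positions only.

    logits:  (T, V) scores, where row i is the prediction for targets[i]
    targets: (T,)   ground-truth token ids
    is_text: (T,)   True where the target is a text token
    weights: (T,)   optional per-token weights w_i
                    (assumption: uniform average over text tokens if None)
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    mask = np.asarray(is_text, dtype=bool)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Log-probability assigned to each ground-truth token.
    token_ll = log_probs[np.arange(len(targets)), targets]

    if weights is None:
        weights = np.ones(len(targets)) / max(int(mask.sum()), 1)

    # Visual positions are masked out, so they contribute nothing to the loss.
    return float(-(np.asarray(weights) * token_ll * mask).sum())

# Toy sequence: text, visual, text, with a vocabulary of 4 tokens.
demo_logits = np.zeros((3, 4))
demo_targets = np.array([0, 1, 2])
demo_is_text = np.array([True, False, True])
demo_loss = text_only_nll(demo_logits, demo_targets, demo_is_text)
```

Changing the logits at the visual position leaves `demo_loss` unchanged, which is the property the masking is meant to guarantee.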

Pipeline Flow

Vision Encoder (InternViT) extracts features
Pixel Unshuffle reduces visual token count
MLP Projector aligns visual features
LLM (Qwen/InternLM) generates text response
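
The token reduction in the second step can be illustrated with a short NumPy sketch. It assumes a ViT patch size of 14 and a 2x2 unshuffle factor, neither of which is stated in this summary; they are chosen because they reproduce the 256 tokens per 448x448 tile listed under the hyperparameters. The channel width of 8 is a toy value.

```python
import numpy as np

def pixel_unshuffle(features, factor=2):
    """Fold each factor x factor block of spatial positions into channels.

    features: (H, W, C) grid of visual features
    returns:  (H / factor, W / factor, C * factor * factor)
    """
    h, w, c = features.shape
    if h % factor or w % factor:
        raise ValueError("spatial dimensions must be divisible by factor")
    x = features.reshape(h // factor, factor, w // factor, factor, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/f, W/f, f, f, C)
    return x.reshape(h // factor, w // factor, factor * factor * c)

TILE = 448   # tile side in pixels (from the summary)
PATCH = 14   # assumed ViT patch size
grid = TILE // PATCH

# Toy feature map standing in for the vision encoder output.
vit_features = np.random.default_rng(0).normal(size=(grid, grid, 8))
merged = pixel_unshuffle(vit_features, factor=2)

tokens_before = grid * grid
tokens_after = merged.shape[0] * merged.shape[1]
print(tokens_before, tokens_after)
```

Under these assumptions the sequence shrinks by a factor of four while the channel dimension grows by the same factor, so no feature values are discarded before the MLP projector.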

System Modules

Vision Encoder (Input Processing)

Extract visual features from images/video

Model or implementation: InternViT-300M or InternViT-6B

Connector (Input Processing)

Project visual embeddings to LLM input space

Model or implementation: Two-layer MLP (Randomly initialized)

Large Language Model

Generate text response conditioned on multimodal inputs

Model or implementation: Qwen2.5 series or InternLM3-8B (Base models)

Novel Architectural Elements

Variable Visual Position Encoding (V2PE): Uses fractional increments (delta < 1) for visual tokens while keeping delta=1 for text, enabling longer context handling
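
A minimal sketch of how such position indices could be assigned is shown below. It assumes that the first token sits at position 0 and that each later token's index is the previous index plus the increment for its own type; the function name and the value delta = 0.25 are illustrative choices, not values taken from this summary.

```python
def v2pe_positions(is_visual, delta=0.25):
    """Assign a position index to every token in an interleaved sequence.

    is_visual: iterable of booleans, True for visual tokens
    delta:     increment for visual tokens (assumed to satisfy 0 < delta <= 1)
    Text tokens advance the index by 1, visual tokens by delta.
    """
    positions = []
    for i, visual in enumerate(is_visual):
        if i == 0:
            positions.append(0.0)
        else:
            step = delta if visual else 1.0
            positions.append(positions[-1] + step)
    return positions

# Two text tokens, four visual tokens, one text token.
sequence = [False, False, True, True, True, True, False]
fractional = v2pe_positions(sequence, delta=0.25)
standard = v2pe_positions(sequence, delta=1.0)
print(fractional[-1], standard[-1])
```

With the fractional increment, the four visual tokens together consume a single unit of position range instead of four, which is how more visual content fits inside the same positional window.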

Modeling

Base Model: InternViT (Vision) + Qwen2.5/InternLM3 (Language)

Training Method: Native Pre-training -> SFT -> Mixed Preference Optimization (MPO)

Objective Functions:

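Purpose: Pre-training (autoregressive).
Formally: $\mathcal{L} = -\sum_{i} w_i \log p_\theta(x_i \mid x_{<i})$, where the sum runs over text tokens only and $w_i$ is the loss weight of token $i$.

Purpose: Preference Learning (MPO).
Formally: $\mathcal{L}_{\text{MPO}} = w_p \mathcal{L}_{\text{DPO}} + w_q \mathcal{L}_{\text{BCO}} + w_g \mathcal{L}_{\text{LM}}$, a weighted combination of the DPO loss (preference), the BCO loss (quality), and the LM loss (generation).

Purpose: DPO Loss.
Formally: $\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)}\right)$, where $y_c$ is the chosen response and $y_r$ the rejected one.

Purpose: BCO Loss.
Formally: $\mathcal{L}_{\text{BCO}} = \mathcal{L}^{+} + \mathcal{L}^{-}$, with $\mathcal{L}^{+} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \delta\right)$ and $\mathcal{L}^{-} = -\log \sigma\left(-\left(\beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)} - \delta\right)\right)$. The chosen and rejected responses are scored independently, and $\delta$ is a reward shift (unrelated to the V2PE position increment).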
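
The preference objectives above can be sketched in plain Python on scalar sequence log-probabilities. This is an illustrative sketch rather than the authors' code: the function names, the default weights of 1.0, and the use of the chosen response's negative log-likelihood as the generation term are assumptions, and the BCO term follows the common formulation in which rewards are measured against a reference policy.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta):
    """Preference loss: push the chosen reward above the rejected reward."""
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)
    return -log_sigmoid(reward_c - reward_r)

def bco_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta, delta):
    """Quality loss: score chosen and rejected responses independently
    against a reward shift delta."""
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)
    return -log_sigmoid(reward_c - delta) - log_sigmoid(-(reward_r - delta))

def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta, delta,
             w_p=1.0, w_q=1.0, w_g=1.0):
    """Weighted sum of preference, quality, and generation terms.
    The weights are hypothetical placeholders, not reported values."""
    preference = dpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta)
    quality = bco_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, beta, delta)
    generation = -logp_c  # assumed: NLL of the chosen response under the policy
    return w_p * preference + w_q * quality + w_g * generation

# When the policy equals the reference, all implicit rewards are zero.
neutral_dpo = dpo_loss(-2.0, -2.0, -2.0, -2.0, beta=0.1)
improved_dpo = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1)
```

In the neutral case the DPO term equals log 2, and it falls once the policy raises the chosen response's log-probability relative to the rejected one, as `improved_dpo` shows.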

Adaptation: Full parameter update (Joint Parameter Optimization) during pre-training; no freezing

Training Data:

Pre-training: 200B tokens total (50B Language, 150B Multimodal)
SFT: 21.7M samples (expanded coverage of tool use, GUI operations, 3D scene understanding, and reasoning)
MPO: 300K samples (Preference pairs)

Key Hyperparameters:

pretraining_data_ratio: 1:3 (Language to Multimodal)
visual_token_count: 256 per 448x448 tile
beta (KL penalty): Not explicitly reported in the excerpt this summary is based on
MPO_rollout_models: SFT versions of InternVL3-8B, 38B, 78B

Compute: InternEVO optimizations yield a 50-200% training speedup relative to InternVL2.5 (GPU hours are not reported in the excerpt this summary is based on)

Comparison to Prior Work

vs. InternVL2.5: InternVL3 uses native joint pre-training instead of post-hoc adaptation; adds MPO and V2PE
vs. Qwen2.5-VL: InternVL3 employs Mixed Preference Optimization (MPO) combining DPO and BCO
vs. Standard MLLMs: Uses V2PE with fractional position indices for visual tokens instead of standard integer increments

Limitations

Requires massive scale data (200B tokens) for native pre-training
Computationally intensive joint optimization of all parameters compared to adapter-only training
V2PE effectiveness depends on the selection of fractional delta values

Reproducibility

The authors state that the training data and model weights will be publicly released. No code URL appears in the excerpt this summary is based on, although the authors promise a release. The models build on open-source base models (Qwen2.5, InternLM3).

📊 Experiments & Results

Evaluation Setup

Broad spectrum MLLM evaluation across reasoning, OCR, video, and general understanding.

Benchmarks:

MMMU (Multi-discipline multimodal reasoning)

Metrics:

Accuracy / Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
MMMU	Score	Not reported in the paper	72.2	Not reported in the paper

Main Takeaways

Native multimodal pre-training enables competitive performance with proprietary models (GPT-4o, Claude 3.5) without complex post-hoc adaptation pipelines.
The 1:3 ratio of language to multimodal data during pre-training was empirically found to yield the best overall performance.
Mixed Preference Optimization (MPO) and Test-Time Scaling (Best-of-N) further enhance reasoning capabilities post-training.
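
Best-of-N selection with a step-level critic can be sketched as follows. This is an assumption-laden illustration: `toy_step_scorer` is a hypothetical stand-in for a process reward model such as VisualPRM, and averaging the step scores into a response score is one plausible aggregation rule, not necessarily the one used in the paper.

```python
from typing import Callable, List, Sequence

StepScorer = Callable[[List[str]], float]

def response_score(steps: List[str], step_scorer: StepScorer) -> float:
    """Average the critic's score for each step, given the steps so far."""
    if not steps:
        return 0.0
    scores = [step_scorer(steps[: i + 1]) for i in range(len(steps))]
    return sum(scores) / len(scores)

def best_of_n(candidates: Sequence[List[str]], step_scorer: StepScorer) -> List[str]:
    """Return the candidate solution with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda steps: response_score(steps, step_scorer))

def toy_step_scorer(steps_so_far: List[str]) -> float:
    # Hypothetical critic: penalize a step that admits to guessing.
    return 0.0 if "guess" in steps_so_far[-1].lower() else 1.0

candidates = [
    ["Read the chart axis.", "Guess the value is 40."],
    ["Read the chart axis.", "The bar reaches 35, so the value is 35."],
]
best = best_of_n(candidates, toy_step_scorer)
```

In practice the N candidates would be sampled from the model itself and the scorer would be a learned critic; only the selection logic is shown here.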

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (ViT and LLM)
Autoregressive language modeling
Preference optimization (DPO/RLHF)

Key Terms

V2PE: Variable Visual Position Encoding—a mechanism using fractional position increments for visual tokens to fit more visual context into the window

MPO: Mixed Preference Optimization—a post-training phase combining preference loss (DPO), quality loss (BCO), and generation loss to align model outputs

SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs

DPO: Direct Preference Optimization—a method to align models to human preferences without a separate reward model

BCO: Binary Classifier Optimization—used here as a quality loss to help the model distinguish absolute response quality

Pixel Unshuffle: An operation that folds small spatial blocks of the visual feature map into the channel dimension, which shortens the visual token sequence (used here so that each 448x448 tile is represented by 256 tokens)

InternEVO: An optimized training infrastructure extending ZeRO for efficient large-scale MLLM training

VisualPRM: Visual Process Reward Model—a critic model used during inference to score steps in a chain-of-thought solution