InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

📝 Paper Summary

Multimodal Large Language Models (MLLMs), Vision-Language Reasoning, Efficient Inference

InternVL3.5 improves multimodal reasoning and efficiency via a two-stage Cascade Reinforcement Learning framework and a dynamic visual router that adjusts token resolution based on image complexity.

Core Problem

Existing open-source MLLMs lag behind commercial models in complex reasoning and incur high computational costs when processing high-resolution visual contexts.

Why it matters:

Commercial models such as GPT-4o hold a significant lead over open-source alternatives in agentic and reasoning tasks.
High-resolution image understanding typically requires processing massive token counts, creating a bottleneck for real-world deployment latency and cost.
Stable and scalable RL frameworks for MLLMs remain an open problem, with current methods often suffering from instability or limited performance ceilings.

Concrete Example: When processing a simple image patch containing minimal detail (e.g., sky), standard models process it with the same high resolution (256 tokens) as a dense document patch, wasting compute. InternVL3.5-Flash detects this and compresses the sky patch to 64 tokens.
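The saving in this example can be made concrete with a little arithmetic. The sketch below counts visual tokens for a hypothetical image; only the 256 and 64 token figures come from the text above, while the patch count and the share of low-detail patches are illustrative assumptions.

```python
FULL_TOKENS = 256        # tokens for a patch kept at high resolution (from the text)
COMPRESSED_TOKENS = 64   # tokens for a patch the router compresses (from the text)


def visual_token_count(is_simple):
    """Total visual tokens given one router decision per patch.

    is_simple: list of bools, True when the patch is low-detail (e.g. sky).
    """
    return sum(COMPRESSED_TOKENS if simple else FULL_TOKENS for simple in is_simple)


# Hypothetical image: 12 patches, 8 of them low-detail (illustrative numbers only).
decisions = [True] * 8 + [False] * 4
baseline = visual_token_count([False] * len(decisions))  # every patch at 256 tokens
routed = visual_token_count(decisions)                   # router applied per patch
print(baseline, routed, 1 - routed / baseline)
```

With this particular mix the routed count is half the baseline; a different share of simple patches would give a different saving.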

Key Novelty

Cascade RL & Visual Resolution Routing

Cascade RL: A coarse-to-fine training strategy starting with offline RL (MPO) for stable warm-up and high-quality rollouts, followed by online RL (GSPO) to refine the output distribution and push the performance ceiling.
Visual Resolution Router (ViR): A trainable module that dynamically decides the compression rate for each image patch, allowing the model to use fewer tokens for simple visual regions without losing accuracy.

Architecture

The architectural difference between InternVL3.5 and InternVL3.5-Flash, highlighting the Visual Resolution Router.

Evaluation Highlights

InternVL3.5-241B-A28B achieves a score of 77.7 on MMMU, narrowing the gap with GPT-5 to 3.9% on general multimodal capabilities.
Achieves up to 4.05x inference speedup compared to InternVL3 by combining the Visual Resolution Router and Decoupled Vision-Language Deployment.
InternVL3.5-Flash reduces visual tokens by 50% while maintaining nearly 100% of the original model's performance.

Breakthrough Assessment

9/10

Significant engineering and methodological advances (Cascade RL, Visual Resolution Router) that narrow the gap between open-source and top-tier proprietary models to under 4% while addressing practical deployment costs.

⚙️ Technical Details

Problem Definition

Setting: Multimodal generative modeling with post-training optimization for reasoning and efficiency.

Inputs: Multimodal token sequence x (images and text instructions)

Outputs: Predicted text response y (reasoning steps and final answer)

Pipeline Flow

Vision Server: Image → InternViT → MLP → ViR (Router) → Pixel Shuffle (Compression)
Network: Compact Feature Transmission (TCP/RDMA)
Language Server: Text + Visual Features → LLM → Response
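The three-stage flow above can be pictured as a producer and a consumer joined by a buffer. The following is a minimal sketch, not the released serving code: the two functions, the string "features", and the in-process queue standing in for the TCP/RDMA link are all hypothetical placeholders.

```python
import queue
import threading


def vision_server(images, link):
    """Stand-in for the vision side: encode each image and send compact features."""
    for image_id, image in images:
        features = f"feat({image})"      # placeholder for ViT, MLP, ViR, pixel shuffle
        link.put((image_id, features))   # placeholder for the TCP/RDMA transfer
    link.put(None)                       # sentinel: no more requests


def language_server(prompts, link, responses):
    """Stand-in for the language side: consume features as soon as they arrive."""
    while True:
        item = link.get()
        if item is None:
            break
        image_id, features = item
        # Placeholder for LLM decoding over the text prompt plus visual features.
        responses.append(f"{prompts[image_id]}: {features}")


images = [("a", "sky"), ("b", "dense document")]
prompts = {"a": "Describe", "b": "Transcribe"}
link = queue.Queue(maxsize=2)            # bounded buffer between the two servers
responses = []

producer = threading.Thread(target=vision_server, args=(images, link))
consumer = threading.Thread(target=language_server, args=(prompts, link, responses))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(responses)
```

Because the two sides run concurrently, the vision side can start on the next image while the language side is still decoding the previous request, which is the kind of overlap a decoupled deployment aims for.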

System Modules

Vision Encoder (Vision Processing)

Extract visual features from images

Model or implementation: InternViT-300M or InternViT-6B

Visual Resolution Router (ViR) (Vision Processing)

Determine compression rate for each patch

Model or implementation: Binary classifier

Pixel Shuffle (Vision Processing)

Compress visual tokens based on router decision

Model or implementation: Pixel shuffle operation

Large Language Model

Generate text response

Model or implementation: Qwen3 or GPT-OSS (Dense or MoE)
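The router and pixel-shuffle modules listed above can be sketched in a few lines. This is an illustration under stated assumptions rather than the model's implementation: the 32 x 32 token grid, the mean-pooled logistic router, its untrained weights, and the 0.5 threshold are all hypothetical; only the binary decision and the 256 versus 64 token outcomes come from the summary.

```python
import math
import random


def pixel_shuffle_compress(grid, r):
    """Space-to-depth: merge each r x r block of tokens into one wider token.

    grid: H x W nested list, each entry a list of C channel values.
    Returns an (H // r) x (W // r) grid whose tokens have r * r * C values.
    """
    h, w = len(grid), len(grid[0])
    return [
        [
            [value
             for a in range(r) for b in range(r)
             for value in grid[i * r + a][j * r + b]]
            for j in range(w // r)
        ]
        for i in range(h // r)
    ]


def route_patch(grid, weights, bias, threshold=0.5):
    """Hypothetical binary router: logistic score on mean-pooled token features.

    Returns the per-side downsampling factor: 4 for patches judged simple,
    2 otherwise.
    """
    tokens = [token for row in grid for token in row]
    pooled = [sum(t[c] for t in tokens) / len(tokens) for c in range(len(weights))]
    score = sum(w * x for w, x in zip(weights, pooled)) + bias
    prob_simple = 1.0 / (1.0 + math.exp(-score))
    return 4 if prob_simple > threshold else 2


rng = random.Random(0)
# Assumed 32 x 32 grid of tokens with 2 channels each (sizes are illustrative).
patch = [[[rng.random(), rng.random()] for _ in range(32)] for _ in range(32)]
weights, bias = [0.0, 0.0], 1.0          # untrained placeholder parameters
r = route_patch(patch, weights, bias)
tokens = pixel_shuffle_compress(patch, r)
print(r, len(tokens) * len(tokens[0]), len(tokens[0][0]))
```

Under the assumed grid size, a factor of 2 per side leaves 256 tokens and a factor of 4 leaves 64, matching the two token budgets quoted in the summary. The compressed tokens are wider, so information is rearranged into channels rather than simply discarded.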

Novel Architectural Elements

Visual Resolution Router (ViR) integration for dynamic token compression
Decoupled Vision-Language Deployment (DvD) architecture separating ViT and LLM hardware

Modeling

Base Model: InternViT (Vision) + Qwen3/GPT-OSS (Language)

Training Method: Cascade Reinforcement Learning (Offline MPO + Online GSPO) + Visual Consistency Learning (ViCO)

Objective Functions:

Purpose: Pre-training and SFT next token prediction.

Formally: L_NTP = - sum_i log P(x_i | x_<i)

Purpose: Offline RL optimization (MPO).

Formally: L_MPO = w_p * L_preference + w_q * L_quality + w_g * L_generation

Purpose: Online RL optimization (GSPO).

Formally: L_GSPO = - E [ min(s_i A_i, clip(s_i, 1-ε, 1+ε) A_i) ], where s_i is the length-normalized, sequence-level importance ratio of response i and A_i is its advantage

Purpose: Visual Consistency Learning for Flash models.

Formally: L_consistency = KL( P(y|I_high_res) || P(y|I_compressed) )
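The GSPO and consistency objectives above can be checked numerically on toy inputs. The sketch below assumes the importance ratio is computed once per response and length-normalized, which is how GSPO is usually described, and that advantages are already computed; the clipping range of 0.2 and all input values are arbitrary placeholders, not the paper's settings.

```python
import math


def gspo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate with one length-normalized importance ratio per response.

    logp_new, logp_old: lists of per-token log-prob lists, one per response.
    advantages: one advantage per response. The training loss is the negative.
    """
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        mean_diff = sum(n - o for n, o in zip(new, old)) / len(new)
        ratio = math.exp(mean_diff)               # (pi_new(y) / pi_old(y)) ** (1 / |y|)
        clipped = min(max(ratio, 1 - eps), 1 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def consistency_kl(p_high_res, p_compressed):
    """KL(P_high_res || P_compressed) for one next-token distribution.

    Assumes both inputs are strictly positive and sum to 1.
    """
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(p_high_res, p_compressed))


old = [[-1.0, -1.0], [-2.0, -2.0, -2.0]]
# Identical policies: every ratio is 1, so the value is the mean advantage.
print(gspo_objective(old, old, [1.0, -1.0]))
# A large ratio with a positive advantage is capped at 1 + eps.
shifted = [[lp + 1.0 for lp in resp] for resp in old]
print(gspo_objective(shifted, old, [1.0, 1.0]))
print(consistency_kl([0.5, 0.5], [0.9, 0.1]))
```

The clipping keeps a single response from dominating the update when its ratio drifts far from 1, and the KL term is zero only when the compressed and high-resolution inputs yield the same output distribution.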

Adaptation: Full fine-tuning during SFT/RL

Training Data:

Pre-training: 116M samples (250B tokens)
SFT: 56M samples (130B tokens)
Offline RL: MMPR-v1.2 (200K pairs)
Online RL: MMPR-Tiny (70K queries)

Key Hyperparameters:

context_window: 32K tokens
max_sequence_length: 32K tokens
RL_accuracy_filter_min: 0.2
RL_accuracy_filter_max: 0.8
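The two RL_accuracy_filter values read like bounds on per-query rollout accuracy: queries the current policy almost always or almost never solves give little learning signal. Assuming that reading (the summary does not spell it out), a filter might look like the sketch below; the function name, the data layout, and the inclusive bounds are hypothetical.

```python
RL_ACCURACY_FILTER_MIN = 0.2
RL_ACCURACY_FILTER_MAX = 0.8


def filter_queries(rollout_results):
    """Keep queries whose empirical rollout accuracy lies within the bounds.

    rollout_results: dict mapping query id -> list of bools, one per sampled rollout.
    """
    kept = []
    for query_id, outcomes in rollout_results.items():
        accuracy = sum(outcomes) / len(outcomes)
        if RL_ACCURACY_FILTER_MIN <= accuracy <= RL_ACCURACY_FILTER_MAX:
            kept.append(query_id)
    return kept


# Toy rollouts (illustrative only): 8 sampled answers per query.
results = {
    "too_easy": [True] * 8,
    "too_hard": [False] * 8,
    "useful": [True, False, True, False, True, False, False, False],
}
print(filter_queries(results))
```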

Compute: Not reported in the paper

Comparison to Prior Work

vs. InternVL3: Adds Cascade RL (Offline+Online) and dynamic visual compression (ViR)
vs. Step-3: Surpasses it on text tasks by 2.0 to 8.4 points
vs. GPT-5: Narrows performance gap to <4% on general multimodal benchmarks

Limitations

Offline RL algorithms generally have a lower performance ceiling than online methods (addressed by cascading them).
Online RL algorithms are computationally expensive (addressed by using offline RL as warm-up).
High-resolution understanding increases computational costs (addressed by ViR).

Reproducibility

Models (1B to 241B) and code are stated to be publicly released. Training data sources are described (MMPR, InternLM corpora).

📊 Experiments & Results

Evaluation Setup

Comprehensive evaluation across general multimodal, reasoning, and text-centric tasks.

Benchmarks:

MMMU (Multidisciplinary Multimodal Understanding)
MathVista (Mathematical Reasoning)
AI2D (Diagram Understanding)
ChartQA (Chart Understanding)
DocVQA (Document Visual Question Answering)

Metrics:

Accuracy
Overall Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Performance on key reasoning benchmarks demonstrates strong scaling and competitive results against SOTA.

Benchmark	Metric	Baseline	This Paper	Δ
MMMU	Score	73.4	77.7	+4.3
MMMU	Score	69.1	77.7	+8.6
Text Tasks (Average)	Score	Not reported in the paper	Not reported in the paper	+8.4
Inference Speed	Speedup Factor	1.0	4.05	+3.05

Experiment Figures

Scalability of Cascade RL across model sizes.

Main Takeaways

Cascade RL (MPO + GSPO) provides stable and scalable reasoning improvements, outperforming single-stage RL approaches.
The Visual Resolution Router (ViR) enables a 50% reduction in visual tokens with negligible performance loss, proving that not all image patches require high resolution.
InternVL3.5-241B-A28B achieves state-of-the-art results among open-source MLLMs and narrows the gap with closed-source models like GPT-5 to 3.9% on general multimodal benchmarks.

📚 Prerequisite Knowledge

Prerequisites

Multimodal Large Language Models (ViT + LLM architecture)
Reinforcement Learning from Human Feedback (RLHF)
Mixture of Experts (MoE)

Key Terms

Cascade RL: A two-stage reinforcement learning framework using offline RL for initial alignment and online RL for refinement

MPO: Mixed Preference Optimization—an offline RL algorithm combining preference, quality, and generation losses

GSPO: Group Sequence Policy Optimization, an online RL algorithm that refines the policy on groups of self-generated rollouts using a clipped, sequence-level importance ratio

ViR: Visual Resolution Router—a module that dynamically selects the compression rate (resolution) for image patches

ViCO: Visual Consistency Learning—a training stage to integrate ViR by minimizing divergence between high and low-resolution outputs

DvD: Decoupled Vision-Language Deployment, an inference strategy placing the vision encoder and LLM on separate servers (and therefore separate GPUs) to maximize parallelism

MoE: Mixture-of-Experts—a model architecture where only a subset of parameters (experts) are active for each token

NTP: Next Token Prediction—the standard autoregressive loss used in language model pre-training

SFT: Supervised Fine-Tuning—training on high-quality labeled data to align model behavior