Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanyi Chen, Jun Tang, Chengkun Hou, Z. Du, T. Zhou, Wenjie Zhang, H. Ding, Jiahe Li, Wen Li, G. Hu, Yiliang Gu, Si-Yu Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, et al.
Alibaba Group
arXiv.org
(2025)
MM · Reasoning · RL · Pretraining · Benchmark
📝 Paper Summary
Multimodal Large Language Models (MLLMs) · Visual Perception and Reasoning
Ovis2.5 improves multimodal reasoning by integrating a native-resolution vision transformer to preserve fine details and a reflective 'thinking mode' trained via reinforcement learning to enable self-correction.
Core Problem
Current MLLMs struggle with dense visual content because fixed-resolution tiling breaks global structure, and they lack deep reasoning capabilities because they are trained on linear reasoning traces that contain no self-correction.
Why it matters:
Fixed-resolution encoders necessitate image tiling, which compromises the global structure needed to interpret complex charts and diagrams.
Training on linear Chain-of-Thought (CoT) lacks reflective supervision, preventing models from evaluating and refining their own intermediate reasoning steps.
Concrete Example: When analyzing a complex chart, a standard model using fixed-resolution tiling might misinterpret the global layout or axis relationships because the image is split into arbitrary patches, whereas Ovis2.5 processes the full chart at its native aspect ratio.
Replaces fixed-size image tiling with NaViT (Native-resolution Vision Transformer), allowing the model to process images of varying aspect ratios directly to preserve layout and detail.
Introduces a 'thinking mode' enabled by training on data with explicit reflection tags (<think>...</think>), allowing the model to trade latency for accuracy by verifying and correcting its own logic.
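To make the thinking-mode output format concrete, the following is a minimal sketch of how a caller might separate the reflective trace from the final answer. Only the `<think>...</think>` tag format comes from the summary; the helper name `split_thinking`, the sample response, and the policy of reading only the first block are illustrative assumptions, not the Ovis2.5 API.

```python
import re

# Assumption: the reasoning trace is wrapped in one <think>...</think> block.
# The tag format is from the summary; everything else here is hypothetical.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from a raw model response."""
    match = _THINK_RE.search(response)
    if match is None:
        # No thinking block: treat the whole response as the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    # The answer is whatever remains once the thinking block is removed.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


raw = "<think>Bar A is 40, bar B is 25. Wait, B is 35.</think>A exceeds B by 5."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the trace separate lets an application log or hide the reasoning while showing only the answer, which is one way to expose the latency-for-accuracy trade-off to users.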
Architecture
The overall architecture of Ovis2.5 showing the flow from native-resolution images to generated text.
Evaluation Highlights
Ovis2.5-9B achieves an average score of 78.3 on the OpenCompass multimodal leaderboard, setting a new SOTA for open-source MLLMs under 40B parameters.
Ovis2.5-2B achieves 73.9 on OpenCompass, establishing a state-of-the-art result among open-source MLLMs of comparable size.
Achieves a 3–4x end-to-end training speedup via multimodal data packing and hybrid parallelism optimization.
Breakthrough Assessment
8/10
Significantly advances open-source MLLM capabilities by successfully combining native-resolution processing (solving tiling issues) with the 'System 2' reasoning paradigm (reflection) seen in recent LLMs like Qwen3.
⚙️ Technical Details
Problem Definition
Setting: Multimodal generation and reasoning
Inputs: Variable-resolution images and text instructions
Outputs: Textual response (optionally including internal reasoning thoughts)
Pipeline Flow
Input Processing: Native Resolution ViT (NaViT)
Visual Tokenizer & Embedding: Visual Tokenizer -> VET
Generation: Qwen3 LLM
System Modules
Vision Encoder
Extract features from images at native resolution
Model or implementation: NaViT (initialized from SigLIP2-so400m-patch16-512)
Visual Embedding Table (VET)
Align visual tokens structurally with text embeddings
Model or implementation: Learnable embedding table
Language Model
Generate text and reasoning traces based on multimodal inputs
Model or implementation: Qwen3 (Large Language Model)
Novel Architectural Elements
Integration of NaViT (Native-resolution ViT) into the Ovis VET architecture to eliminate fixed-resolution tiling while maintaining the probabilistic visual embedding alignment
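The "probabilistic visual embedding alignment" can be illustrated with a small sketch. It assumes that the visual tokenizer emits one logit per "visual word" and that the final embedding is the probability-weighted sum of VET rows. The table values, logits, and function names below are toy assumptions for illustration, not the released implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(logits, table):
    """Expected embedding: probability-weighted sum of embedding-table rows."""
    probs = softmax(logits)
    dim = len(table[0])
    return [sum(p * row[d] for p, row in zip(probs, table)) for d in range(dim)]


# Toy VET with 3 "visual words" and embedding dimension 2 (assumed values).
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [2.0, 0.0, 0.0]  # hypothetical tokenizer output for one patch
print(visual_embedding(logits, table))
```

Because each patch embedding is a soft mixture over table rows rather than the output of an MLP projector, visual inputs are looked up in the same way as text tokens, which is the structural alignment the summary refers to.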
Modeling
Base Model: Qwen3 (LLM) and SigLIP2 (ViT initialization)
DPO Data: Reasoning traces and general QA preference pairs
RLVR Data: Open-source math/science problems with verifiable rewards
Compute: 3–4x end-to-end speedup achieved via multimodal data packing and hybrid parallelism (Data + Tensor + Context Parallelism).
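The summary does not say how the multimodal data packing is implemented. As one plausible illustration, the sketch below uses greedy first-fit-decreasing bin packing to group variable-length samples into fixed-length training sequences so that less compute is spent on padding. The function name, the sample lengths, and the 1024-token budget are assumptions.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing of sample lengths into bins.

    Returns a list of bins, each a list of sample indices whose total
    length does not exceed max_len.
    """
    bins = []  # each entry: [remaining_capacity, [sample indices]]
    # Place the longest samples first; this usually leaves fewer gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} ({n} tokens) exceeds max_len={max_len}")
        for b in bins:
            if b[0] >= n:  # first bin with enough room
                b[0] -= n
                b[1].append(i)
                break
        else:
            bins.append([max_len - n, [i]])  # no bin fits: open a new one
    return [indices for _, indices in bins]


lengths = [900, 300, 700, 100, 500]  # hypothetical token counts per sample
packed = pack_sequences(lengths, max_len=1024)
print(packed)  # [[0, 3], [2, 1], [4]]
```

Here five samples fit into three packed sequences instead of five padded ones. A real pipeline would also need attention masks that keep packed samples from attending to each other.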
Comparison to Prior Work
vs. Ovis2: Ovis2.5 uses NaViT instead of fixed tiling, Qwen3 instead of Qwen2.5, and adds 'thinking mode' training.
vs. Standard MLLMs (e.g., LLaVA): Ovis2.5 aligns embeddings via a Visual Embedding Table (VET) rather than a simple MLP projector.
Limitations
Reliance on the Qwen3 backbone means that performance is bounded by the base LLM's capabilities.
RLVR (Reinforcement Learning with Verifiable Rewards) is primarily focused on math/science, potentially limiting reasoning improvements in more subjective domains.
High-resolution processing (up to 3.2M pixels) increases computational cost, though mitigated by packing and parallelism.
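To give a sense of the cost behind the 3.2M-pixel ceiling, the sketch below estimates how many patches a native-resolution encoder would produce for an image. The patch size of 16 is read off the SigLIP2-so400m-patch16-512 checkpoint name. The aspect-preserving downscale and the rounding policy are assumptions, since the summary does not describe the preprocessing.

```python
import math

PATCH = 16              # from "patch16" in the SigLIP2 checkpoint name
MAX_PIXELS = 3_200_000  # "up to 3.2M pixels" from the summary


def native_patch_grid(width, height, patch=PATCH, max_pixels=MAX_PIXELS):
    """Return (cols, rows) of patches after an aspect-preserving downscale.

    Images already under the pixel budget are not upscaled. Sizes are
    rounded down to a multiple of the patch size (an assumed policy).
    """
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    cols = max(1, int(width * scale) // patch)
    rows = max(1, int(height * scale) // patch)
    return cols, rows


for w, h in [(1024, 512), (4000, 3000)]:
    cols, rows = native_patch_grid(w, h)
    print(f"{w}x{h}: {cols}x{rows} grid, {cols * rows} patches")
```

The patch count grows linearly with pixel count, so a large image near the budget yields several times as many visual tokens as a 1024x512 chart, which is why packing and parallelism matter here.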
Reproducibility
The Ovis2.5-9B and Ovis2.5-2B models are released as open source. Code for the Ovis architecture is available from prior releases, but the summarized text does not cite a specific Ovis2.5 repository link. Training data includes public datasets (COYO, LAION, DataComp) and internal data.
📊 Experiments & Results
Evaluation Setup
Evaluation on comprehensive multimodal benchmarks
Benchmarks:
OpenCompass multimodal leaderboard (aggregate)
Metrics:
Average Score
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
| --- | --- | --- | --- | --- |
| OpenCompass | Average Score | 73.9 | 78.3 | +4.4 |
| OpenCompass | Average Score | Not reported in the paper | 73.9 | Not reported in the paper |
Main Takeaways
Ovis2.5-9B achieves state-of-the-art performance among open-source MLLMs under 40B parameters on OpenCompass.
Ovis2.5-2B maintains the 'small model, big performance' philosophy, setting SOTA for its size class.
The switch to a native-resolution ViT and the inclusion of reflective reasoning training are credited with boosting performance on dense visual tasks such as chart analysis.
📚 Prerequisite Knowledge
Prerequisites
Transformer architecture (ViT and LLM)
Multimodal alignment (projectors/embeddings)
Reinforcement Learning from Human Feedback (RLHF)
Key Terms
NaViT: Native-resolution Vision Transformer—a vision encoder that processes images at their original resolutions and aspect ratios without resizing or padding to fixed squares
VET: Visual Embedding Table—a learnable dictionary in the Ovis architecture that structurally aligns visual tokens with text embeddings by storing dedicated embeddings for 'visual words'
DPO: Direct Preference Optimization—a stable method for aligning language models to human preferences using paired data without needing a separate reward model
GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes the policy based on the relative performance of a group of outputs
RoPE: Rotary Position Embeddings—a technique for encoding positional information in transformers by rotating embedding vectors, crucial here for handling variable image resolutions
CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer
Visual Tokenizer: A component that extracts features from image patches and projects them into a probabilistic distribution over a discrete visual vocabulary
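The GRPO entry above can be made concrete with a short sketch of the group-relative advantage: several responses are sampled for one prompt, each is scored, and every reward is normalized against the group's mean and standard deviation. The exact-match reward function, the use of the population standard deviation, and the `eps` constant are assumptions chosen for illustration. The summary does not give the paper's exact formulation.

```python
from statistics import mean, pstdev


def verifiable_reward(prediction: str, gold: str) -> float:
    """Hypothetical verifiable reward: 1.0 on an exact match, else 0.0."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math problem whose reference answer is "42".
samples = ["42", "41", "7", " 42 "]
rewards = [verifiable_reward(s, "42") for s in samples]
print(rewards)
print(group_relative_advantages(rewards))
```

Correct samples receive a positive advantage and incorrect ones a negative advantage. If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.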