Ovis2.5 Technical Report

📝 Paper Summary

Multimodal Large Language Models (MLLMs)
Visual Perception and Reasoning

Ovis2.5 improves multimodal reasoning by integrating a native-resolution vision transformer to preserve fine details and a reflective 'thinking mode' trained via reinforcement learning to enable self-correction.

Core Problem

Current MLLMs struggle with dense visual content because fixed-resolution tiling breaks global structure, and they lack deep reasoning capabilities because they are trained on linear reasoning paths without self-correction.

Why it matters:

Fixed-resolution encoders necessitate image tiling, which compromises the global structure needed to interpret complex charts and diagrams.
Training on linear Chain-of-Thought (CoT) lacks reflective supervision, preventing models from evaluating and refining their own intermediate reasoning steps.

Concrete Example: When analyzing a complex chart, a standard model using fixed-resolution tiling might misinterpret the global layout or axis relationships because the image is split into arbitrary patches, whereas Ovis2.5 processes the full chart at its native aspect ratio.

Key Novelty

Native-Resolution Perception + Reflective Reasoning Mode

Replaces fixed-size image tiling with NaViT (Native-resolution Vision Transformer), allowing the model to process images of varying aspect ratios directly to preserve layout and detail.
Introduces a 'thinking mode' enabled by training on data with explicit reflection tags (<think>...</think>), allowing the model to trade latency for accuracy by verifying and correcting its own logic.
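
As a rough illustration of how a caller might consume thinking-mode output, the sketch below separates the reflective trace from the final answer. It assumes the response contains at most one <think>...</think> pair, as in the tag format mentioned above; the helper name split_thinking and the sample string are hypothetical and are not taken from the Ovis2.5 release.

```python
import re

# Assumption: the reasoning trace sits inside a single <think>...</think> pair
# and everything outside the tags is the user-facing answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(response: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty when no tags are present."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    # Remove the tagged span and keep whatever text surrounds it.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


# Hypothetical model output, used only to exercise the helper.
sample = "<think>2 + 2 = 5? No, recheck: 2 + 2 = 4.</think>The answer is 4."
reasoning, answer = split_thinking(sample)
print(answer)
```

Keeping the trace separate lets an application log or hide the reasoning while showing only the answer, which is one way to use the latency-for-accuracy trade-off described above.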

Architecture

The overall architecture of Ovis2.5 showing the flow from native-resolution images to generated text.

Evaluation Highlights

Ovis2.5-9B achieves an average score of 78.3 on the OpenCompass multimodal leaderboard, setting a new SOTA for open-source MLLMs under 40B parameters.
Ovis2.5-2B achieves 73.9 on OpenCompass, establishing a state-of-the-art result among open-source MLLMs of comparable size.
Achieves a 3–4x end-to-end training speedup via multimodal data packing and hybrid parallelism optimization.

Breakthrough Assessment

8/10

Significantly advances open-source MLLM capabilities by successfully combining native-resolution processing (solving tiling issues) with the 'System 2' reasoning paradigm (reflection) seen in recent LLMs like Qwen3.

⚙️ Technical Details

Problem Definition

Setting: Multimodal generation and reasoning

Inputs: Variable-resolution images and text instructions

Outputs: Textual response (optionally including internal reasoning thoughts)

Pipeline Flow

Input Processing: Native Resolution ViT (NaViT)
Visual Tokenizer & Embedding: Visual Tokenizer -> VET
Generation: Qwen3 LLM

System Modules

Vision Encoder

Extract features from images at native resolution

Model or implementation: NaViT (initialized from SigLIP2-so400m-patch16-512)

Visual Tokenizer (VT), part of the Visual Tokenizer & Embedding stage

Project visual features to probabilistic visual tokens

Model or implementation: Transformer-based tokenizer

Visual Embedding Table (VET), part of the Visual Tokenizer & Embedding stage

Align visual tokens structurally with text embeddings

Model or implementation: Learnable embedding table

Language Model

Generate text and reasoning traces based on multimodal inputs

Model or implementation: Qwen3 (Large Language Model)

Novel Architectural Elements

Integration of NaViT (Native-resolution ViT) into the Ovis VET architecture to eliminate fixed-resolution tiling while maintaining the probabilistic visual embedding alignment
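
The probabilistic visual embedding mentioned above can be sketched as follows: the visual tokenizer produces a distribution over a visual vocabulary for each patch, and the patch embedding is the probability-weighted mix of VET rows. This is a minimal pure-Python sketch under that reading; the table sizes, logits, and function names are illustrative assumptions, not the released implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def visual_embedding(logits, embedding_table):
    """Probability-weighted average of VET rows for one image patch."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_table))
        for d in range(dim)
    ]


# Toy VET (hypothetical sizes): 3 "visual words", embedding dimension 2.
vet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Uniform logits: the patch embedding is the plain average of all rows.
uniform = visual_embedding([0.0, 0.0, 0.0], vet)

# Sharply peaked logits: the embedding approaches a single row (a "hard" token).
peaked = visual_embedding([100.0, 0.0, 0.0], vet)
print(uniform, peaked)
```

With peaked logits the lookup behaves like a text-token embedding lookup, which is the sense in which the VET mirrors the text embedding table.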

Modeling

Base Model: Qwen3 (LLM) and SigLIP2 (ViT initialization)

Training Method: 5-phase curriculum: three supervised phases (VET pre-training, multimodal pre-training, instruction tuning), followed by DPO and GRPO

Objective Functions:

Purpose: Pre-train the Visual Embedding Table.

Formally: Training is restricted to the final ViT layer, the visual head, and the VET.

Purpose: Align to preferences.

Formally: Direct Preference Optimization (DPO) augmented with an auxiliary Negative Log-Likelihood (NLL) term.

Purpose: Improve reasoning via reinforcement learning.

Formally: Group Relative Policy Optimization (GRPO) on verifiable rewards (math/science).
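
To make the preference objective above concrete, here is a minimal sketch of the standard DPO loss with an added NLL term on the chosen response. The summary does not give the exact weighting or normalization used in Ovis2.5, so beta, nll_weight, and the length-normalized NLL are assumptions chosen for illustration.

```python
import math


def dpo_with_nll(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                 chosen_len, beta=0.1, nll_weight=1.0):
    """DPO loss plus auxiliary NLL for a single preference pair.

    Inputs are summed sequence log-probabilities under the policy (pol_*)
    and the frozen reference model (ref_*). beta and nll_weight are
    hypothetical hyperparameters, not values from the report.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    dpo = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    # Assumed auxiliary term: per-token NLL of the chosen response.
    nll = -pol_chosen / chosen_len
    return dpo + nll_weight * nll


# Policy equals reference: zero margin, so the DPO term is log(2).
loss_start = dpo_with_nll(-10.0, -12.0, -10.0, -12.0, chosen_len=10)

# Policy raises the chosen log-probability: both terms shrink.
loss_better = dpo_with_nll(-8.0, -12.0, -10.0, -12.0, chosen_len=10)
print(loss_start, loss_better)
```

The NLL term keeps the likelihood of the chosen response from drifting downward, a known failure mode when only the preference margin is optimized.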

Training Data:

Phase 1: Image-caption pairs (VET training)
Phase 2: Multimodal pre-training (OCR, grounding, captions)
Phase 3: Instruction tuning (text-only, video, multi-image, 'thinking-style' data)
DPO Data: Reasoning traces and general QA preference pairs
RLVR Data: Open-source math/science problems with verifiable rewards
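
The verifiable-reward setup can be sketched as a rule-based checker followed by GRPO-style group normalization. The exact-match checker and the population standard deviation below are simplifying assumptions; the summary does not specify the reward rules or the normalization details used in training.

```python
import statistics


def verifiable_reward(predicted: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 on exact match after trimming."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one problem whose reference answer is "42".
answers = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because advantages are computed relative to the group, no learned value model is needed; a group in which every answer is right (or every answer is wrong) yields zero advantage and therefore no gradient signal.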

Compute: 3–4x end-to-end speedup achieved via multimodal data packing and hybrid parallelism (Data + Tensor + Context Parallelism).
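
Data packing reduces padding waste by concatenating variable-length samples into fixed token budgets. The report's packing algorithm is not described in this summary, so the sketch below swaps in a generic first-fit-decreasing heuristic; the function name, sample lengths, and 1000-token budget are hypothetical.

```python
def pack_sequences(lengths, max_tokens):
    """Greedy first-fit-decreasing packing.

    Returns a list of packs, each a list of sample indices whose total
    length fits within max_tokens.
    """
    bins = []  # each entry: [remaining_capacity, [sample indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_tokens:
            raise ValueError(f"sample {i} ({n} tokens) exceeds the budget")
        for b in bins:
            if b[0] >= n:
                b[0] -= n
                b[1].append(i)
                break
        else:
            # No existing pack has room, so open a new one.
            bins.append([max_tokens - n, [i]])
    return [b[1] for b in bins]


# Hypothetical per-sample token counts (visual + text tokens).
lengths = [900, 300, 700, 100, 500, 500]
packs = pack_sequences(lengths, max_tokens=1000)
print(packs)
```

In this toy case six samples fit into three full packs, whereas padding each sample to 1000 tokens would process twice as many token positions. A real implementation would also need attention masks that stop packed samples from attending to one another.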

Comparison to Prior Work

vs. Ovis2: Ovis2.5 uses NaViT instead of fixed tiling, Qwen3 instead of Qwen2.5, and adds 'thinking mode' training.
vs. Standard MLLMs (e.g., LLaVA): Ovis2.5 aligns embeddings via a Visual Embedding Table (VET) rather than a simple MLP projector.

Limitations

Reliance on the Qwen3 backbone implies that performance is bounded by the base LLM's capabilities.
RLVR (Reinforcement Learning with Verifiable Rewards) is primarily focused on math/science, potentially limiting reasoning improvements in more subjective domains.
High-resolution processing (up to 3.2M pixels) increases computational cost, though mitigated by packing and parallelism.
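
The cost of native-resolution input can be estimated from the patch grid. The sketch below assumes 16-pixel patches (suggested by the SigLIP2-so400m-patch16-512 checkpoint name), the roughly 3.2M-pixel cap quoted above, and a simple aspect-preserving downscale for oversized images; the resizing rule is hypothetical, and any token merging the model may apply after the ViT is ignored.

```python
import math

PATCH = 16              # assumption from the "patch16" checkpoint name
MAX_PIXELS = 3_200_000  # approximate cap quoted in the limitations above


def patch_grid(height, width, patch=PATCH, max_pixels=MAX_PIXELS):
    """Return (rows, cols, tokens) for an image kept at its own aspect ratio."""
    # Shrink only when the image exceeds the pixel budget (hypothetical rule).
    scale = min(1.0, math.sqrt(max_pixels / (height * width)))
    h = max(patch, round(height * scale))
    w = max(patch, round(width * scale))
    rows, cols = math.ceil(h / patch), math.ceil(w / patch)
    return rows, cols, rows * cols


square = patch_grid(512, 512)    # a square image
banner = patch_grid(448, 1792)   # a wide chart, no tiling needed
print(square, banner)
```

Since self-attention cost grows roughly quadratically with the number of patch tokens, a wide chart with about three times the tokens of a 512x512 image costs about nine times as much attention compute in the ViT, which is why packing and parallelism matter here.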

Reproducibility

The Ovis2.5-9B and Ovis2.5-2B models are released as open source. Code for the Ovis architecture is generally available from prior releases, though a specific Ovis2.5 repository link is not cited in the source text. Training data includes public sets (COYO, LAION, DataComp) and internal data.

📊 Experiments & Results

Evaluation Setup

Evaluation on comprehensive multimodal benchmarks

Benchmarks:

OpenCompass multimodal leaderboard (aggregate)

Metrics:

Average Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Model	Benchmark	Metric	Baseline	This Paper	Δ
Ovis2.5-9B	OpenCompass	Average Score	73.9	78.3	+4.4
Ovis2.5-2B	OpenCompass	Average Score	Not reported in the paper	73.9	Not reported in the paper

Main Takeaways

Ovis2.5-9B achieves state-of-the-art performance among open-source MLLMs under 40B parameters on OpenCompass.
Ovis2.5-2B maintains the 'small model, big performance' philosophy, setting SOTA for its size class.
The switch to a native-resolution ViT and the inclusion of reflective reasoning training significantly boost performance on dense visual tasks like chart analysis.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (ViT and LLM)
Multimodal alignment (projectors/embeddings)
Reinforcement Learning from Human Feedback (RLHF)

Key Terms

NaViT: Native-resolution Vision Transformer—a vision encoder that processes images at their original resolutions and aspect ratios without resizing or padding to fixed squares

VET: Visual Embedding Table—a learnable dictionary in the Ovis architecture that structurally aligns visual tokens with text embeddings by storing dedicated embeddings for 'visual words'

DPO: Direct Preference Optimization—a stable method for aligning language models to human preferences using paired data without needing a separate reward model

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes the policy based on the relative performance of a group of outputs

RoPE: Rotary Position Embeddings—a technique for encoding positional information in transformers by rotating embedding vectors, crucial here for handling variable image resolutions

CoT: Chain-of-Thought—a prompting method where models generate intermediate reasoning steps before the final answer

Visual Tokenizer: A component that extracts features from image patches and projects them into a probabilistic distribution over a discrete visual vocabulary
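
The RoPE entry above can be illustrated with a minimal one-dimensional sketch: each pair of features is rotated by an angle proportional to the token position, so dot products depend only on relative offsets. A vision encoder would typically apply this over a 2D patch grid; the 1D form, the base of 10000, and the toy vectors below are assumptions for illustration only.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs by position-dependent angles."""
    dim = len(vec)  # must be even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (hypothetical values).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at different absolute positions.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 2), rope(k, 0))
print(score_a, score_b)
```

The two scores agree up to floating-point error, which is the property that lets attention generalize across sequence lengths, and by extension across image sizes when positions index patch rows and columns.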