MiMo-VL Technical Report

📝 Paper Summary

Vision-Language Models (VLMs), Multimodal Reasoning

MiMo-VL-7B combines a four-stage pre-training pipeline enriched with synthetic Chain-of-Thought reasoning data and a post-training phase using Mixed On-policy Reinforcement Learning to achieve state-of-the-art multimodal performance.

Core Problem

Traditional VLM pre-training often relies on short QA pairs that lead to superficial pattern matching, while post-training rarely optimizes diverse capabilities (reasoning, grounding, preference) simultaneously.

Why it matters:

Standard QA data restricts models to surface-level understanding, failing to develop complex logical reasoning required for 'thinking' models
Simultaneously improving diverse capabilities (e.g., visual grounding vs. logical reasoning) is difficult due to interference and differing convergence rates across domains

Concrete Example: In current VLMs, training on short-answer data prevents the model from learning generalizable reasoning patterns. Conversely, MiMo-VL incorporates synthesized long Chain-of-Thought data during pre-training to enable complex problem-solving in domains like STEM.

Key Novelty

Mixed On-policy Reinforcement Learning (MORL) & Reasoning-Heavy Pre-training

Integrates diverse reward signals (perception accuracy, grounding precision, reasoning, human preference) into a single on-policy RL framework, unlike methods that optimize these separately.
Injects massive amounts of synthetic reasoning data with long Chain-of-Thought directly into pre-training stages rather than just fine-tuning, preventing saturation and enabling deeper logic learning.

Architecture

The overall architecture of MiMo-VL-7B, showing the Vision Transformer, MLP Projector, and LLM backbone.

Evaluation Highlights

Outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, achieving 66.7 on the MMMU benchmark.
Sets a new state-of-the-art for GUI grounding with 56.1 on OSWorld-G, surpassing specialized models like UI-TARS.
Achieves 59.4 on OlympiadBench, outperforming larger models with up to 78B parameters in multimodal reasoning.

Breakthrough Assessment

8/10

Strong performance for a 7B model, particularly in GUI agents and reasoning. The successful integration of mixed on-policy RL for multimodal tasks is a significant methodological consolidation.

⚙️ Technical Details

Problem Definition

Setting: General-purpose vision-language understanding and reasoning

Inputs: Multimodal inputs including images, videos, and text prompts

Outputs: Textual responses, bounding box coordinates for grounding, or action trajectories for GUI agents

Pipeline Flow

Visual Encoding (ViT) → Projection (MLP) → Language Modeling (LLM)
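The three-step flow above can be illustrated with a toy shape walk-through. This is a minimal sketch, not the released implementation: the dimensions, the ReLU activation, the random weights, and every function name (`vit_encode`, `mlp_project`, `build_llm_input`) are assumptions chosen only to show how patch features become tokens in the LLM's embedding space.

```python
import random

def random_matrix(rows, cols, rng):
    # Random weights or embeddings; stands in for learned parameters.
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def vit_encode(num_patches, vis_dim, rng):
    # Stand-in for the visual encoder: one feature vector per image patch.
    return random_matrix(num_patches, vis_dim, rng)

def linear(x, w):
    # x: [n, d_in], w: [d_in, d_out] -> [n, d_out]
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def mlp_project(feats, w1, w2):
    # Two-layer MLP projector; ReLU is an illustrative choice, not from the paper.
    hidden = [[max(0.0, v) for v in row] for row in linear(feats, w1)]
    return linear(hidden, w2)

def build_llm_input(visual_tokens, text_embeddings):
    # Projected visual tokens and text embeddings share one sequence.
    return visual_tokens + text_embeddings

rng = random.Random(0)
VIS_DIM, HIDDEN_DIM, LLM_DIM = 6, 8, 4           # toy sizes (assumed)
patches = vit_encode(5, VIS_DIM, rng)             # 5 image patches
w1 = random_matrix(VIS_DIM, HIDDEN_DIM, rng)
w2 = random_matrix(HIDDEN_DIM, LLM_DIM, rng)
visual_tokens = mlp_project(patches, w1, w2)      # [5, LLM_DIM]
text_embeddings = random_matrix(3, LLM_DIM, rng)  # 3 text tokens
llm_input = build_llm_input(visual_tokens, text_embeddings)
print(len(llm_input), len(llm_input[0]))          # prints: 8 4
```

The point of the sketch is only that the projector changes the feature width from the encoder's dimension to the LLM's, so both modalities can be concatenated along the sequence axis.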

System Modules

Visual Encoder (Input Processing)

Encodes visual inputs (images/videos) into latent representations

Model or implementation: Qwen2.5-ViT (native resolution)

Projector (Input Processing)

Maps visual embeddings to the LLM's token space

Model or implementation: Multi-Layer Perceptron (MLP)

Language Backbone

Performs reasoning and text generation based on multimodal context

Model or implementation: MiMo-7B-Base

Novel Architectural Elements

Integration of mixed reward signals (perception, grounding, reasoning, preference) directly into a unified on-policy RL post-training phase
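One way to picture a unified reward interface is a dispatcher that scores each rollout with the reward function matching its task type. The sketch below is hypothetical: the task names, the sample dictionary fields, the exact-match verifier, the plain IoU grounding score, and the clipped `rm_score` are all assumptions for illustration and are not taken from the paper.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection over union in [0, 1].
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def verifiable_reward(sample):
    # Rule-based check: 1.0 if the prediction matches the reference answer.
    return 1.0 if sample["prediction"].strip() == sample["answer"].strip() else 0.0

def grounding_reward(sample):
    # Overlap between predicted and ground-truth boxes.
    return iou(sample["pred_box"], sample["gt_box"])

def preference_reward(sample):
    # Placeholder for a learned reward model score, clipped to [0, 1].
    return min(1.0, max(0.0, sample["rm_score"]))

# Hypothetical routing table: one reward function per task type.
REWARD_FNS = {
    "perception": verifiable_reward,
    "reasoning": verifiable_reward,
    "grounding": grounding_reward,
    "preference": preference_reward,
}

def mixed_reward(sample):
    # Every task type yields a scalar in [0, 1], so one RL loop can consume all of them.
    return REWARD_FNS[sample["task"]](sample)

batch = [
    {"task": "reasoning", "prediction": "42", "answer": "42"},
    {"task": "grounding", "pred_box": (0, 0, 2, 2), "gt_box": (1, 1, 3, 3)},
    {"task": "preference", "rm_score": 0.8},
]
rewards = [mixed_reward(s) for s in batch]
```

Keeping every reward on a shared scale is a design assumption here; it lets samples from different domains be mixed in one batch without per-task loss weighting.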

Modeling

Base Model: MiMo-7B-Base (LLM) + Qwen2.5-ViT (Vision)

Training Method: Four-stage pre-training followed by Mixed On-policy Reinforcement Learning (MORL)

Objective Functions:

Purpose: Optimize policy using group relative rewards to stabilize training.

Formally: GRPO (Group Relative Policy Optimization) loss

Purpose: Align model with diverse objectives (accuracy, grounding, reasoning).

Formally: Maximization of mixed reward signals R_mixed covering perception, grounding, reasoning, and preference
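The group-relative idea can be sketched as follows: for one prompt, several responses are sampled, each receives a scalar reward, and each response's advantage is its reward standardized against the group, A_i = (r_i - mean(r)) / (std(r) + eps). This is the commonly described GRPO advantage; whether MORL uses exactly this normalization is an assumption, and the function name and `eps` value are illustrative.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    # rewards: scalar rewards for several responses sampled from the same prompt.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # The group mean acts as the baseline, so no learned value network is needed.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt: two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response gets the same reward, all advantages are zero
# and the group contributes no gradient signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, which is what pushes the policy toward higher-reward outputs within each group.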

Adaptation: Full model training (Vision encoder unfrozen in Stage 2, all params trainable in Stage 3+)
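The stage-wise unfreezing described above can be written as a small lookup table. This is a sketch under assumptions: the summary only states that the vision encoder is unfrozen in Stage 2 and that all parameters are trainable from Stage 3 onward, so the Stage 1 entry (projector only) and the frozen LLM in Stage 2 are assumed here, and the module names are hypothetical.

```python
MODULES = ("vision_encoder", "projector", "llm")

# Which modules receive gradient updates in each pre-training stage.
TRAINABLE_BY_STAGE = {
    1: {"projector"},                    # assumption: projector warm-up only
    2: {"vision_encoder", "projector"},  # vision encoder unfrozen; LLM assumed frozen
    3: set(MODULES),                     # all parameters trainable
    4: set(MODULES),                     # all parameters trainable
}

def is_trainable(stage, module):
    # True if `module` should have gradients enabled in the given stage.
    if module not in MODULES:
        raise ValueError(f"unknown module: {module}")
    return module in TRAINABLE_BY_STAGE[stage]

def frozen_modules(stage):
    # Modules kept fixed in the given stage, in a stable order.
    return [m for m in MODULES if not is_trainable(stage, m)]
```

In a real training loop, a table like this would typically drive the per-parameter gradient flags before each stage starts.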

Trainable Parameters: 7B

Training Data:

2.4 trillion tokens total pre-training data
Includes image captions, interleaved image-text, OCR, grounding, video, GUI, and synthetic reasoning data

Key Hyperparameters:

pretraining_stage_4_sequence_length: 32K
pretraining_stage_4_learning_rate: 2.5e-5
pretraining_stage_3_learning_rate: 1e-5

Compute: Not reported in the paper

Comparison to Prior Work

vs. Qwen2.5-VL-7B: MiMo-VL uses Mixed On-policy RL (MORL) and heavy synthetic CoT data, outperforming it on 35/40 tasks.
vs. UI-TARS: MiMo-VL achieves higher GUI grounding accuracy (56.1 on OSWorld-G; the UI-TARS score is not reported here) despite being a general-purpose model, an advantage attributed to GUI-specific pre-training data.
vs. LLaVA [not cited in paper]: Unlike LLaVA's simple SFT, MiMo-VL employs a multi-stage pre-training with RL-based post-training for alignment.

Limitations

Simultaneous optimization of multiple domains (reasoning vs. grounding) in RL remains unstable due to interference.
Growth trends in response length and task difficulty vary across domains, complicating convergence.
Computationally expensive data curation (2.4T tokens) and multi-stage pipeline may be difficult to replicate without large resources.

Reproducibility

Code: https://github.com/XiaomiMiMo/MiMo-VL

Model checkpoints (SFT and RL versions) and a full evaluation suite covering 50+ tasks are available at https://github.com/XiaomiMiMo/MiMo-VL. Detailed data recipes and processing pipelines are described, but the raw training data itself is not explicitly released.

📊 Experiments & Results

Evaluation Setup

Comprehensive evaluation across general visual understanding, multimodal reasoning, and GUI agent tasks.

Benchmarks:

MMMU (Multimodal reasoning and understanding)
OlympiadBench (Complex mathematical and scientific reasoning)
OSWorld-G (GUI grounding and agentic operation)

Metrics:

Accuracy (scores on benchmarks)
Elo rating (for human preference)
Statistical methodology: Not explicitly reported in the paper
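For readers unfamiliar with Elo ratings, the standard update from a single pairwise comparison is sketched below. The expected-score formula is the usual Elo definition; the K-factor of 32, the starting rating of 1000, and the function names are assumptions, and the paper's actual preference-evaluation protocol is not described in this summary.

```python
def expected_score(rating_a, rating_b):
    # Probability that A is preferred over B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    # score_a: 1.0 if A's response was preferred, 0.0 if B's, 0.5 for a tie.
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start level; model A wins one pairwise comparison.
new_a, new_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(new_a, new_b)  # prints: 1016.0 984.0
```

The update is zero-sum: whatever one model gains, the other loses, so ratings are only meaningful relative to the other models in the same pool.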

Key Results

MiMo-VL-7B-RL demonstrates superior performance on general visual perception and reasoning benchmarks compared to open-source baselines.

Benchmark	Metric	Baseline	This Paper	Δ
MMMU	Score	Not reported in the paper	66.7	Not reported in the paper
OlympiadBench	Score	Not reported in the paper	59.4	Not reported in the paper
OSWorld-G	Score	Not reported in the paper	56.1	Not reported in the paper

Main Takeaways

Incorporating synthetic reasoning data with long Chain-of-Thought into pre-training significantly improves performance without saturation.
Mixed On-policy RL effectively aligns the model across disparate tasks (grounding, reasoning, preference), though balancing these signals is challenging.
The model generalizes well to agentic tasks like GUI grounding, setting new standards even against specialized models.

📚 Prerequisite Knowledge

Prerequisites

Vision Transformers (ViT)
Reinforcement Learning from Human Feedback (RLHF)
Chain-of-Thought (CoT) prompting

Key Terms

MORL: Mixed On-policy Reinforcement Learning—a framework integrating multiple reward signals (grounding, reasoning, preference) into a single training phase

GRPO: Group Relative Policy Optimization—an RL algorithm that estimates baselines from group averages of outputs for the same input to reduce variance

CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training on labeled input-output pairs to align the model with desired instruction formats

ViT: Vision Transformer—a model architecture that processes images as sequences of patches, used here as the visual encoder

GUI Grounding: The task of mapping natural language instructions or element descriptions to specific coordinate locations on a graphical user interface

phash: Perceptual Hashing—an algorithm that produces a fingerprint for an image based on its visual features, used for deduplication
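Hash-based image deduplication can be sketched as follows. Note the substitution: classic pHash is built on a discrete cosine transform, while this example uses the simpler average hash (aHash) to stay dependency-free. The grid size, the Hamming-distance threshold of 5, and all function names are assumptions for illustration, not the paper's pipeline.

```python
def average_hash(gray, hash_size=8):
    # gray: 2D list of pixel intensities; height and width must be multiples of hash_size.
    height, width = len(gray), len(gray[0])
    bh, bw = height // hash_size, width // hash_size
    cells = []
    for i in range(hash_size):
        for j in range(hash_size):
            # Downsample by averaging each bh x bw block of pixels.
            block = [gray[y][x]
                     for y in range(i * bh, (i + 1) * bh)
                     for x in range(j * bw, (j + 1) * bw)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for c in cells:
        # One bit per cell: brighter than the image mean or not.
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits

def hamming(a, b):
    # Number of differing bits between two hashes.
    return bin(a ^ b).count("1")

def is_near_duplicate(hash_a, hash_b, max_distance=5):
    # Images whose hashes differ in only a few bits are treated as duplicates.
    return hamming(hash_a, hash_b) <= max_distance

# A 16x16 horizontal gradient, a uniformly brightened copy, and its inverse.
img = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[v + 1 for v in row] for row in img]
inverted = [[255 - v for v in row] for row in img]
```

Here `brighter` hashes identically to `img` because a uniform brightness shift moves every cell and the mean by the same amount, while `inverted` flips the bright and dark regions and lands far away in Hamming distance.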