Kimi-VL Technical Report

📝 Paper Summary

Vision-Language Models (VLM) · Mixture-of-Experts (MoE) · Efficient Multimodal Learning

Kimi-VL combines a native-resolution vision encoder with an efficient MoE language model and reinforcement learning to achieve strong multimodal reasoning and long-context understanding with only 2.8B activated parameters.

Core Problem

Existing open-source VLMs often rely on dense architectures that are computationally heavy, lack support for long Chain-of-Thought (CoT) reasoning, or use fixed-size vision encoders that struggle with varying resolutions.

Why it matters:

Traditional fixed-size vision encoders require complex splitting/splicing for high-resolution inputs, limiting adaptability
Most efficient open-source VLMs lack the long-horizon reasoning capabilities (System 2 thinking) seen in proprietary models like o1 or Kimi k1.5
Dense architectures scale poorly compared to Mixture-of-Experts (MoE) for high-throughput deployment

Concrete Example: When processing high-resolution images or long documents, models with fixed positional embeddings (like SigLIP's original implementation) fail to generalize because interpolated embeddings become inadequate as resolution increases, causing fine-grained details to be lost.

Key Novelty

Kimi-VL (Efficient MoE VLM with Long-Thinking)

Integrates a native-resolution vision encoder (MoonViT) that uses sequence packing (NaViT-style) and 2D Rotary Positional Embeddings to handle variable aspect ratios without padding
Employs a multi-stage training pipeline culminating in Reinforcement Learning (RL) to internalize long Chain-of-Thought reasoning strategies (planning, reflection) for multimodal tasks

Architecture

The architectural components and data flow of Kimi-VL.

Evaluation Highlights

Achieves 64.0 on MMMU (multimodal reasoning) with Kimi-VL-Thinking, demonstrating strong reasoning capabilities
Scores 80.1 on MathVista and 83.2 on V* (high-resolution perception), showing robust fine-grained visual understanding
Attains 64.5 on LongVideoBench using a 128K context window, effectively handling long-context video understanding

Breakthrough Assessment

8/10

Delivers competitive reasoning and long-context performance in a highly efficient 2.8B activated parameter package, bridging the gap between open-source efficient models and flagship proprietary systems.

⚙️ Technical Details

Problem Definition

Setting: Multimodal generative modeling (Image/Video/Text input -> Text output)

Inputs: Multimodal sequence x consisting of text tokens and visual inputs (images or video frames) of varying resolutions

Outputs: Textual response y (reasoning trace and final answer)

Pipeline Flow

MoonViT (Vision Encoder)
MLP Projector
Moonlight (MoE Language Model)

System Modules

MoonViT

Encodes visual inputs at native resolution using patch packing

Model or implementation: Based on SigLIP-SO-400M, continually pre-trained
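As a rough illustration of patch packing, the sketch below flattens the patch grids of several differently sized images into one sequence and records where each image starts and ends. The patch size of 14, the helper names, and the `cu_seqlens` boundary list are assumptions made for this example, not details taken from the report.

```python
from typing import List, Tuple

PATCH = 14  # assumed patch size in pixels (hypothetical; not stated in this summary)


def patch_grid(height: int, width: int, patch: int = PATCH) -> Tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image whose sides are multiples of `patch`."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


def pack_images(sizes: List[Tuple[int, int]], patch: int = PATCH):
    """Pack the patches of several images into one sequence without padding.

    Returns (positions, cu_seqlens):
      positions  - one (image_index, row, col) tuple per packed patch
      cu_seqlens - cumulative boundaries, so attention can be restricted
                   to patches that belong to the same image
    """
    positions = []
    cu_seqlens = [0]
    for idx, (h, w) in enumerate(sizes):
        rows, cols = patch_grid(h, w, patch)
        positions.extend((idx, r, c) for r in range(rows) for c in range(cols))
        cu_seqlens.append(len(positions))
    return positions, cu_seqlens


# Two images with different aspect ratios: a 2x3 grid and a 4x1 grid.
positions, cu_seqlens = pack_images([(28, 42), (56, 14)])
print(cu_seqlens)  # [0, 6, 10]
```

The boundary list is what lets a variable-length attention kernel treat the packed sequence as several independent images, so no padding tokens are needed.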

MLP Projector

Projects visual features into the LLM's embedding space

Model or implementation: 2-layer MLP with pixel shuffle
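The pixel-shuffle step can be pictured as a space-to-depth merge: neighbouring patch features are concatenated along the channel axis, so the LLM sees fewer, wider visual tokens. The sketch below assumes a 2x2 merge factor and plain nested lists; both are illustrative choices, and the 2-layer MLP that would follow is omitted.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as [row][col][channel]


def pixel_shuffle_merge(features: Grid, factor: int = 2) -> Grid:
    """Merge each factor x factor block of patch features into one wider token.

    The token count shrinks by factor**2 and the channel width grows by factor**2.
    """
    rows, cols = len(features), len(features[0])
    if rows % factor or cols % factor:
        raise ValueError("grid sides must be divisible by the merge factor")
    merged = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            cell = []
            for dr in range(factor):
                for dc in range(factor):
                    cell.extend(features[r + dr][c + dc])
            out_row.append(cell)
        merged.append(out_row)
    return merged


# 4x4 grid with one channel; each value encodes its own position.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
merged = pixel_shuffle_merge(grid)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 2 2 4
print(merged[0][0])  # [0.0, 1.0, 4.0, 5.0]
```

With a factor of 2, sixteen patch tokens become four, which is the main reason to place such a step before the projector: it shortens the visual sequence the language model has to attend over.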

Moonlight

Generates text responses from the projected visual features and its internal knowledge

Model or implementation: MoE Transformer (2.8B activated, 16B total)
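To make the gap between activated and total parameters concrete, here is a minimal sketch of top-k expert routing for a single token. The number of experts, the value of `top_k`, and the renormalisation of the selected gate weights are assumptions for illustration; the summary does not describe Moonlight's actual router.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], List[float]]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    token: Sequence[float],
    router_logits: Sequence[float],
    experts: Sequence[Expert],
    top_k: int = 2,
) -> Tuple[List[float], List[int]]:
    """Run only the top_k experts chosen by the router and mix their outputs."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalise over the selected experts (assumed)
    out = [0.0] * len(token)
    for i in chosen:
        weight = probs[i] / norm
        y = experts[i](token)  # experts outside `chosen` are never evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out, chosen


# Four toy experts that scale their input by 1, 2, 3 and 4.
experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
output, chosen = moe_forward([1.0, -1.0], [0.0, 2.0, 0.0, 2.0], experts)
print(chosen)  # [1, 3]
print(output)  # [3.0, -3.0]

# Share of parameters active per token, from the figures quoted above.
print(2.8 / 16)  # about 0.175
```

Only the selected experts run for a given token, which is why per-token compute tracks the 2.8B activated parameters rather than the 16B total.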

Novel Architectural Elements

Integration of NaViT-style packing (variable resolution support) directly into the vision encoder of an MoE VLM pipeline
Hybrid positional encoding in Vision Tower: interpolating SigLIP's absolute embeddings + adding 2D RoPE for fine-grained spatial awareness
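The 2D RoPE idea can be sketched by rotating one half of each feature vector by the patch's row index and the other half by its column index. The half/half split, the base of 10000, and the function names are assumptions for this example; the summary does not specify how MoonViT divides the dimensions.

```python
import math
from typing import List, Sequence


def rope_1d(vec: Sequence[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec: Sequence[float], row: int, col: int, base: float = 10000.0) -> List[float]:
    """First half encodes the row, second half the column (length must be a multiple of 4)."""
    if len(vec) % 4:
        raise ValueError("feature length must be a multiple of 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# Both pairs of patches are separated by the same offset: 1 row and 2 columns.
score_a = dot(rope_2d(q, 2, 5), rope_2d(k, 1, 3))
score_b = dot(rope_2d(q, 12, 9), rope_2d(k, 11, 7))
print(abs(score_a - score_b) < 1e-9)  # True
```

The check at the end shows the property that makes rotary encodings attractive for variable resolutions: the attention score depends on the relative row and column offsets, not on absolute patch coordinates.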

Modeling

Base Model: Moonlight MoE (16B total, 2.8B activated parameters)

Training Method: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)

Objective Functions:

Purpose: Pre-train vision encoder.
Formally: L = L_siglip + lambda * L_caption (CoCa-style objective)

Purpose: Optimize policy for reasoning.
Formally: Online policy mirror descent optimizing expected reward regularized by KL divergence (relative entropy)

Purpose: Prevent overthinking during RL.
Formally: Length-based reward penalty for excessively long responses
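For readers who prefer symbols, the first two objectives can be written out as follows. The first line restates the CoCa-style loss with the caption weight from the hyperparameter list. The second is a generic KL-regularized reward objective of the kind described above; the temperature \tau and the reference policy \pi_{\text{ref}} (for example, the policy from the previous iteration) are assumed notation, not symbols quoted from the report.

```latex
% CoCa-style vision pre-training loss (lambda = 2 per the hyperparameter list)
\mathcal{L} = \mathcal{L}_{\text{siglip}} + \lambda \, \mathcal{L}_{\text{caption}},
\qquad \lambda = 2

% Generic KL-regularized RL objective; \tau and \pi_{\text{ref}} are assumed notation
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  - \tau \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
\Big]
```

The KL term keeps each policy update close to the reference policy, which is the role the summary attributes to the relative-entropy regularizer.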

Adaptation: Full model training (Vision + Projector + LLM) during specific stages

Trainable Parameters: Vision Encoder (400M), Projector, and MoE LLM (16B total)

Training Data:

Pre-training: 2.3T tokens (joint text + multimodal)
Cooldown: High-quality synthetic QA, math, code, and rewritten visual data
Long-context: Long videos, documents, and interleaved data up to 128K length
RL: Warmup set of long-CoT prompts followed by RL on reasoning problems

Key Hyperparameters:

context_window: 128K
lambda_caption_loss: 2
learning_rate_decay_stage_1: 2e-5 to 2e-6
learning_rate_stage_2: 1e-5 to 1e-6
rope_theta_reset: 800,000
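One way to see why the RoPE base is raised for long-context training is to compare the longest rotary wavelength before and after the reset. In the sketch below, the head dimension of 128 and the earlier base of 50,000 are assumptions chosen for illustration; only the 800,000 value comes from the list above.

```python
import math


def longest_wavelength(base: float, head_dim: int) -> float:
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest_rate = base ** (-(head_dim - 2) / head_dim)  # radians per position
    return 2 * math.pi / slowest_rate


HEAD_DIM = 128          # assumed head dimension (hypothetical)
OLD_BASE = 50_000.0     # assumed pre-reset base (hypothetical)
NEW_BASE = 800_000.0    # rope_theta_reset from the hyperparameter list

old_wl = longest_wavelength(OLD_BASE, HEAD_DIM)
new_wl = longest_wavelength(NEW_BASE, HEAD_DIM)
print(new_wl > old_wl)  # True
print(round(new_wl / old_wl, 1))
```

A larger base slows the rotation of the low-frequency pairs, so positions that are far apart in a 128K-token sequence still map to clearly distinct angles.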

Compute: Training throughput is approximately 60% higher than that of a 7B dense VLM (e.g., one based on Qwen2.5-7B)

Comparison to Prior Work

vs. DeepSeek-VL2: Kimi-VL supports 128K context (vs 4K) and native resolution processing
vs. LLaVA-OneVision: Kimi-VL uses native patch packing (NaViT) instead of splitting images into sub-grids
vs. Qwen2.5-VL: Kimi-VL uses an MoE architecture for better inference efficiency (2.8B activated params)
vs. OpenAI o1: Kimi-VL is open-weights and targets efficiency rather than maximizing scale (a conceptual comparison; the paper does not use o1 as a direct baseline)

Limitations

Dependency on synthetic data for cooldown and reasoning stages requires careful quality control
No specific breakdown of performance on extremely low-resource languages provided
RL training for long-thinking requires sophisticated reward engineering, which makes the results hard to reproduce without the exact reward models

Reproducibility

Code: https://github.com/MoonshotAI/Kimi-VL

Code and models are publicly available at https://github.com/MoonshotAI/Kimi-VL. The data loader system is described in detail. RL prompts and specific dataset mixtures are described qualitatively but not released as raw files.

📊 Experiments & Results

Evaluation Setup

Comprehensive multimodal evaluation across reasoning, perception, and long-context tasks.

Benchmarks:

MMMU (Multimodal Reasoning, College-level)
MathVista (Visual Math Reasoning)
LongVideoBench (Long-context Video Understanding)
InfoVQA (Visual Question Answering, Document/Scene)

Metrics:

Accuracy
Score (Benchmark specific)

Statistical methodology: Not explicitly reported in the paper

Key Results

Kimi-VL demonstrates higher training throughput compared to dense baselines due to its MoE architecture and optimized parallelization.

Benchmark	Metric	Baseline	This Paper	Δ
Training Throughput	Relative Throughput	1.0	1.6	+0.6

Experiment Figures

The four-stage pre-training pipeline of Kimi-VL.

Main Takeaways

Kimi-VL-Thinking achieves state-of-the-art level performance (e.g., 64.0 on MMMU) among efficient models, surpassing GPT-4o in specific reasoning domains despite having fewer activated parameters.
The native-resolution vision encoder (MoonViT) allows for superior performance on high-resolution tasks like InfoVQA (83.2) and ScreenSpot-Pro (34.5/52.8) compared to fixed-resolution counterparts.
Long-context training effectively extends capabilities to 128K tokens, evidenced by strong results on LongVideoBench (64.5), validating the stability of the MoE architecture for long sequences.

📚 Prerequisite Knowledge

Prerequisites

Mixture-of-Experts (MoE) transformer architecture
Vision Transformer (ViT) basics
Reinforcement Learning (RL) for LLMs (PPO/DPO concepts)

Key Terms

MoE: Mixture-of-Experts—a neural network architecture where only a subset of parameters (experts) are activated for each token, improving efficiency

CoT: Chain-of-Thought—a prompting or training technique where the model generates intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training the model on labeled instruction-response pairs

RL: Reinforcement Learning—training method where the model learns to maximize a reward signal (e.g., answer correctness)

RoPE: Rotary Positional Embedding—a method for encoding positional information in transformers by rotating query and key vectors

SigLIP: Sigmoid Loss for Language Image Pre-training—a contrastive learning method for aligning image and text representations

NaViT: Native Resolution Vision Transformer—a technique to process images of arbitrary resolutions by packing patches into a sequence without padding

ZeRO: Zero Redundancy Optimizer—a memory optimization technique for distributed training of large models

Muon: A momentum-based optimizer designed for efficient large-scale training

NTP: Next Token Prediction—the standard training objective for language models

NIAH: Needle-In-A-Haystack—an evaluation measuring a model's ability to retrieve specific information from a long context window