Seed1.5-VL Technical Report

📝 Paper Summary

Vision-Language Foundation Models, Multimodal Agents, Video Understanding

Seed1.5-VL integrates a native-resolution vision encoder with a large mixture-of-experts language model, using dynamic video sampling and hybrid reinforcement learning to achieve state-of-the-art multimodal understanding.

Core Problem

Current vision-language models struggle with fine-grained visual details, 3D spatial understanding, and long-tail concept recognition due to fixed-resolution encoders and imbalanced training data.

Why it matters:

Fixed-resolution encoders discard critical details in high-resolution images and OCR tasks, limiting real-world utility
Standard pre-training data is heavily skewed toward common concepts, causing models to fail on rare objects or species (the long-tail problem)
Existing video encoding methods often use uniform sampling, which is inefficient for long videos and misses rapid temporal events

Concrete Example: In species classification, a model trained on random web data fails to recognize rare animals (10.46% accuracy) because common species dominate the learning budget, whereas Seed1.5-VL's balanced sampling boosts this to 44.85%.

Key Novelty

Seed-ViT with Dynamic Frame-Resolution Sampling

Uses a vision encoder (Seed-ViT) that natively handles variable image aspect ratios and resolutions using 2D Rotary Positional Embeddings (RoPE), avoiding resizing artifacts
Employs a dynamic strategy for video that adjusts both frame rate and spatial resolution based on content complexity, rather than using fixed sampling
Integrates 'Hybrid Reinforcement Learning' that combines human feedback (RLHF) with verifiable rewards (e.g., correct answers for puzzles/math) to improve reasoning
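The 2D RoPE idea in the first point above can be pictured with a minimal NumPy sketch. This is not Seed-ViT's implementation: it assumes one common 2D RoPE layout in which the first half of the channels is rotated by a patch's row index and the second half by its column index. The function names, the `base=10000.0` frequency, and the half-split layout are assumptions made for illustration.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate adjacent channel pairs of x (n, d) by angles pos * freq; d must be even."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,) per-pair frequencies
    angles = pos[:, None] * freqs[None, :]         # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rope_2d(x, grid_h, grid_w, base=10000.0):
    """Apply 2D RoPE to patch features x of shape (grid_h * grid_w, d), with d % 4 == 0.

    Assumed layout: first half of channels encodes the row, second half the column.
    """
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    assert n == grid_h * grid_w and d % 4 == 0
    rows, cols = np.divmod(np.arange(n), grid_w)   # row-major patch order
    half = d // 2
    return np.concatenate(
        [rope_1d(x[:, :half], rows.astype(float), base),
         rope_1d(x[:, half:], cols.astype(float), base)],
        axis=1,
    )

# Works for any grid shape, so a non-square image needs no resizing to a fixed grid.
features = np.random.default_rng(0).normal(size=(2 * 3, 8))
rotated = rope_2d(features, grid_h=2, grid_w=3)
```

Because the rotation depends only on each patch's (row, column) index, the same function covers any aspect ratio, which is the property the native-resolution design relies on.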

Evaluation Highlights

State-of-the-art performance on 38 out of 60 public benchmarks, including 21 vision-language and 14 video tasks
Outperforms OpenAI CUA and Claude 3.7 in agent-centric tasks like GUI control and gameplay
Balanced data sampling improves rare concept recognition by +34.39 points compared to random sampling in controlled experiments

Breakthrough Assessment

8/10

Presents a highly capable model (accessible only through an API; weights and code are not released) with significant architectural optimizations for resolution and video, plus a robust recipe for data synthesis and post-training.

⚙️ Technical Details

Problem Definition

Setting: General-purpose multimodal understanding and reasoning across images, videos, and text

Inputs: Multimodal sequence M = {I, V, T} containing images I, videos V, and text T

Outputs: Textual response generated by the language model

Pipeline Flow

Input Processing: Seed-ViT encodes images/videos
Adaptation: MLP projects visual features to LLM space
Generation: Seed1.5-LLM generates text response

System Modules

Seed-ViT

Encodes images and video frames into visual embeddings

Model or implementation: Custom ViT (532M parameters)

MLP Adapter

Projects visual embeddings into the LLM's input dimension

Model or implementation: Two-layer MLP
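As a rough illustration of the adapter step, the sketch below projects visual embeddings into a language-model input dimension with a two-layer MLP. The summary only states that the adapter has two layers; the dimensions (`vit_dim`, `hidden_dim`, `llm_dim`), the GELU activation, and the random initialization are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPAdapter:
    """Two-layer MLP mapping (num_tokens, vit_dim) -> (num_tokens, llm_dim)."""

    def __init__(self, vit_dim, hidden_dim, llm_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(vit_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, visual_tokens):
        hidden = gelu(visual_tokens @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2

# Hypothetical sizes: 5 visual tokens of width 16 projected to width 24.
adapter = MLPAdapter(vit_dim=16, hidden_dim=32, llm_dim=24)
llm_inputs = adapter(np.ones((5, 16)))
print(llm_inputs.shape)
```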

Seed1.5-LLM

Processes multimodal tokens and text instructions to generate answers

Model or implementation: Mixture-of-Experts (MoE) LLM
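For readers unfamiliar with MoE layers, here is a generic top-k routing sketch. The summary does not describe Seed1.5-LLM's router, expert count, or expert shape, so everything below (softmax over the selected experts, `top_k=2`, linear experts) is an assumption chosen to show the general mechanism only.

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Generic top-k MoE layer.

    x: (n, d) tokens; gate_w: (d, E) router weights; expert_ws: (E, d, d) linear experts.
    Returns (output of shape (n, d), chosen expert indices of shape (n, top_k)).
    """
    x = np.asarray(x, dtype=float)
    logits = x @ gate_w                                    # (n, E) router scores
    top_idx = np.argsort(-logits, axis=1)[:, :top_k]       # best experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # Softmax over the selected experts only, so mixing weights sum to 1.
    weights = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            expert = top_idx[t, slot]
            out[t] += weights[t, slot] * (x[t] @ expert_ws[expert])
    return out, top_idx
```

Only `top_k` of the `E` experts run for each token, which is why the active parameter count can be much smaller than the total parameter count.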

Novel Architectural Elements

Dynamic Frame-Resolution Sampling: Jointly optimizes frame rate and spatial resolution for videos within a token budget
Native-Resolution Transform: Processes non-square images without padding or resizing by using dynamic patching and 2D RoPE
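The trade-off between frame rate and per-frame resolution under a token budget can be sketched as a small planning function. This is a simplified illustration, not the paper's algorithm: the candidate tokens-per-frame values, the budget used in the example, and the fallback rule are assumptions, and the caller is assumed to pick `fps` according to how much motion the task involves.

```python
def plan_video_sampling(duration_s, fps, token_budget,
                        tokens_per_frame_options=(640, 512, 384, 256, 160, 128)):
    """Return (num_frames, tokens_per_frame) that fits within token_budget.

    Strategy (assumed): keep the requested frame rate and lower per-frame
    resolution until the video fits; if the lowest resolution still does
    not fit, reduce the number of frames instead.
    """
    options = sorted(tokens_per_frame_options, reverse=True)
    num_frames = max(1, int(duration_s * fps))
    for tokens_per_frame in options:               # try highest resolution first
        if num_frames * tokens_per_frame <= token_budget:
            return num_frames, tokens_per_frame
    lowest = options[-1]
    return max(1, token_budget // lowest), lowest

# Hypothetical budget of 81,920 visual tokens per video.
print(plan_video_sampling(60, 1, 81920))      # short clip: full resolution
print(plan_video_sampling(600, 1, 81920))     # longer clip: lower resolution
print(plan_video_sampling(3600, 1, 81920))    # very long clip: fewer frames
```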

Modeling

Base Model: Seed1.5-LLM (20B active parameters, MoE)

Training Method: Hybrid Reinforcement Learning (RLHF + Verifiable Rewards)
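One way to picture the hybrid reward is a router that uses a rule-based verifier when a reference answer exists and a learned preference model otherwise. The sketch below is hypothetical: the `Sample` structure, the exact-match verifier, and the `reward_model` callable are illustrative names, and the summary does not say how the two reward sources are actually combined.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Sample:
    prompt: str
    response: str
    reference_answer: Optional[str] = None   # present only for verifiable tasks

def exact_match_verifier(response: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the normalized answers match, else 0.0."""
    return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0

def hybrid_reward(sample: Sample,
                  reward_model: Callable[[str, str], float]) -> float:
    """Use the verifier when a reference exists, otherwise the preference model."""
    if sample.reference_answer is not None:
        return exact_match_verifier(sample.response, sample.reference_answer)
    return float(reward_model(sample.prompt, sample.response))
```

A math or puzzle prompt with a known answer would be scored by the verifier, while an open-ended captioning prompt would fall through to the preference model.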

Objective Functions:

Purpose: Align visual features with text.
Formally: SigLIP loss (Sigmoid Loss for Language Image Pre-training).

Purpose: Reconstruct masked image patches.
Formally: Cosine similarity loss between student and teacher CLIP features.

Purpose: Optimize policy against preference/reward.
Formally: Standard RL objectives (details not specified, likely PPO or similar).
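The first two objectives can be written down compactly. The sketch below follows the published form of the pairwise sigmoid (SigLIP) loss and a plain 1 minus cosine distillation loss; the `temperature` and `bias` values, the batch-mean reduction, and the function names are assumptions, and the summary does not give the exact variants used for Seed-ViT.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss: matching pairs are positives, all others negatives."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * img @ txt.T + bias          # (B, B) pair scores
    labels = 2.0 * np.eye(len(img)) - 1.0              # +1 on diagonal, -1 elsewhere
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed stably with logaddexp
    per_pair = np.logaddexp(0.0, -labels * logits)
    return per_pair.sum() / len(img)

def cosine_distill_loss(student_feat, teacher_feat):
    """Mean (1 - cosine similarity) between student and teacher features."""
    s = student_feat / np.linalg.norm(student_feat, axis=1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))
```

In a masked-image-modeling setup, `cosine_distill_loss` would typically be evaluated only on the masked positions; that masking is left out here for brevity.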

Training Data:

Stage 0: 16B tokens (MLP training only)
Stage 1: 3T tokens (Multimodal pre-training)
Stage 2: 240B tokens (Annealing & Long-context)

Key Hyperparameters:

stage_1_learning_rate_max: 5.22e-5
stage_2_context_length: 131,072
stage_1_batch_size_tokens: 71M
weight_decay: 0.1
optimizer: AdamW (beta1=0.9, beta2=0.95)
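To show how the listed optimizer settings enter an update, here is a single AdamW step in NumPy using the values above. It is a sketch only: `eps` is not given in the summary and is assumed, and the Stage 1 peak learning rate is used as a constant because the warm-up and decay schedule are not described here.

```python
import numpy as np

def adamw_step(param, grad, m, v, step, lr=5.22e-5, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update with decoupled weight decay; `step` starts at 1."""
    m = beta1 * m + (1.0 - beta1) * grad             # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2        # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)                # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * weight_decay * param        # decay applied to weights directly
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

weights = np.array([1.0, -2.0])
grads = np.array([0.5, 0.5])
weights, m, v = adamw_step(weights, grads, np.zeros(2), np.zeros(2), step=1)
```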

Compute: Hybrid Parallelism infrastructure required; specific GPU hours not reported

Comparison to Prior Work

vs. Qwen2-VL: Seed-ViT uses native 2D RoPE during pre-training rather than adapting 1D embeddings post-hoc
vs. NaViT: Seed1.5-VL integrates this into a full VLM pipeline with video support and hybrid RL
vs. GPT-4V [not cited in paper]: Seed1.5-VL claims superior performance on agentic tasks and uses a more compact sparse architecture (20B active params)

Limitations

Still suffers from hallucinations, particularly with knowledge priors overriding visual evidence
Struggles with 3D spatial imagination (e.g., viewing objects from novel angles)
Combinatorial search failures in complex visual reasoning tasks (e.g., connecting nodes in a graph without crossing)
Performance on very rare visual concepts remains challenging despite balancing efforts

Reproducibility

Model is accessible via Volcano Engine API (Model ID: doubao-1-5-thinking-vision-pro-250428). Weights, code, and training data are not released. Data synthesis pipelines are described but not open-sourced.

📊 Experiments & Results

Evaluation Setup

Comprehensive evaluation on 60 public benchmarks and internal suites

Benchmarks:

BioTrove-Balanced/Unseen (Species classification, sandbox experiment)
MMMU (Multimodal reasoning)
MathVista (Math reasoning)

Metrics:

Accuracy
Score
Statistical methodology: Not explicitly reported in the paper

Key Results

Sandbox experiments on the BioTrove dataset demonstrate that balancing data distribution (limiting max samples per common class) significantly improves performance on rare classes.

Benchmark	Metric	Baseline	This Paper	Δ
BioTrove Rare2k	Accuracy	10.46	44.85	+34.39
BioTrove Rare2k	Accuracy	44.85	89.41	+44.56
BioTrove Balanced10k	Accuracy	78.92	79.17	+0.25
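The balancing idea behind these numbers, limiting how many samples any one class may contribute, can be sketched in a few lines. This is an illustrative version under stated assumptions: the `(item, label)` data format, the random subsampling, and the cap value are hypothetical, and the paper's actual pipeline is not described in this summary.

```python
import random
from collections import Counter, defaultdict

def cap_per_class(samples, max_per_class, seed=0):
    """Keep at most max_per_class items per label; rare classes are kept whole.

    samples: iterable of (item, label) pairs.
    """
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    rng = random.Random(seed)
    capped = []
    for label, items in by_label.items():
        if len(items) > max_per_class:
            items = rng.sample(items, max_per_class)   # subsample common classes
        capped.extend((item, label) for item in items)
    return capped

# Toy data: one common species and one rare species (names are made up).
data = [(f"img_{i}", "sparrow") for i in range(1000)]
data += [(f"rare_{i}", "kakapo") for i in range(5)]
balanced = cap_per_class(data, max_per_class=100)
print(Counter(label for _, label in balanced))
```

Capping shrinks the share of the training budget spent on common classes without discarding any rare-class samples.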

Main Takeaways

Balancing training data by capping common concepts is critical for long-tail performance (improving rare class accuracy from ~10% to ~45% without hurting common classes).
Scaling laws hold for multimodal tasks: training loss follows a power law in training-token count, and downstream metrics improve log-linearly as the loss decreases.
Hybrid RL combining verifiable rewards (e.g., puzzles) and human preference is effective for boosting reasoning capabilities.
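The scaling-law takeaway above amounts to a straight line in log-log space, log L = -a log D + b, where D is the token count and L the loss. The sketch below fits that line with NumPy on synthetic data; the data points, the coefficient values, and the function names are invented for illustration and are not results from the paper.

```python
import numpy as np

def fit_power_law(tokens, losses):
    """Least-squares fit of log L = -a * log D + b; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(tokens), np.log(losses), 1)
    return -slope, intercept

def predict_loss(tokens, a, b):
    """Loss predicted by the fitted power law L = exp(b) * D ** (-a)."""
    return np.exp(b) * np.asarray(tokens, dtype=float) ** (-a)

# Synthetic example: losses generated from L = 5 * D ** -0.1 (made-up numbers).
tokens = np.array([1e9, 1e10, 1e11, 1e12])
losses = 5.0 * tokens ** -0.1
a, b = fit_power_law(tokens, losses)
print(a, np.exp(b))
```

With real training curves the fit would be noisy, and extrapolating far beyond the observed token range should be treated with caution.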

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (ViT and LLM)
Contrastive Learning (CLIP/SigLIP)
Reinforcement Learning from Human Feedback (RLHF)
Mixture-of-Experts (MoE)

Key Terms

MoE: Mixture-of-Experts—a model architecture that activates only a subset of parameters (experts) for each token, increasing capacity without proportional inference cost

RoPE: Rotary Positional Embedding—a method for encoding position information in transformers by rotating the query and key vectors

Seed-ViT: The custom vision encoder used in this paper, designed for native dynamic resolution processing

MIM: Masked Image Modeling—a pre-training task where the model learns to reconstruct masked parts of an image

OCR: Optical Character Recognition—converting images of text into machine-readable text formats

SFT: Supervised Fine-Tuning—training the model on high-quality instruction-response pairs

RLHF: Reinforcement Learning from Human Feedback—fine-tuning models to maximize rewards defined by human preferences

GUI: Graphical User Interface—visual interfaces that the model learns to interact with (clicking, typing)

STEM: Science, Technology, Engineering, and Mathematics—refers here to datasets and tasks involving academic reasoning

ViT: Vision Transformer—an architecture that applies transformers to image patches