
Seed1.5-VL Technical Report

Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, Jingji Chen, Jingjia Huang, Kang Lei, Liping Yuan, Lishu Luo, Pengfei Liu, Qinghao Ye, Rui Qian, Shen Yan, Shixiong Zhao, Shuai Peng, Shuangye Li, Sihang Yuan, Si-Ming Wu, Tianheng Cheng, Weiwei Liu, Wenqian Wang, Xianhan Zeng, Xiao Liu, Xiaobo Qin, et al.
ByteDance
arXiv (2025)
MM Pretraining RL Agent Reasoning Benchmark

📝 Paper Summary

Vision-Language Foundation Models Multimodal Agents Video Understanding
Seed1.5-VL integrates a native-resolution vision encoder with a large mixture-of-experts language model, using dynamic video sampling and hybrid reinforcement learning to achieve state-of-the-art multimodal understanding.
Core Problem
Current vision-language models struggle with fine-grained visual details, 3D spatial understanding, and long-tail concept recognition due to fixed-resolution encoders and imbalanced training data.
Why it matters:
  • Fixed-resolution encoders discard critical details in high-resolution images and OCR tasks, limiting real-world utility
  • Standard pre-training data is heavily skewed toward common concepts, causing models to fail on rare objects or species (the long-tail problem)
  • Existing video encoding methods often use uniform sampling, which is inefficient for long videos and misses rapid temporal events
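The cost of uniform sampling on long videos can be illustrated with a little arithmetic. The sketch below is not from the report: the helper names, the one-hour duration, and the 32-frame budget are hypothetical values chosen only to show how a short event can fall between evenly spaced frames.

```python
def uniform_timestamps(duration_s, num_frames):
    """Timestamps (in seconds) of frames spaced evenly over the video.

    Each frame is taken at the midpoint of its interval. This is an
    illustrative convention, not the report's sampling rule.
    """
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


def event_is_sampled(timestamps, start_s, end_s):
    """True if at least one sampled frame falls inside [start_s, end_s]."""
    return any(start_s <= t <= end_s for t in timestamps)


# Hypothetical setting: a one-hour video with a fixed budget of 32 frames.
timestamps = uniform_timestamps(duration_s=3600, num_frames=32)
gap_s = timestamps[1] - timestamps[0]

# A two-second event starting at t = 100 s lies between two sampled frames.
missed = not event_is_sampled(timestamps, start_s=100, end_s=102)
```

Under these assumed numbers, consecutive frames are far apart compared with the event, so the event goes unseen unless the frame rate is raised locally.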
Concrete Example: In species classification, a model trained on random web data fails to recognize rare animals (10.46% accuracy) because common species dominate the learning budget, whereas Seed1.5-VL's balanced sampling boosts this to 44.85%.
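One simple way to rebalance a skewed dataset is to cap the number of examples drawn per class. This summary does not describe the report's exact balancing scheme, so the sketch below is only an assumption-laden illustration; `balanced_sample` and `per_class_cap` are hypothetical names.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(labels, per_class_cap, seed=0):
    """Return sorted indices with at most `per_class_cap` examples per label.

    Assumption: a per-class cap stands in for the report's balancing
    strategy, which is not specified in this summary.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for idxs in by_label.values():
        rng.shuffle(idxs)                    # pick a random subset per class
        chosen.extend(idxs[:per_class_cap])  # rare classes keep every example
    return sorted(chosen)


# Toy long-tailed label list: one common class and two rare ones.
labels = ["dog"] * 90 + ["cat"] * 8 + ["okapi"] * 2
picked = balanced_sample(labels, per_class_cap=5)
counts = Counter(labels[i] for i in picked)
```

With the cap in place, the common class no longer dominates the sample, while rare classes contribute every example they have.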
Key Novelty
Seed-ViT with Dynamic Frame-Resolution Sampling
  • Uses a vision encoder (Seed-ViT) that natively handles variable image aspect ratios and resolutions using 2D Rotary Positional Embeddings (RoPE), avoiding resizing artifacts
  • Employs a dynamic strategy for video that adjusts both frame rate and spatial resolution based on content complexity, rather than using fixed sampling
  • Integrates 'Hybrid Reinforcement Learning' that combines human feedback (RLHF) with verifiable rewards (e.g., correct answers for puzzles/math) to improve reasoning
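The 2D RoPE idea in the first bullet can be sketched in a few lines. This summary does not give Seed-ViT's exact formulation, so the code assumes one common variant: the feature vector is split in half, the first half is rotated by the patch's row index and the second half by its column index. The function names and `base=10000.0` are assumptions.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed variant: first half encodes the row, second half the column."""
    d = len(vec)
    assert d % 4 == 0, "dimension must be divisible by 4"
    half = d // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors for two patches on an image grid.
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.6, 0.8, -1.2, 0.4, 0.7, -0.2]
score = dot(rope_2d(q, row=3, col=5), rope_2d(k, row=1, col=2))
```

Because each pair is only rotated, the query-key score depends on the row and column offsets between patches rather than on absolute grid positions. Under this assumption, that is why such an encoder can accept arbitrary grid sizes and aspect ratios without resizing the image.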
Evaluation Highlights
  • State-of-the-art performance on 38 out of 60 public benchmarks, including 21 vision-language and 14 video tasks
  • Outperforms OpenAI CUA and Claude 3.7 in agent-centric tasks like GUI control and gameplay
  • Balanced data sampling improves rare concept recognition by +34.39 points compared to random sampling in controlled experiments
Breakthrough Assessment
8/10
Presents a highly capable open-style model (though weights/code availability is limited) with significant architectural optimizations for resolution and video, plus a robust recipe for data synthesis and post-training.