
Seedream 4.0: Toward Next-generation Multimodal Image Generation

Yun Chen, Yu Gao, Lixue Gong, Meng-Hao Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guangchao Shi, Yichun Shi, Shiqi Sun, Yu-Chen Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, et al.
ByteDance
arXiv.org (2025)
MM Pretraining RL Benchmark

📝 Paper Summary

Multimodal Image Generation, Efficient Diffusion Models
Seedream 4.0 integrates text-to-image generation and editing into a unified, efficient Diffusion Transformer (DiT) architecture, achieving sub-2-second high-resolution synthesis via adversarial distillation and specialized quantization.
Core Problem
Current generative models hit a scalability bottleneck, trading model capacity against computational cost, and rarely combine high-fidelity generation with precise multimodal editing in a single efficient framework.
Why it matters:
  • Separating generation and editing into different models fragments the creative workflow and increases deployment complexity
  • High-resolution generation (2K-4K) in existing diffusion transformers (DiTs) is computationally expensive, limiting real-time interaction
  • Purely top-down data sampling strategies in prior work underrepresent fine-grained knowledge concepts like charts and formulas
Concrete Example: In image editing tasks, GPT-Image-1 follows instructions well but fails to preserve the original image structure (consistency), while Gemini 2.5 preserves structure but fails to follow complex style-transfer instructions. Seedream 4.0 balances both.
Key Novelty
Unified Multimodal DiT with Adversarial Acceleration
  • Combines a highly compressed VAE and efficient DiT backbone to reduce token counts, enabling native 1K-4K training
  • Integrates a Vision Language Model (VLM) for prompt engineering and routing, jointly training the system on generation and editing via causal diffusion
  • Employs an adversarial matching framework (ADP/ADM) rather than standard diffusion paths, allowing the model to jump to the final image in very few steps
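To make the token-count argument in the first bullet concrete, here is a minimal arithmetic sketch. The summary does not state the VAE compression ratio or the DiT patch size, so the function name `latent_tokens` and the downsampling factors used below (8 versus 16, patch size 2) are purely illustrative assumptions, not values from the paper.

```python
def latent_tokens(height: int, width: int, vae_downsample: int, patch_size: int = 1) -> int:
    """Count the tokens a DiT would process for one image.

    vae_downsample: spatial downsampling factor of the VAE (assumed value).
    patch_size: side length of the latent patch grouped into one token (assumed value).
    """
    stride = vae_downsample * patch_size
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by vae_downsample * patch_size")
    return (height // stride) * (width // stride)


# Hypothetical comparison at 2048x2048; neither factor is taken from the paper.
baseline = latent_tokens(2048, 2048, vae_downsample=8, patch_size=2)
compressed = latent_tokens(2048, 2048, vae_downsample=16, patch_size=2)

token_ratio = baseline // compressed   # how many times fewer tokens
attention_ratio = token_ratio ** 2     # self-attention cost scales quadratically in tokens
```

Under these assumed factors, doubling the VAE downsampling cuts the token count by 4x and the quadratic self-attention cost by roughly 16x, which is the kind of saving that makes native high-resolution training tractable.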
Evaluation Highlights
  • Achieves inference time of 1.4 seconds for generating a 2K resolution image, enabling near real-time performance
  • Ranks 1st in both single-image editing and text-to-image tracks on the Artificial Analysis Arena leaderboard, outperforming GPT-Image-1 and Flux
  • Outperforms GPT-Image-1 and Gemini 2.5 by almost 20% on the MagicBench 4.0 GSB metric for multi-image editing
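The summary does not define the GSB metric. A common convention for Good/Same/Bad pairwise human evaluation is (Good - Bad) / (Good + Same + Bad); the sketch below assumes that convention, and the function name `gsb_score` and the vote labels are hypothetical. The paper's exact formula may differ.

```python
from collections import Counter
from typing import Iterable


def gsb_score(votes: Iterable[str]) -> float:
    """Compute (good - bad) / total over pairwise judgments.

    Each vote says whether the candidate model's output was "good" (better),
    "same", or "bad" (worse) relative to the baseline. This formula is an
    assumed convention, not one confirmed by the summary.
    """
    counts = Counter(votes)
    unknown = set(counts) - {"good", "same", "bad"}
    if unknown:
        raise ValueError(f"unexpected vote labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes given")
    return (counts["good"] - counts["bad"]) / total


# Made-up votes for illustration only; not data from the paper.
example = gsb_score(["good"] * 6 + ["same"] * 2 + ["bad"] * 2)
```

Under this convention the score ranges from -1 (always worse) to +1 (always better), and 0 means the two models are preferred equally often.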
Breakthrough Assessment
9/10
Significantly consolidates generation and editing into one top-tier model with extreme inference speedups (1.4s for 2K). The architectural integration of VLM, DiT, and adversarial distillation sets a new standard for efficiency and unification.