Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, XU Chi, Jian Cong, Qi Cui, Qide Dong, Junliang Fan, Jing Fang, Zetao Fang, Chen Feng, Han Feng, Mingyuan Gao, Yu Gao, Qiushan Guo, Bo Hao, Qing Hao, Bibo He, Qian He, Tuyen Hoang, Ruoqing Hu, Xi Hu, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Siqi Jiang, Wei Jiang, Yunpu Jiang, et al.
ByteDance Seed
arXiv.org (2025)
MM RL Pretraining Benchmark

📝 Paper Summary

Audio-Visual Generation · Video Generation · Multimodal Foundation Models
Seedance 1.5 pro is a unified foundation model that generates synchronized video and audio simultaneously using a dual-branch architecture refined via reinforcement learning for professional-grade cinematic quality.
Core Problem
Current video generation often produces fragmented visuals lacking sound, or treats audio as a separate post-process, leading to poor synchronization and weak narrative coherence.
Why it matters:
  • Separately generated audio and video often suffer from 'ventriloquism effects' where lip movements do not match speech.
  • Professional content creation (film, ads) requires holistic outputs where sound effects, background music, and dialogue are intrinsically tied to visual events and emotional tone.
  • Existing models struggle with dialect-specific prosody and complex camera movements, limiting their utility for high-end production.
Concrete Example: In a generated clip of a person speaking a specific Chinese dialect, standard models might produce generic lip flaps unrelated to the audio phonemes. Seedance 1.5 pro generates the audio and video jointly, ensuring the lips move in sync with the specific dialect's pronunciation while maintaining consistent facial micro-expressions.
Key Novelty
Native Joint Audio-Visual Generation with RLHF
  • Uses a unified MMDiT (Multimodal Diffusion Transformer) architecture that processes video and audio streams in parallel with cross-modal attention, ensuring temporal lock-step synchronization.
  • Applies RLHF (Reinforcement Learning from Human Feedback) specifically tailored for video-audio tasks, using a multi-dimensional reward model to optimize motion quality, aesthetics, and audio fidelity beyond standard supervised learning.
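The parallel-stream idea in the bullets above can be illustrated with a small sketch. It assumes an MMDiT-style joint attention step, in which each modality keeps its own projection weights and attention runs over the concatenated video and audio token sequence; the summary does not specify the paper's exact attention layout, so this is one plausible reading rather than the authors' implementation. All names (`joint_attention`, `random_matrix`, the `params` layout) are hypothetical, and the sketch uses a single head with no timestep conditioning or positional encoding.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random values standing in for tokens or weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def joint_attention(video_tokens, audio_tokens, params):
    """One single-head joint attention step over both modalities (illustrative)."""
    def project(tokens, p):
        # Each modality uses its own query/key/value weights.
        return matmul(tokens, p["wq"]), matmul(tokens, p["wk"]), matmul(tokens, p["wv"])

    q_v, k_v, v_v = project(video_tokens, params["video"])
    q_a, k_a, v_a = project(audio_tokens, params["audio"])

    # Concatenate along the sequence axis so every token can attend to both streams.
    q, k, v = q_v + q_a, k_v + k_a, v_v + v_a
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(weights, v)

    n_video = len(video_tokens)
    return mixed[:n_video], mixed[n_video:], weights


rng = random.Random(0)
dim = 4
video = random_matrix(6, dim, rng)  # 6 video tokens (assumed sizes)
audio = random_matrix(3, dim, rng)  # 3 audio tokens
params = {
    name: {w: random_matrix(dim, dim, rng) for w in ("wq", "wk", "wv")}
    for name in ("video", "audio")
}
video_out, audio_out, weights = joint_attention(video, audio, params)
print(len(video_out), len(audio_out), len(weights))
```

Because the attention weights span both streams, each updated video token is a mixture that includes audio values and vice versa, which is the mechanism that would let lip motion and phonemes inform each other within a single forward pass.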
Evaluation Highlights
  • Achieves a >10x inference speedup over the unoptimized baseline through a multi-stage distillation framework.
  • Outperforms Veo 3.1 and Kling 2.6 in audio-visual synchronization and Chinese dialect generation according to human side-by-side evaluations.
  • Demonstrates better lip-sync accuracy and 'video vividness' (action/camera/atmosphere) on the new SeedVideoBench 1.5 than its predecessor, Seedance 1.0 Pro.
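For readers unfamiliar with side-by-side human evaluation, the following is a minimal sketch of how such votes are commonly tallied. The summary does not state the paper's actual protocol or metric, so the vote labels, the function name `side_by_side_summary`, and the sample votes are all hypothetical and do not reproduce any reported result.

```python
from collections import Counter


def side_by_side_summary(votes):
    """Summarize pairwise votes, each one of "A", "B", or "tie" (assumed labels)."""
    unknown = set(votes) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected vote labels: {sorted(unknown)}")
    if not votes:
        raise ValueError("no votes to summarize")

    counts = Counter(votes)
    total = len(votes)
    wins, ties, losses = counts["A"], counts["tie"], counts["B"]
    decided = wins + losses
    return {
        "win_rate": wins / total,
        "tie_rate": ties / total,
        "loss_rate": losses / total,
        # Net preference for model A: positive means A is preferred overall.
        "net_preference": (wins - losses) / total,
        # Win rate among votes where raters expressed a preference.
        "win_rate_excluding_ties": wins / decided if decided else None,
    }


# Made-up votes for illustration only: "A" means raters preferred model A.
sample_votes = ["A", "A", "tie", "B", "A", "tie", "A", "B"]
summary = side_by_side_summary(sample_votes)
print(summary)
```

Reporting ties separately matters here: a model can have a modest raw win rate and still be clearly preferred once ties are excluded.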
Breakthrough Assessment
8/10
Significant for integrating native audio-visual generation with a robust RLHF pipeline for video, a relatively new frontier. Strong practical improvements in lip-sync and speed.