Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

📝 Paper Summary

Audio-Visual Generation, Video Generation, Multimodal Foundation Models

Seedance 1.5 pro is a unified foundation model that generates synchronized video and audio simultaneously using a dual-branch architecture refined via reinforcement learning for professional-grade cinematic quality.

Core Problem

Current video generation often produces fragmented visuals lacking sound, or treats audio as a separate post-process, leading to poor synchronization and weak narrative coherence.

Why it matters:

Separately generated audio and video often suffer from 'ventriloquism effects' where lip movements do not match speech.
Professional content creation (film, ads) requires holistic outputs where sound effects, background music, and dialogue are intrinsically tied to visual events and emotional tone.
Existing models struggle with dialect-specific prosody and complex camera movements, limiting their utility for high-end production.

Concrete Example: In a generated clip of a person speaking a specific Chinese dialect, standard models might produce generic lip flaps unrelated to the audio phonemes. Seedance 1.5 pro generates the audio and video jointly, ensuring the lips move in sync with the specific dialect's pronunciation while maintaining consistent facial micro-expressions.

Key Novelty

Native Joint Audio-Visual Generation with RLHF

Uses a unified MMDiT (Multimodal Diffusion Transformer) architecture that processes video and audio streams in parallel with cross-modal attention, ensuring temporal lock-step synchronization.
Applies RLHF (Reinforcement Learning from Human Feedback) specifically tailored for video-audio tasks, using a multi-dimensional reward model to optimize motion quality, aesthetics, and audio fidelity beyond standard supervised learning.

Evaluation Highlights

Achieves >10x inference speedup over the unoptimized baseline through a multi-stage distillation framework.
Outperforms Veo 3.1 and Kling 2.6 in audio-visual synchronization and Chinese dialect generation according to human side-by-side evaluations.
Demonstrates superior lip-sync accuracy and 'video vividness' (action/camera/atmosphere) on the new SeedVideoBench 1.5 compared to its predecessor, Seedance 1.0 Pro.

Breakthrough Assessment

8/10

Significant for integrating native audio-visual generation with a robust RLHF pipeline for video, a relatively new frontier. Strong practical improvements in lip-sync and speed.

⚙️ Technical Details

Problem Definition

Setting: Joint generation of video frames and audio waveforms conditioned on text or image inputs.

Inputs: Text prompt or Reference Image + Text prompt

Outputs: Synchronized Video frames + Audio track

Pipeline Flow

Input Processing (Text/Image Encoders)
Dual-Branch DiT (Joint Video/Audio Denoising)
Decoding (VAE Decoders)
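
The three stages can be read as a simple function composition. The sketch below only illustrates that data flow; every function body is a toy placeholder (the paper does not report the encoders or decoders), and the names `encode_inputs`, `joint_denoise`, `decode`, and `generate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Latents:
    video: List[float]  # stand-in for video latents
    audio: List[float]  # stand-in for audio latents


def encode_inputs(prompt: str) -> List[float]:
    """Placeholder text encoder: maps a prompt to a tiny conditioning vector."""
    return [len(prompt) / 100.0, prompt.count(" ") / 10.0]


def joint_denoise(cond: List[float], steps: int) -> Latents:
    """Placeholder for the dual-branch DiT: both latents are updated in one loop."""
    lat = Latents(video=[1.0] * 4, audio=[1.0] * 2)
    target = sum(cond)
    for _ in range(steps):
        # Toy update: video and audio latents move together toward the same target.
        lat.video = [v + 0.5 * (target - v) for v in lat.video]
        lat.audio = [a + 0.5 * (target - a) for a in lat.audio]
    return lat


def decode(lat: Latents) -> Dict[str, List[float]]:
    """Placeholder VAE decoders, one per modality."""
    return {"frames": list(lat.video), "waveform": list(lat.audio)}


def generate(prompt: str, steps: int = 8) -> Dict[str, List[float]]:
    # Input Processing -> Dual-Branch DiT -> Decoding
    return decode(joint_denoise(encode_inputs(prompt), steps))
```

The point of the sketch is that video and audio latents share one denoising loop and one conditioning signal, rather than audio being produced in a later pass.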

System Modules

Input Encoders

Encode text prompts and/or reference images into latent representations

Model or implementation: Not reported in the paper

Dual-Branch DiT

Jointly denoise video and audio latents with cross-modal interaction

Model or implementation: MMDiT-based architecture

VAE Decoders

Decode latents into pixel-space video and waveform audio

Model or implementation: Not reported in the paper

Novel Architectural Elements

Dual-branch Diffusion Transformer with Cross-Modal Joint Module: Specifically designed to allow deep interaction between visual and auditory streams during the denoising process to ensure synchronization.
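
The paper does not describe the internals of the Cross-Modal Joint Module, so the following is only a generic illustration of joint attention across two token streams, not the authors' design. It assumes each branch attends over the concatenation of video and audio tokens and omits the learned projections a real transformer block would have. The names `attend` and `joint_block` are hypothetical.

```python
import math
from typing import List, Tuple

Token = List[float]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: List[Token], keys: List[Token], values: List[Token]) -> List[Token]:
    """Scaled dot-product attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def joint_block(video: List[Token], audio: List[Token]) -> Tuple[List[Token], List[Token]]:
    """Each branch keeps its own tokens but attends over both modalities."""
    both = video + audio  # concatenate along the sequence axis
    return attend(video, both, both), attend(audio, both, both)
```

Because every video token can place attention weight on audio tokens (and vice versa) at each denoising step, information about timing can flow between the streams while they are still being generated.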

Modeling

Base Model: Seedance 1.5 pro (custom MMDiT architecture)

Training Method: Supervised Fine-Tuning (SFT) followed by RLHF (Reinforcement Learning from Human Feedback)

Objective Functions:

Purpose: Optimize generation toward human preferences in video/audio quality.

Formally: Multi-dimensional reward model (details not explicitly reported)
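
Since the reward formulation is not reported, the sketch below shows just one plausible way to collapse several reward dimensions into a scalar: a weighted average. The dimension names follow the summary (motion quality, aesthetics, audio fidelity); the weights and the function name `combined_reward` are hypothetical.

```python
from typing import Dict

# Hypothetical weights; the paper does not report how reward dimensions are combined.
WEIGHTS: Dict[str, float] = {"motion": 0.4, "aesthetics": 0.3, "audio": 0.3}


def combined_reward(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of per-dimension reward scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing reward dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight
```

A scalar of this kind could then be used to rank candidate generations for the same prompt; other aggregation schemes (per-dimension optimization, thresholds) are equally consistent with what the paper states.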

Adaptation: Not reported in the paper

Trainable Parameters: Not reported in the paper

Training Data:

Multi-stage curation pipeline with advanced captioning
Curriculum-based data scheduling
High-quality audio-video datasets for SFT

Compute: Inference accelerated >10x via distillation; RLHF pipeline optimization yielded 3x training speedup. Exact GPU hours not reported.
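
To make the link between sampling steps and speed concrete, here is a minimal sketch of a generic fixed-step sampler that counts network calls (NFE). It is not the paper's sampler: the velocity field is a toy function, and the step counts 50 and 4 are hypothetical, chosen only to show how a step reduction of that size yields a >10x ratio when per-call cost is constant.

```python
from typing import Callable, Tuple


def euler_sample(
    velocity: Callable[[float, float], float], x0: float, steps: int
) -> Tuple[float, int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final state and the number of function evaluations (NFE).
    """
    x, nfe = x0, 0
    dt = 1.0 / steps
    for i in range(steps):
        x += dt * velocity(x, i * dt)  # one model call per step
        nfe += 1
    return x, nfe


def nfe_speedup(baseline_nfe: int, fast_nfe: int) -> float:
    """Speedup from fewer model calls, assuming constant cost per call."""
    return baseline_nfe / fast_nfe


def toy_velocity(x: float, t: float) -> float:
    return -x  # stand-in for the denoising network


_, slow_nfe = euler_sample(toy_velocity, 1.0, steps=50)  # hypothetical baseline
_, fast_nfe = euler_sample(toy_velocity, 1.0, steps=4)   # hypothetical distilled model
print(nfe_speedup(slow_nfe, fast_nfe))  # prints 12.5
```

In practice the reported speedup also depends on per-call cost, which techniques such as quantization change independently of the step count.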

Comparison to Prior Work

vs. Veo 3.1: Seedance 1.5 pro excels in Chinese dialect generation and lip-sync accuracy.
vs. Sora 2: Seedance 1.5 pro claims better-controlled expressiveness (avoiding over-exaggeration) compared to Sora 2's high emotional variance.
vs. Unimodal Video Models (e.g., older Hunyuan): Seedance 1.5 pro natively generates joint audio-visual streams rather than post-hoc audio generation [not cited in paper].

Limitations

Mastery of specific vocal styles across different opera sub-genres is still evolving.
Evaluation relies heavily on proprietary benchmarks (SeedVideoBench 1.5) and internal user preference studies.
No technical details provided on the specific structure of the 'Cross-Modal Joint Module' or the RLHF reward functions.

Reproducibility

Project page: https://seed.bytedance.com/seedance1_5_pro

Code is not provided. Model is accessible via Volcano Engine (Model ID: Doubao-Seedance-1.5-pro). Training data and hyperparameters are not released. The paper serves as a technical report rather than a reproducible research paper.

📊 Experiments & Results

Evaluation Setup

Evaluation on SeedVideoBench 1.5 using both human raters and expert evaluation.

Benchmarks:

SeedVideoBench 1.5 (Video and Audio Generation Evaluation) [New]

Metrics:

Absolute Score (1-5 Likert scale)
GSB (Good-Same-Bad) preference ratio
Statistical methodology: Not explicitly reported in the paper
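
The paper does not spell out how the GSB ratio is computed. One common convention, assumed here, is (Good − Bad) / (Good + Same + Bad), which ranges from −1 (always worse) to +1 (always better). The function name `gsb_score` and the vote labels are hypothetical.

```python
from collections import Counter
from typing import Iterable

VALID_VOTES = {"good", "same", "bad"}


def gsb_score(votes: Iterable[str]) -> float:
    """Net preference under the assumed convention (G - B) / (G + S + B)."""
    counts = Counter(votes)
    unknown = set(counts) - VALID_VOTES
    if unknown:
        raise ValueError(f"unexpected vote labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes supplied")
    return (counts["good"] - counts["bad"]) / total


# Toy votes, not data from the paper: 6 good, 3 same, 1 bad.
print(gsb_score(["good"] * 6 + ["same"] * 3 + ["bad"]))  # prints 0.5
```

A positive score means raters preferred the evaluated model more often than the comparison model; "same" votes dilute the score toward zero.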

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Video generation performance shows Seedance 1.5 pro achieving top-tier scores in absolute evaluations, particularly in instruction following.
SeedVideoBench 1.5 (T2V)	Absolute Score (Instruction Following)	4.15	4.30	+0.15
SeedVideoBench 1.5 (T2V)	Absolute Score (Visual Aesthetic)	4.32	4.28	-0.04
Audio generation performance (GSB) indicates strong preference for Seedance 1.5 pro in Chinese language contexts and lip-sync.
SeedVideoBench 1.5 (Audio T2V)	GSB (vs Veo 3.1)	Not reported in the paper	Not reported in the paper	Not reported in the paper

Main Takeaways

Seedance 1.5 pro demonstrates a clear advantage in audio-visual synchronization, particularly for lip-syncing and mitigating ventriloquism effects.
The model significantly outperforms competitors (Veo 3.1, Kling 2.6) in Chinese-language specific tasks, including dialects and cultural nuances like opera.
Inference acceleration techniques (distillation, quantization) successfully reduce NFE and boost speed by >10x without major quality degradation.
While Sora 2 shows higher variance in emotional expressiveness, Seedance 1.5 pro prioritizes stability and narrative coherence, which is preferred for professional workflows.

📚 Prerequisite Knowledge

Prerequisites

Diffusion Transformers (DiT)
Reinforcement Learning from Human Feedback (RLHF)
Latent Diffusion Models
Model Distillation

Key Terms


MMDiT: Multimodal Diffusion Transformer—a neural network architecture that processes multiple modalities (like audio and video) using separate branches that communicate via attention mechanisms

RLHF: Reinforcement Learning from Human Feedback—a training method where a model is fine-tuned to maximize a reward signal derived from human preferences

SFT: Supervised Fine-Tuning—training the model on high-quality labeled datasets to establish baseline capabilities before RLHF

NFE: Number of Function Evaluations—the number of times the model's neural network is called during the generation process; fewer evaluations mean faster generation

GSB: Good-Same-Bad—a comparative evaluation metric where raters decide if one model's output is better, the same, or worse than another's

T2VA: Text-to-Video-Audio—generating both video and audio from a text description

I2VA: Image-to-Video-Audio—generating video and audio starting from a reference image

Dolly Zoom: A cinematic effect where the camera moves closer/further while zooming in the opposite direction, creating a warping perspective

Distillation: A compression technique where a smaller or faster model (student) learns to mimic the behavior of a larger or more complex model (teacher)

Orchid hand gesture: A stylized hand gesture used in traditional Chinese opera, used here to demonstrate the model's capability in generating culturally specific nuances

Nianbai: A form of spoken dialogue in Chinese opera, distinct from singing, used to test the model's handling of specific vocal styles