Movie Gen: A Cast of Media Foundation Models

📝 Paper Summary

Text-to-Video Generation, Video-to-Audio Generation, Video Editing, Video Personalization

Movie Gen demonstrates that scaling standard Transformers with Flow Matching to 30B parameters achieves state-of-the-art results in HD video generation, personalization, editing, and synchronized audio without complex diffusion schedules.

Core Problem

Current commercial video generation systems often lack integrated capabilities for precise editing and personalization, and rely on complex diffusion noise schedules that are difficult to tune.

Why it matters:

Human imagination seamlessly composes motion, physics, and audio, but AI models typically treat these as separate, disconnected tasks.
Existing video models often struggle with temporal consistency and precise instruction-following (e.g., editing a specific object without changing the background).
Standard diffusion models do not guarantee zero terminal signal-to-noise ratio, requiring ad-hoc modifications for video generation.

Concrete Example: A user wants to generate a video of a specific person (personalization) and then 'add tinsel streamers to the lantern' (editing). Current systems might generate a random person or completely alter the scene composition when attempting the edit, whereas Movie Gen preserves identity and scene structure.

Key Novelty

Unified Scaling of Simple Flow Matching Transformers

Replaces complex diffusion schedules with Flow Matching, which naturally ensures zero terminal signal-to-noise ratio and simplifies training large-scale media models.
Employs a 'cast' of specialized but compatible foundation models (Video, Audio, Personalization, Editing) rather than a single monolithic black box.
Utilizes a massive 30B parameter Transformer backbone (Movie Gen Video) trained on internet-scale data, proving simple architectures scale effectively for video.
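
The zero terminal SNR property can be checked numerically. The sketch below assumes the linear interpolation path X_t = t X_1 + (1 - (1-σ_min)t) X_0, whose time derivative is the velocity target used in the training objective, and an illustrative σ_min of 1e-5. The function names are hypothetical and not taken from the paper.

```python
def fm_path_coefficients(t, sigma_min=1e-5):
    """Return (data_coef, noise_coef) such that X_t = data_coef * X_1 + noise_coef * X_0.

    Assumed linear path: X_t = t * X_1 + (1 - (1 - sigma_min) * t) * X_0.
    """
    data_coef = t
    noise_coef = 1.0 - (1.0 - sigma_min) * t
    return data_coef, noise_coef


def snr(t, sigma_min=1e-5):
    """Signal-to-noise ratio of X_t for unit-variance data and noise."""
    data_coef, noise_coef = fm_path_coefficients(t, sigma_min)
    return (data_coef / noise_coef) ** 2


# t = 0 is the pure-noise end of the path, t = 1 is the data end.
for t in (0.0, 0.5, 1.0):
    print(f"t={t:.1f}  SNR={snr(t):.3e}")
```

At t = 0 the data coefficient is exactly zero, so the starting point of sampling carries no signal by construction; no rescaling of a noise schedule is needed to get there.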

Architecture

The joint image and video generation pipeline using Flow Matching and Temporal Autoencoder (TAE).

Evaluation Highlights

Scales text-to-video generation to 30B parameters, supporting 1080p HD video at 16 frames-per-second for up to 16 seconds.
Generates 48kHz high-quality cinematic audio (sound effects and music) synchronized with video using a 13B parameter audio model.
Claims state-of-the-art overall video quality against Runway Gen3, LumaLabs, and OpenAI Sora (specific scores not in snippet).

Breakthrough Assessment

9/10

A massive engineering feat scaling video generation to 30B parameters with comprehensive capabilities (audio, editing, personalization) that matches or exceeds top commercial systems.

⚙️ Technical Details

Problem Definition

Setting: Conditional generation of high-definition video and audio from text prompts, optionally conditioned on images (for personalization) or source videos (for editing).

Inputs: Text prompt P, optional reference image (for personalization), optional source video (for editing)

Outputs: High-definition video V (up to 1080p) with synchronized audio A (48kHz)

Pipeline Flow

Input Processing: Text Encoders (UL2, ByT5, MetaCLIP) → Text Embeddings
Compression: Input Video → Temporal Autoencoder (TAE) → Latent Code
Generation: Latent Code + Text Embeddings + Noise → Movie Gen Video Transformer (Flow Matching) → Predicted Velocity → ODE Solver → Predicted Latents
Decoding: Predicted Latents → TAE Decoder → Pixel Video
Upsampling: Pixel Video → Spatial Upsampler → 1080p HD Video
Audio Generation: Video Input + Text Prompt → Movie Gen Audio → 48kHz Audio
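
The generation step turns predicted velocities into latents by integrating an ODE from noise (t = 0) to data (t = 1). The following is a minimal sketch of that idea using a first-order Euler solver with uniform steps. The toy velocity function stands in for the Movie Gen Video Transformer and is purely hypothetical; the solver type, step count, and step schedule are assumptions, not the paper's settings.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity_fn: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int = 50) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector, sigma_min: float = 1e-5):
    """Closed-form velocity field for a dataset containing one point, `target`.

    Hypothetical stand-in for the learned model: it pushes every sample
    along a straight line toward `target`.
    """
    def velocity(x: Vector, t: float) -> Vector:
        denom = 1.0 - (1.0 - sigma_min) * t
        return [(x1 - (1.0 - sigma_min) * xi) / denom
                for x1, xi in zip(target, x)]
    return velocity


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]   # X_0: Gaussian noise
target = [0.5, -1.0, 2.0, 0.0]                    # the single "data" latent
sample = euler_sample(make_toy_velocity(target), noise, num_steps=50)
print(sample)  # lands very close to `target`
```

In the real pipeline the output of this loop would be passed to the TAE decoder; here it is just a list of floats.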

System Modules

Text Encoders

Convert text prompts into rich semantic and character-level embeddings

Model or implementation: Ensemble of UL2, ByT5, and Long-prompt MetaCLIP

Temporal Autoencoder (TAE)

Compress raw video into a spatio-temporally compact latent space

Model or implementation: Inflated Image VAE with 1D temporal convolutions/attention

Movie Gen Video

Generate video latents from noise conditioned on text

Model or implementation: 30B Parameter Transformer (LLaMa3-based, bidirectional attention)

Spatial Upsampler

Upscale generated video to full HD

Model or implementation: 7B Parameter Transformer (Video-to-Video)

Movie Gen Audio

Generate synchronized audio tracks

Model or implementation: 13B Parameter Transformer

Novel Architectural Elements

Use of Outlier Penalty Loss in Temporal Autoencoder to prevent high-norm latent artifacts ('spots').
Factorized learnable positional embeddings allowing arbitrary aspect ratios and lengths.
Integration of Flow Matching with a massive 30B standard Transformer backbone (unlike standard diffusion U-Nets).
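
To illustrate why factorized positional embeddings adapt to arbitrary sizes, the sketch below keeps one learnable table per axis (time, height, width) and sums the three lookups for each token. Summing per-axis embeddings is an assumption about how the factorization is done, and the class name, table sizes, and initialization are hypothetical.

```python
import random
from typing import List


def make_table(size: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Randomly initialized stand-in for a learnable embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(size)]


class FactorizedPositionalEmbedding:
    """One embedding table per axis; a token's embedding is the sum of three lookups."""

    def __init__(self, max_t: int, max_h: int, max_w: int, dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        self.t_table = make_table(max_t, dim, rng)
        self.h_table = make_table(max_h, dim, rng)
        self.w_table = make_table(max_w, dim, rng)

    def __call__(self, t: int, h: int, w: int) -> List[float]:
        return [a + b + c for a, b, c in
                zip(self.t_table[t], self.h_table[h], self.w_table[w])]

    def num_parameters(self) -> int:
        # Grows with max_t + max_h + max_w, not with their product.
        return (len(self.t_table) + len(self.h_table) + len(self.w_table)) * self.dim


pos = FactorizedPositionalEmbedding(max_t=4, max_h=6, max_w=8, dim=16)
# Any (frames, height, width) grid within the table limits can be embedded,
# so a 2x6x8 clip and a 4x3x8 clip share the same parameters.
embedding = pos(t=1, h=2, w=3)
print(len(embedding), pos.num_parameters())
```

A joint table over every (t, h, w) position would need max_t * max_h * max_w rows, which is why the factorized form is attractive for variable aspect ratios and durations.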

Modeling

Base Model: Movie Gen Video (30B parameters) and Movie Gen Audio (13B parameters)

Training Method: Flow Matching with Optimal Transport path

Objective Functions:

Purpose: Train the model to predict the velocity vector field that transforms noise to data.

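Formally: L_FM(θ) = E[ || v_t(X_t) - (X_1 - (1-σ_min)X_0) ||^2 ], where X_0 is Gaussian noise, X_1 is a data latent, and X_t = t X_1 + (1 - (1-σ_min)t) X_0 is the point on the straight path between them at time t in [0, 1].
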
Purpose: Penalize high-norm outliers in the VAE latent space to prevent decoding artifacts.

Formally: L_OPL = mean(max(0, |x| - r)^2)
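
Both objectives can be sketched in a few lines. The code below is a single-sample estimate of L_FM and a direct transcription of the L_OPL formula as written above. It assumes the interpolation path X_t = t X_1 + (1 - (1-σ_min)t) X_0 and a σ_min of 1e-5; the predictor functions are hypothetical stand-ins for the model, and the paper's exact normalization of the outlier penalty may differ from this simplified form.

```python
import random
from typing import Callable, List

Vector = List[float]


def flow_matching_loss(predict_velocity: Callable[[Vector, float], Vector],
                       x1: Vector, rng: random.Random,
                       sigma_min: float = 1e-5) -> float:
    """Single-sample Monte Carlo estimate of L_FM for one data latent x1."""
    t = rng.random()                                   # t ~ U[0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]             # Gaussian noise
    xt = [t * a + (1.0 - (1.0 - sigma_min) * t) * b for a, b in zip(x1, x0)]
    target = [a - (1.0 - sigma_min) * b for a, b in zip(x1, x0)]
    pred = predict_velocity(xt, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target))  # squared L2 norm


def outlier_penalty_loss(latents: Vector, r: float = 3.0) -> float:
    """mean(max(0, |x| - r)^2): zero inside [-r, r], quadratic outside."""
    return sum(max(0.0, abs(x) - r) ** 2 for x in latents) / len(latents)


data_latent = [0.3, -1.2, 0.8]


def oracle(xt: Vector, t: float, sigma_min: float = 1e-5) -> Vector:
    """Exact velocity when the dataset is the single point `data_latent`."""
    denom = 1.0 - (1.0 - sigma_min) * t
    return [(a - (1.0 - sigma_min) * x) / denom for a, x in zip(data_latent, xt)]


def zero_predictor(xt: Vector, t: float) -> Vector:
    return [0.0] * len(xt)


print(flow_matching_loss(oracle, data_latent, random.Random(0)))          # near zero
print(flow_matching_loss(zero_predictor, data_latent, random.Random(0)))  # clearly positive
print(outlier_penalty_loss([0.0, 1.0, -5.0, 3.0]))  # only -5.0 exceeds r=3 -> 1.0
```

With the listed outlier_loss_weight of 1e5, even a small penalty value dominates the autoencoder loss, which is consistent with its role of suppressing rare high-norm latents.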

Training Data:

Pre-training: ~100M videos and ~1B images
Audio Pre-training: ~1M hours of audio
SFT: Curated high-quality video-text pairs

Key Hyperparameters:

max_context_length: 73,000 tokens
video_compression_factor: 8x spatial, 8x temporal
vae_channels: 16
spatial_resolution_training: 768px (upsampled to 1080p)
inference_fps: 16 or 24
outlier_loss_weight: 1e5
outlier_threshold_r: 3
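
The context length follows from the other hyperparameters by simple arithmetic. The sketch below assumes a square 768x768 frame, a 256-frame clip (16 seconds at 16 fps), and a 2x2 spatial patchification of the latents before they enter the Transformer. The patch size and the square frame are assumptions not stated in this summary, and the function name is illustrative.

```python
def latent_token_count(frames: int, height: int, width: int,
                       temporal_compression: int = 8,
                       spatial_compression: int = 8,
                       patch_size: int = 2) -> int:
    """Number of Transformer tokens for one clip after TAE compression and patchification."""
    latent_frames = frames // temporal_compression
    latent_h = height // spatial_compression
    latent_w = width // spatial_compression
    # Each patch_size x patch_size block of latent pixels becomes one token.
    return latent_frames * (latent_h // patch_size) * (latent_w // patch_size)


# 16 s * 16 fps = 256 frames -> 32 latent frames; 768 px -> 96 latent px -> 48 patches.
tokens = latent_token_count(frames=16 * 16, height=768, width=768)
print(tokens)  # 32 * 48 * 48 = 73728
```

Under these assumptions the result, 73,728, matches the reported max_context_length of roughly 73,000 tokens.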

Compute: Training used up to 6,144 H100 GPUs (700W TDP) on Meta Grand Teton servers.

Comparison to Prior Work

vs. OpenAI Sora: Movie Gen uses Flow Matching instead of standard Diffusion, ensuring zero terminal SNR naturally.
vs. Runway Gen3: Movie Gen supports precise instruction-based editing and explicit personalization via post-training, features currently limited or absent in Gen3.
vs. Latent Diffusion Models (Stable Video Diffusion) [not cited in paper]: Movie Gen uses a 30B parameter Transformer backbone and Flow Matching rather than a U-Net with DDPM/DDIM schedules.

Limitations

Inference on 30B models with 73K context length is computationally expensive.
Spatial Upsampler relies on sliding windows, requiring MultiDiffusion to fix boundary inconsistencies.
Training requires massive infrastructure (6k+ H100 GPUs), limiting accessibility.
Specific quantitative win-rates against competitors are claimed in tables not visible in the text snippet.

Reproducibility

Code not provided. Weights not released. The paper provides extensive architectural details and hyperparameter tables (e.g., Table 1 in paper) but relies on proprietary internet-scale datasets (100M videos, 1B images) making direct replication impossible for most researchers.

📊 Experiments & Results

Evaluation Setup

Evaluation on newly constructed benchmarks for video and audio quality, utilizing human evaluation.

Benchmarks:

Movie Gen Video Bench (Text-to-Video generation quality) [New]
Movie Gen Audio Bench (Video-to-Audio generation quality) [New]

Metrics:

Video Quality (Human Eval)
Audio Quality (Human Eval)
Text Adherence
Audio-Video Synchronization
Statistical methodology: Not explicitly reported in the paper

Key Results

The paper claims state-of-the-art performance across multiple tasks, but the specific numeric scores from Tables 6, 16, 18, 31, and 32 are not present in the provided text snippet. The following reflects the scale of the models validated.

Benchmark	Metric	Baseline	This Paper	Δ
Parameter Count	Parameters	Not reported in the paper	30,000,000,000	Not applicable
Training Context	Tokens	Not reported in the paper	73,000	Not applicable

Experiment Figures

Comparison of decoding artifacts with and without Outlier Penalty Loss (OPL)

Main Takeaways

Scaling simple architectures works: A standard LLaMa3-like Transformer with Flow Matching scales to 30B parameters and produces SOTA quality.
Unified capability: A single family of models covers video generation, editing, personalization, and audio, and is reported to outperform separate commercial systems (Runway, Sora, Pika, ElevenLabs) on their respective tasks.
Flow Matching is robust: It outperforms diffusion losses for video generation by naturally handling the zero terminal SNR requirement.
Pre-training scale is critical: The model uses 100M videos and 1B images to learn physics, motion, and audio-visual associations.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
Flow Matching vs. Diffusion Models
Latent Space representations (Autoencoders)
Model Parallelism (Tensor, Sequence, Context parallelism)

Key Terms

Flow Matching: A generative modeling framework that learns a velocity field transporting samples from a simple prior distribution (such as Gaussian noise) to the data distribution; it is often simpler to train than diffusion.

TAE: Temporal Autoencoder—a neural network that compresses video data spatially and temporally into a compact latent representation for efficient processing.

SFT: Supervised Fine-Tuning—training a pre-trained model on a smaller, high-quality dataset to improve instruction following and output quality.

Diegetic Audio: Sound that originates from a source within the video's world (e.g., footsteps, dialogue), as opposed to background music (non-diegetic).

Bi-directional Attention: An attention mechanism where every token can attend to every other token in the sequence, unlike causal attention used in text generation where tokens only attend to the past.
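
The difference between the two attention patterns is easiest to see as a boolean mask. This is a minimal sketch with a hypothetical helper name; entry [q][k] is True when query token q may attend to key token k.

```python
from typing import List


def attention_mask(seq_len: int, causal: bool) -> List[List[bool]]:
    """Build a seq_len x seq_len mask; mask[q][k] is True if q may attend to k."""
    return [[(not causal) or (k <= q) for k in range(seq_len)]
            for q in range(seq_len)]


bidirectional = attention_mask(3, causal=False)
causal = attention_mask(3, causal=True)

for row in bidirectional:
    print(row)   # every entry is True
for row in causal:
    print(row)   # lower-triangular: tokens see only themselves and the past
```

For video latents there is no natural left-to-right order across space and time, which is why the full (bidirectional) mask is the natural choice.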

Latent Space: A compressed representation of data (images/video) where the generative model operates, reducing computational complexity compared to pixel space.

FSDP: Fully Sharded Data Parallel—a technique to distribute model parameters, gradients, and optimizer states across multiple GPUs to train models larger than single-GPU memory.

RoCE RDMA: RDMA over Converged Ethernet—a network protocol allowing direct memory access between GPU servers for high-speed training communication.

ODE solver: Ordinary Differential Equation solver—an algorithm used during inference in Flow Matching to compute the trajectory from noise to the final image/video.

SwiGLU: A specific activation function used in modern Transformers (like Llama) that combines the Swish activation with Gated Linear Units.
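
As a concrete illustration, SwiGLU multiplies a Swish-activated "gate" projection elementwise with a plain linear "value" projection. The sketch below uses Swish with beta = 1 and omits bias terms; the helper names and the tiny identity weight matrices are hypothetical and chosen only to keep the example readable.

```python
import math
from typing import List


def swish(x: float) -> float:
    """Swish with beta = 1 (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x: List[float], weight: List[List[float]]) -> List[float]:
    """Compute x @ weight for a weight of shape [len(x)][out_dim]."""
    out_dim = len(weight[0])
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(out_dim)]


def swiglu(x: List[float], w_gate: List[List[float]],
           w_value: List[List[float]]) -> List[float]:
    """SwiGLU(x) = Swish(x @ w_gate) * (x @ w_value), elementwise product."""
    gate = [swish(g) for g in matvec(x, w_gate)]
    value = matvec(x, w_value)
    return [g * v for g, v in zip(gate, value)]


identity = [[1.0, 0.0], [0.0, 1.0]]
print(swiglu([1.0, 2.0], identity, identity))
```

In a Transformer feed-forward block this gated product is typically followed by a second linear projection back to the model dimension.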