V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

📝 Paper Summary

Video-to-Music Generation, Multimodal Generation, Zero-shot Learning

V2M-Zero enables temporally synchronized video-to-music generation without paired training data by aligning intra-modal event curves—representations of when change occurs—across video and music modalities.

Core Problem

Existing text-to-music models lack fine-grained temporal control for video, and current video-to-music models require large paired datasets that are often noisy or copyright-restricted.

Why it matters:

Content creators currently must manually edit videos to fit generated music, a tedious process hindering real-world adoption
Paired video-music datasets from the internet often contain vocals, imperfect mixing, or entangle semantic and temporal controls, limiting high-fidelity generation
Bridging modalities via text prompts alone (using LLMs) captures mood but fails to specify exact timing of beats or scene cuts

Concrete Example: In a product promotion video, musical cues (beats, dynamic changes) must align exactly with visual reveals or motion highlights. Current T2M models might generate the right 'upbeat' style but place the beat drops at random times unrelated to the visual cuts.

Key Novelty

Zero-Pair Synchronization via Event Curves

Decouples 'when' change occurs from 'what' changes: uses intra-modal similarity (self-similarity) to create 1D event curves that represent the tempo/structure of change independently of the modality (video or music)
Trains a music generator conditioned on music-derived event curves, then swaps in video-derived event curves at inference time to achieve synchronization without ever seeing paired video-music data

Architecture

The training and inference pipeline. Left side: Training on music with MusicFM extracting curves. Right side: Inference on video with Visual Encoder extracting curves. Center: The Flow Matching Transformer.

Evaluation Highlights

+21% to +52% improvement in temporal synchronization (interactive alignment) over paired-data baselines across OES-Pub, MovieGenBench-Music, and AIST++
+28% higher beat alignment specifically on AIST++ dance videos compared to baselines
5–21% better audio quality and semantic alignment (FAD and CLAP scores) compared to supervised methods trained on paired data

Breakthrough Assessment

8/10

Significantly outperforms supervised baselines without using paired data, offering a clever structural solution (event curves) to the modality gap. Solving fine-grained synchronization zero-shot is a strong contribution.

⚙️ Technical Details

Problem Definition

Setting: Zero-pair video-to-music generation

Inputs: Video frames

Outputs: Background music track temporally aligned with video events (cuts, motions) and semantically aligned via text captions

Pipeline Flow

Visual Feature Extraction (Inference) / Music Feature Extraction (Training)
Event Curve Computation (Standardization & Smoothing)
Text Prompt Generation (via LLM summarizing video captions)
Rectified Flow Generation (Conditioned on Text + Event Curve)

System Modules

Feature Encoder (Input Processing)

Extract semantic features to calculate temporal change

Model or implementation: DINOv2-L (Video) / MusicFM (Music)

Event Curve Computer (Input Processing)

Compute 1D signal of temporal change

Model or implementation: Deterministic algorithm (Cosine Similarity + Standardization + Hann Smoothing)
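The three named steps (cosine similarity, standardization, Hann smoothing) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's code: the change signal is taken as 1 minus the cosine similarity of consecutive feature vectors, the window length of 5 is a hypothetical default, and the function names are invented for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def hann_window(size):
    """Hann window with the zero endpoints trimmed, normalized to sum to 1."""
    w = [0.5 - 0.5 * math.cos(2 * math.pi * (n + 1) / (size + 1)) for n in range(size)]
    total = sum(w)
    return [v / total for v in w]

def event_curve(features, window=5):
    """1D curve of temporal change for a sequence of per-frame features.

    Assumption: 'change' is 1 - cosine similarity between consecutive frames.
    """
    # Step 1: change magnitude between consecutive feature vectors.
    change = [1.0 - cosine_similarity(features[i - 1], features[i])
              for i in range(1, len(features))]
    # Step 2: standardize to zero mean and unit variance so that curves from
    # different encoders (video or music) live on the same scale.
    mean = sum(change) / len(change)
    var = sum((c - mean) ** 2 for c in change) / len(change)
    std = math.sqrt(var) + 1e-8
    z = [(c - mean) / std for c in change]
    # Step 3: smooth with a Hann window (edge-padded, output keeps len(z)).
    w = hann_window(window)
    half = window // 2
    padded = [z[0]] * half + z + [z[-1]] * half
    return [sum(w[k] * padded[i + k] for k in range(window)) for i in range(len(z))]

# Toy input: five identical frames, then a hard "cut" to five different frames.
features = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5
curve = event_curve(features)
peak = curve.index(max(curve))  # index 4 is the step from frame 4 to frame 5
```

Because the same function applies to any sequence of feature vectors, a curve computed from music features at training time and one computed from video features at inference share the same statistics, which is what makes the swap plausible.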

Text Conditioner (Generation)

Provide semantic/mood control

Model or implementation: Gemma-3B (Text Encoder) + LLM Summarizer (Vibe)

Flow Generator (Generation)

Generate music latents

Model or implementation: DiT (Diffusion Transformer, ~1B parameters)

Audio Decoder (Generation)

Convert latents to waveform

Model or implementation: Pretrained Audio Autoencoder

Novel Architectural Elements

Cross-modal substitution mechanism: Training on Music-Event Curves and swapping for Video-Event Curves at inference without retraining
Standardization and smoothing pipeline specifically designed to make video and music temporal dynamics statistically comparable

Modeling

Base Model: Pretrained DiT-based Rectified Flow Model (~1B params)

Training Method: Supervised Fine-Tuning (Flow Matching)

Objective Functions:

Purpose: Learn velocity field to transport noise to data.

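Formally: L(θ) = E[ || v_θ(x_t, c, e) - (x_0 - ε) ||^2 ], where x_0 is the clean music latent, ε is Gaussian noise, x_t is the interpolation between them at time t, c is the text condition, and e is the event curve.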
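A minimal sketch of this objective in plain Python. It assumes the linear interpolation x_t = t·x_0 + (1 − t)·ε, so that the target velocity x_0 − ε matches the formula above and the flow runs from noise (t = 0) to data (t = 1). The names velocity_fn and flow_matching_loss, the explicit t argument, and the list-based latents are hypothetical stand-ins, not the paper's implementation.

```python
import random

def flow_matching_loss(velocity_fn, x0, text_cond, event_curve, rng):
    """Single-sample estimate of the flow-matching loss.

    velocity_fn(x_t, t, text_cond, event_curve) plays the role of v_theta.
    Assumed convention: t = 0 is pure noise, t = 1 is data.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise sample
    t = rng.random()                          # time drawn uniformly from [0, 1)
    # Point on the straight line between noise and data.
    x_t = [t * a + (1.0 - t) * e for a, e in zip(x0, eps)]
    # Target velocity along that line: x_0 - eps.
    target = [a - e for a, e in zip(x0, eps)]
    pred = velocity_fn(x_t, t, text_cond, event_curve)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(x0)

# Toy check: an oracle that already knows x0 recovers the exact velocity,
# since x0 - eps = (x0 - x_t) / (1 - t) on the straight path.
x0 = [0.5, -1.0, 2.0]
oracle = lambda x_t, t, c, e: [(a - b) / (1.0 - t) for a, b in zip(x0, x_t)]
loss = flow_matching_loss(oracle, x0, text_cond=None, event_curve=None,
                          rng=random.Random(0))
```

In training, the expectation in the formula would be approximated by averaging this quantity over a batch of latents, noise samples, and time steps.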

Adaptation: Fine-tuning of full model (plus new projection layer for event curve)

Trainable Parameters: Full DiT (~1B params) + 2048 new input projection parameters

Training Data:

~25k hours of licensed instrumental music-text pairs

Key Hyperparameters:

learning_rate: 1e-4
optimizer: AdamW
batch_size: Not reported in the paper
sampling_steps: 96
guidance: Classifier-free guidance (10% condition dropout)

Compute: 192-768 GPU hours (2-4 days on 4-8 A100 GPUs)

Comparison to Prior Work

vs. CMT: V2M-Zero generates high-fidelity audio directly, not symbolic MIDI
vs. M2UD/V-Musician: V2M-Zero requires NO paired video-music data, avoiding noisy internet data issues
vs. Sonique: V2M-Zero uses explicit temporal curves for fine-grained sync, whereas Sonique relies on text prompts which lack temporal precision

Limitations

Relies on the assumption that visual changes (cuts) should map to musical changes (beats), which holds often but not always
Performance depends on the quality of the underlying pretrained T2M model
Requires explicit feature extraction steps at inference, adding slight overhead compared to end-to-end models
Visual event curves might need different encoders for specific domains (e.g., motion trackers for dance vs. foundation models for general video)

Reproducibility

Code: https://genjib.github.io/v2m_zero/

Publicly available at the project page. The method uses a proprietary/internal pretrained T2M model as its base (similar to recent work like Dragon or Presto), which may hinder exact reproduction from scratch without access to that specific checkpoint. The encoders (MusicFM, DINOv2) and the text model (Gemma-3B) are open components.

📊 Experiments & Results

Evaluation Setup

Video-to-music generation across general, cinematic, and dance domains

Benchmarks:

OES-Pub (Cinematic/Soundtrack generation)
MovieGenBench-Music (General video music generation)
AIST++ (Dance video music generation)

Metrics:

FAD (Fréchet Audio Distance) - Audio Quality
CLAP Score - Semantic Alignment
ImageBind-Align - Semantic Alignment
Beat Align Score - Temporal Synchronization (Dance)
Interactive Alignment - Temporal Synchronization (General)
Statistical methodology: A large crowd-sourced subjective listening test (user study) is included

Key Results

V2M-Zero demonstrates superior temporal synchronization and audio quality compared to paired baselines across multiple benchmarks.
Benchmark	Metric	Baseline	This Paper	Δ
AIST++	Beat Align Score	0.201	0.258	+0.057
MovieGenBench-Music	Interactive Alignment (Temporal)	3.12	4.75	+1.63
OES-Pub	FAD (Lower is better)	3.85	3.15	-0.70
OES-Pub	CLAP Score (Semantic)	0.21	0.24	+0.03
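The absolute deltas in these rows can be converted to relative changes with a few lines of Python. The numbers are copied from the rows above; the helper name relative_change is just for this example.

```python
# (benchmark, metric, baseline, this paper), copied from the results rows.
rows = [
    ("AIST++", "Beat Align Score", 0.201, 0.258),
    ("MovieGenBench-Music", "Interactive Alignment", 3.12, 4.75),
    ("OES-Pub", "FAD (lower is better)", 3.85, 3.15),
    ("OES-Pub", "CLAP Score", 0.21, 0.24),
]

def relative_change(baseline, ours):
    """Percentage change of `ours` relative to `baseline`."""
    return 100.0 * (ours - baseline) / baseline

for benchmark, metric, baseline, ours in rows:
    print(f"{benchmark:<20} {metric:<24} {relative_change(baseline, ours):+.1f}%")
```

This gives roughly +28.4%, +52.2%, -18.2% (a reduction, which is an improvement for FAD), and +14.3%, in line with the +28% beat alignment gain and the upper end of the +21% to +52% synchronization range quoted in the highlights.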

Experiment Figures

Visualization of event curves from music and video, showing how peaks align with events.

Main Takeaways

Zero-pair training (via event curves) outperforms supervised paired training on both temporal alignment and audio quality metrics.
The method is flexible: simply swapping the visual encoder (e.g., CoTracker for dance vs DINOv2 for scenes) adapts the model to different video domains without retraining.
Standardization and smoothing of event curves are critical for bridging the modality gap between music and video features.
Decoupling 'when' (event curve) and 'what' (text prompt) allows for independent and precise control over both timing and style.

📚 Prerequisite Knowledge

Prerequisites

Rectified Flow Matching / Diffusion Models
Cross-modal retrieval/generation
Audio signal processing (waveforms, latents)

Key Terms

event curve: A 1D temporal signal representing the magnitude of change over time, calculated via cosine similarity between consecutive feature vectors within a single modality

rectified flow: A generative model that learns a transport map (velocity field) to transform a simple prior distribution (noise) into the data distribution via an ordinary differential equation (ODE)
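As an illustration of the ODE view, the sketch below integrates a velocity field with fixed-step Euler updates, using 96 steps to mirror the sampling_steps value listed above. Whether the paper uses an Euler solver is an assumption; euler_sample and the toy velocity field are hypothetical, and the time convention (t = 0 noise, t = 1 data) is chosen for this example.

```python
def euler_sample(velocity_fn, noise, num_steps=96):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        # One explicit Euler step along the predicted velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy velocity field that always points straight at a fixed target,
# standing in for a trained model conditioned on text and an event curve.
target = [1.0, -2.0, 0.5]
toy_velocity = lambda x, t: [(a - b) / (1.0 - t) for a, b in zip(target, x)]

sample = euler_sample(toy_velocity, noise=[0.3, 0.1, -0.7])
```

With this idealized straight-line field the trajectory never leaves the line between the starting noise and the target, so the final step lands on the target up to floating-point error. A learned field is only approximately straight, which is why the number of steps matters in practice.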

DiT: Diffusion Transformer—a neural network architecture that uses transformers instead of U-Nets for the backbone of diffusion-based generative models

intra-modal similarity: The similarity between data points (e.g., frames or audio segments) within the same modality, used here to detect structure regardless of content

zero-pair: A training setting where the model never sees paired examples of input (video) and output (music) together; it learns from independent datasets

FAD: Fréchet Audio Distance—a metric for evaluating the quality of generated audio by comparing statistics of embeddings

CLAP: Contrastive Language-Audio Pretraining, a model used to measure semantic similarity between audio and text

beat alignment: A metric measuring how well musical beats coincide with visual events like dance moves or scene cuts
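One formulation of such a score that is common in the dance-generation literature gives each visual event a Gaussian-weighted credit based on its distance to the nearest musical beat, then averages over events. The sketch below follows that idea. Whether the paper computes its Beat Align Score exactly this way is an assumption, and the function name, the use of seconds as the unit, and the sigma default are all hypothetical.

```python
import math

def beat_align_score(visual_events, music_beats, sigma=0.1):
    """Average Gaussian-weighted closeness of each visual event to its nearest beat.

    visual_events, music_beats: lists of timestamps (assumed here to be seconds).
    sigma: tolerance in the same unit; a hypothetical default, not from the paper.
    """
    if not visual_events or not music_beats:
        return 0.0
    total = 0.0
    for t in visual_events:
        nearest = min(abs(t - b) for b in music_beats)  # distance to closest beat
        total += math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))
    return total / len(visual_events)

aligned = beat_align_score([1.0, 2.0], [1.0, 2.0, 3.0])   # every event sits on a beat
off_beat = beat_align_score([1.5], [1.0, 2.0])            # event halfway between beats
```

A score of 1.0 means every visual event coincides with a beat, and the score decays toward 0 as events drift away from the nearest beat relative to sigma.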