mAVE: A Watermark for Joint Audio-Visual Generation Models

📝 Paper Summary

Generative Media Watermarking Audio-Visual Security

mAVE cryptographically binds audio and video noise latents at initialization using inverse transform sampling to prevent cross-modal manipulation in joint generation models.

Core Problem

Existing watermarking schemes treat audio and video independently, allowing adversaries to perform Swap Attacks where valid watermarked video is paired with malicious deepfake audio.

Why it matters:

Current detectors use logical disjunction (Video_wm OR Audio_wm), falsely authenticating manipulated media if one modality remains valid
Cross-session splicing attacks allow attackers to harvest benign video and harmful audio from different sessions to bypass stricter checks
Post-hoc synchronization verifiers are too brittle to reliably intercept deepfake voiceovers in open-domain scenarios

Concrete Example: An attacker generates a watermarked video of a politician in a benign session, then produces deepfake audio of a threat in a separate session, and combines the benign video with the threat audio. Because the video watermark is valid, a standard detector (checking Video OR Audio) authenticates the hybrid asset as genuine, which damages the vendor's reputation.
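The difference between these verification policies can be made concrete with a toy model. The sketch below is illustrative only: the `Recovered` record and the three detector functions are hypothetical names, the "stricter check" is assumed to be a logical AND, and a matching session label stands in for mAVE's actual binding test (audio noise consistent with a hash of the video noise).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recovered:
    """Hypothetical detector output: the session each watermark decodes to (None = no watermark)."""
    video_session: Optional[int]
    audio_session: Optional[int]


def or_detector(m: Recovered) -> bool:
    # Accept if either modality carries a valid watermark.
    return m.video_session is not None or m.audio_session is not None


def and_detector(m: Recovered) -> bool:
    # Stricter: both modalities must carry a valid watermark, checked independently.
    return m.video_session is not None and m.audio_session is not None


def bound_detector(m: Recovered) -> bool:
    # Stand-in for cross-modal binding: both watermarks must come from the same session.
    return m.video_session is not None and m.video_session == m.audio_session


genuine = Recovered(video_session=7, audio_session=7)
swap = Recovered(video_session=7, audio_session=None)  # deepfake audio, no watermark
splice = Recovered(video_session=7, audio_session=9)   # audio harvested from another session

for name, media in [("genuine", genuine), ("swap", swap), ("splice", splice)]:
    print(name, or_detector(media), and_detector(media), bound_detector(media))
```

In this toy model the OR policy accepts all three assets, the AND policy still accepts the cross-session splice, and only the bound policy rejects both attacks.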

Key Novelty

Manifold Audio-Visual Entanglement (mAVE)

Intervenes at the noise initialization stage by cryptographically binding the audio noise to a hash of the video noise
Constructs a 'Legitimate Entanglement Manifold' where valid audio-visual pairs must satisfy a strict mathematical relationship
Uses Inverse Transform Sampling to map these entangled binary constraints into continuous Gaussian noise without altering the generation quality

Architecture

The complete mAVE pipeline: from Bit Grid construction to Inverse Sampling and Joint Generation.

Evaluation Highlights

Achieves >99% binding integrity on state-of-the-art joint models (LTX-2, MOVA)
Provides a theoretical false-positive rate below 9.86 × 10^-11 against Swap Attacks (with N = 128 bits)
Is performance-lossless: the watermarked initialization is shown to be computationally indistinguishable from standard Gaussian sampling

Breakthrough Assessment

9/10

First native watermarking framework for joint audio-visual models that mathematically guarantees cross-modal binding, solving a critical vulnerability (Swap Attacks) that renders previous methods ineffective.

⚙️ Technical Details

Problem Definition

Setting: In-processing watermarking for joint audio-visual generative models via latent initialization intervention

Inputs: Private key K_priv, Session Index I, Text/Prompt input

Outputs: Watermarked Joint Media (Video + Audio) x_wm

Pipeline Flow

Key Generation: Derive session key K_sess from secret m and index I
Entanglement: Generate Video Grid -> Hash to get Audio Grid constraints
Embedding: Diffuse/Randomize Grids -> Inverse Transform Sampling to create latents z_v, z_a
Generation: Joint Model (LTX-2/MOVA) denoises z_v, z_a into final media

System Modules

Key Derivation

Generate session-specific keys to prevent replay/analysis attacks

Model or implementation: HMAC-SHA256
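A minimal sketch of per-session key derivation with HMAC-SHA256, using only the Python standard library. The function name, the 8-byte big-endian encoding of the session index, and the demo secret are assumptions of this example; the summary does not specify the message format.

```python
import hashlib
import hmac


def derive_session_key(master_secret: bytes, session_index: int) -> bytes:
    """Return a 32-byte session key computed as HMAC-SHA256(master_secret, index)."""
    # Assumption: the index is encoded as 8 big-endian bytes.
    message = session_index.to_bytes(8, "big")
    return hmac.new(master_secret, message, hashlib.sha256).digest()


k0 = derive_session_key(b"demo-master-secret", 0)
k1 = derive_session_key(b"demo-master-secret", 1)
print(k0.hex())
print(k1.hex())
```

Because each session index yields an unrelated key, watermark material observed in one session should not help an attacker predict another session's grids.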

Grid Generator

Create binary watermark grids where audio bits depend on video bits

Model or implementation: SHA-256 Hashing
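The dependency of audio bits on video bits might look like the following sketch. Everything beyond "hash the video grid with SHA-256" is an assumption made for illustration: the counter-mode expansion in `expand_bits`, the way the session key is mixed in, and the grid sizes are hypothetical, not the paper's construction.

```python
import hashlib


def expand_bits(seed: bytes, n_bits: int) -> list[int]:
    """Stretch a seed into n_bits pseudo-random bits by hashing seed || counter."""
    out: list[int] = []
    counter = 0
    while len(out) < n_bits:
        block = hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        for byte in block:
            for shift in range(8):
                out.append((byte >> shift) & 1)
        counter += 1
    return out[:n_bits]


def audio_grid_from_video(session_key: bytes, video_bits: list[int], n_audio_bits: int) -> list[int]:
    """Derive the audio grid from a hash of the video grid, keyed by the session key."""
    video_digest = hashlib.sha256(bytes(video_bits)).digest()
    return expand_bits(session_key + video_digest, n_audio_bits)


session_key = bytes(32)  # placeholder key for the demo
video_bits = expand_bits(b"video" + session_key, 256)
audio_bits = audio_grid_from_video(session_key, video_bits, 128)

# Changing a single video bit yields an unrelated audio grid.
tampered_video = [1 - video_bits[0]] + video_bits[1:]
tampered_audio = audio_grid_from_video(session_key, tampered_video, 128)
print(sum(a != b for a, b in zip(audio_bits, tampered_audio)), "of 128 audio bits differ")
```

A verifier that recovers both grids can recompute the audio grid from the recovered video grid and compare, which is what makes audio from another source detectable.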

Inverse Sampler

Map binary watermark grids to continuous Gaussian noise

Model or implementation: Inverse Probability Integral Transform (CDF^-1)
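One common way to realize this mapping, sketched here under stated assumptions, is to let each bit pick a half of the unit interval, draw a uniform value inside that half, and push it through the standard normal inverse CDF. If the bits are uniformly random, the resulting values are exactly standard normal, and the bit can be read back from the sign. Using one bit per latent element and sign-based recovery are assumptions of this example; the paper's grid layout and redundancy scheme are not reproduced.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def bits_to_gaussian(bits, rng):
    """Map each bit to a Gaussian sample: bit 0 -> negative half, bit 1 -> non-negative half."""
    latents = []
    for b in bits:
        u = (b + rng.random()) / 2.0            # uniform in [b/2, (b+1)/2)
        u = min(max(u, 1e-12), 1.0 - 1e-12)     # inv_cdf requires 0 < u < 1
        latents.append(_STD_NORMAL.inv_cdf(u))
    return latents


def gaussian_to_bits(latents):
    """Recover the bits from the sign of each latent element."""
    return [1 if z >= 0.0 else 0 for z in latents]


rng = random.Random(0)
bits = [rng.getrandbits(1) for _ in range(1000)]
latents = bits_to_gaussian(bits, rng)
recovered = gaussian_to_bits(latents)
print("round trip exact:", recovered == bits)
```

The marginal distribution of each latent is unchanged, which is the intuition behind the losslessness claim; in practice the bits are recovered from an inverted latent that carries some noise, so recovery is not exact as it is here.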

Joint Denoiser

Synthesize final media from entangled latents

Model or implementation: LTX-2 or MOVA (Asymmetric Bi-Transformers)

Novel Architectural Elements

Entangled Initialization Module: Replaces independent Gaussian sampling with a joint sampling process where audio noise is functionally dependent on video noise hash

Modeling

Base Model: LTX-2 and MOVA (Open-source Joint Audio-Visual Generation Models)

Training Method: Training-free Latent Initialization

Compute: Generation cost is identical to standard inference; detection requires a 5-step ODE inversion

Comparison to Prior Work

vs. AudioSeal + VideoShield: mAVE binds modalities at initialization; combined unimodal methods allow Swap Attacks because verification is independent (OR logic)
vs. SyncNet [not cited in paper]: mAVE provides cryptographic certainty of binding, whereas post-hoc sync checkers are heuristic and brittle
vs. Gaussian Shading: mAVE extends the concept to joint multimodal manifolds rather than single-image noise coupling

Limitations

Relies on the invertibility of the generation process (ODE solvers); stochastic samplers may degrade recovery
Requires access to the specific generation model parameters for detection (inversion)
Numerical discretization errors in ODE solving can cause minor bit drift (mitigated by redundancy)

Reproducibility

The method is applied to the open-source models LTX-2 and MOVA. The text does not state whether code is available. ChaCha20 and SHA-256 are standard cryptographic primitives.

📊 Experiments & Results

Evaluation Setup

Watermark embedding and recovery on joint audio-visual generation

Benchmarks:

LTX-2 (Joint Audio-Visual Generation)
MOVA (Joint Audio-Visual Generation)

Metrics:

Binding Integrity (%)
False Positive Rate (P_fp)
Bit Accuracy (BA)
Statistical methodology: Hoeffding's Inequality for security bounds
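The quoted false-positive figure can be reproduced from the one-sided Hoeffding bound, P(accuracy ≥ τ) ≤ exp(−2N(τ − 1/2)²), for N independent fair-coin bits. The sketch below assumes a bit-accuracy threshold of τ = 0.8; the summary does not state the threshold, and 0.8 is simply the value that yields 9.86e-11 at N = 128. The function names are hypothetical.

```python
import math


def bit_accuracy(recovered, expected):
    """Fraction of positions where the recovered bit matches the expected bit."""
    matches = sum(1 for r, e in zip(recovered, expected) if r == e)
    return matches / len(expected)


def hoeffding_false_positive_bound(n_bits: int, threshold: float) -> float:
    """Upper bound on the chance that random bits reach the accuracy threshold (threshold > 0.5)."""
    return math.exp(-2.0 * n_bits * (threshold - 0.5) ** 2)


bound = hoeffding_false_positive_bound(128, 0.8)  # assumed threshold, see lead-in
print(f"{bound:.3e}")

accuracy = bit_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(accuracy)
```

Under this reading, a swapped audio track whose bits are effectively random relative to the expected grid would need to match at least 80% of 128 bits by chance, which the bound puts below roughly 1e-10.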

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Theoretical Analysis	False Positive Probability (P_fp)	0.5	< 9.86e-11	Exponential reduction
LTX-2 / MOVA	Binding Integrity (%)	0	> 99	> +99

Experiment Figures

Concept comparison: Decoupled Watermarking (Left) vs. mAVE Entangled Watermarking (Right) under Swap Attacks.

Main Takeaways

mAVE effectively prevents Swap Attacks where unimodal watermarks fail entirely due to independent verification
The method is performance-lossless, meaning the watermarked initialization is computationally indistinguishable from standard Gaussian noise
Detection is efficient, requiring only ~5 ODE inversion steps due to the straight trajectory of Rectified Flow models
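Why straight trajectories make few-step inversion viable can be seen in a one-dimensional toy. The sketch below is not the paper's detector: it replaces the learned velocity network with a hand-written field for the affine map x1 = c·z + d (the constants `c` and `d` and all function names are hypothetical), for which every trajectory is exactly straight, so Euler integration is exact in both directions.

```python
def velocity(x: float, t: float, c: float = 2.0, d: float = 0.5) -> float:
    """Velocity of the straight path x_t = (1 - t) * z + t * (c * z + d), written in terms of x_t."""
    z = (x - t * d) / (1.0 - t + t * c)  # noise sample that this point came from
    return (c - 1.0) * z + d             # constant along each trajectory


def euler_generate(z: float, steps: int = 5) -> float:
    """Integrate from noise (t = 0) to data (t = 1) with forward Euler steps."""
    h = 1.0 / steps
    x = z
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x


def euler_invert(x1: float, steps: int = 5) -> float:
    """Integrate from data (t = 1) back to noise (t = 0) by reversing the Euler steps."""
    h = 1.0 / steps
    x = x1
    for k in range(steps, 0, -1):
        x = x - h * velocity(x, k * h)
    return x


z = -0.7
x1 = euler_generate(z)
z_rec = euler_invert(x1)
print(x1, z_rec)
```

With a learned velocity field the paths are only approximately straight, so the recovered noise carries discretization error; that is the source of the minor bit drift listed under Limitations.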

📚 Prerequisite Knowledge

Prerequisites

Denoising Diffusion Probabilistic Models (DDPM)
Rectified Flow / Flow Matching
Inverse Transform Sampling
Cryptographic Hashing (SHA-256)

Key Terms

Swap Attack: An adversarial manipulation where valid watermarked content from one modality (e.g., video) is paired with fake/unwatermarked content from another (e.g., audio)

Manifold: In this context, a specific subset of the joint latent space where valid, entangled audio-video pairs reside

Inverse Transform Sampling: A method for sampling from a probability distribution by applying its inverse cumulative distribution function (CDF) to uniform random numbers; used here to map bits to Gaussian noise

Rectified Flow: A generative model formulation that defines a straight path between noise and data, enabling efficient and invertible sampling via ODE solvers

ODE: Ordinary Differential Equation; used in diffusion and flow models to deterministically map noise to data and back

HMAC: Hash-based Message Authentication Code—a cryptographic construction for verifying integrity and authenticity using a secret key

Avalanche Effect: A property of cryptographic hashes where a small change in the input (even a single bit) flips roughly half of the output bits, so the outputs for nearby inputs appear statistically unrelated

Hoeffding's Inequality: A theorem providing an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a given amount; the bound decays exponentially in the number of variables