← Back to Paper List

mAVE: A Watermark for Joint Audio-Visual Generation Models

Luyang Si, Leyi Pan, Lijie Wen
arXiv (2026)
MM Speech Benchmark

📝 Paper Summary

Generative Media Watermarking Audio-Visual Security
mAVE cryptographically binds audio and video noise latents at initialization using inverse transform sampling to prevent cross-modal manipulation in joint generation models.
Core Problem
Existing watermarking schemes treat audio and video independently, allowing adversaries to perform Swap Attacks where valid watermarked video is paired with malicious deepfake audio.
Why it matters:
  • Current detectors use logical disjunction (Video_wm OR Audio_wm), falsely authenticating manipulated media if one modality remains valid
  • Cross-session splicing attacks allow attackers to harvest benign video and harmful audio from different sessions to bypass stricter checks
  • Post-hoc synchronization verifiers are too brittle to reliably intercept deepfake voiceovers in open-domain scenarios
Concrete Example: An attacker generates a watermarked video of a politician from a safe session. They then generate a deepfake audio threat from a separate session. They combine the safe video with the threat audio. Because the video watermark is valid, standard detectors (checking Video OR Audio) authenticate the hybrid asset as safe, destroying the vendor's reputation.
Key Novelty
Manifold Audio-Visual Entanglement (mAVE)
  • Intervenes at the noise initialization stage by cryptographically binding the audio noise to a hash of the video noise
  • Constructs a 'Legitimate Entanglement Manifold' where valid audio-visual pairs must satisfy a strict mathematical relationship
  • Uses Inverse Transform Sampling to map these entangled binary constraints into continuous Gaussian noise without altering the generation quality
Architecture
Architecture Figure Figure 2
The complete mAVE pipeline: from Bit Grid construction to Inverse Sampling and Joint Generation.
Evaluation Highlights
  • Achieves >99% binding integrity on state-of-the-art joint models (LTX-2, MOVA)
  • Provides a theoretical False Positive rate of < 9.86 * 10^-11 against Swap Attacks (with N=128 bits)
  • Maintains performance-losslessness, proving computational indistinguishability from standard Gaussian sampling
Breakthrough Assessment
9/10
First native watermarking framework for joint audio-visual models that mathematically guarantees cross-modal binding, solving a critical vulnerability (Swap Attacks) that renders previous methods ineffective.
×