
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Zichun Liao, Yusuke Kato, Kazuki Kozuka, Aditya Grover
Computer Vision and Pattern Recognition (2024)
MM Speech Pretraining

📝 Paper Summary

Multi-modal generation, Any-to-any generation
OmniFlow applies continuous rectified flow matching to multi-modal generation (text, image, audio), using a novel guidance scheme and model merging strategy to achieve high-quality any-to-any outputs.
Core Problem
Existing any-to-any generation models often struggle to balance different modalities (e.g., audio vs. image) and to choose a suitable modeling objective for mixed-modal data.
Why it matters:
  • Balancing inputs is crucial; otherwise, one modality can dominate the output or the output can degenerate
  • Directly fine-tuning base models on all tasks often leads to instability or poor performance due to unbalanced gradients
  • Aligning modeling choices (discrete vs. continuous) across modalities is non-trivial for unified systems
Concrete Example: When adding audio capabilities to an image model, simply lowering the learning rate or initializing randomly causes underperformance. OmniFlow instead merges models to stabilize training.
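The summary does not say how the merge is performed, so the following is only a minimal sketch of one common form of model merging: parameters that exist in both checkpoints are linearly interpolated, and parameters that exist in only one are copied over. The function name `merge_state_dicts`, the `alpha` weight, and the toy parameter names are all hypothetical and are not taken from the paper.

```python
def merge_state_dicts(base, other, alpha=0.5):
    """Merge two {name: list-of-floats} parameter dicts.

    ASSUMPTION: plain linear interpolation of shared parameters; this is an
    illustrative stand-in, not the merging rule used by OmniFlow.
    """
    merged = {}
    for name in sorted(set(base) | set(other)):
        if name in base and name in other:
            a, b = base[name], other[name]
            if len(a) != len(b):
                raise ValueError(f"shape mismatch for {name!r}")
            # Shared module: interpolate between the two checkpoints.
            merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
        else:
            # Modality-specific module: copy from whichever model has it.
            merged[name] = list(base[name] if name in base else other[name])
    return merged


# Toy checkpoints: both models share a text branch, each has one extra branch.
image_model = {"text.w": [1.0, 3.0], "image.w": [2.0]}
audio_model = {"text.w": [3.0, 5.0], "audio.w": [7.0]}
merged = merge_state_dicts(image_model, audio_model, alpha=0.5)
```

In this toy setup the merged model keeps the image and audio branches unchanged and averages only the shared text branch, so neither specialist starts from random weights.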
Key Novelty
OmniFlow (Multi-Modal Rectified Flows)
  • Extends the rectified flow formulation (used in SD3 for images) to audio and text modeling, finding it superior to discrete diffusion
  • Uses a novel multi-modal guidance scheme to balance inputs from different modalities, departing from prior work like CoDi
  • Employs a model merging strategy rather than direct SFT to add new capabilities, ensuring training stability and efficiency
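To make the first two points above concrete, here is a minimal sketch of the rectified flow objective in the SD3-style convention (data at t=0, noise at t=1, straight-line path between them), plus a generic per-modality guidance combination. The guidance function is an assumption: it is a simple additive, classifier-free-guidance-style rule with one scale per conditioning modality, not the specific scheme proposed in the paper. All function names are hypothetical.

```python
def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t=0) to noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity of the straight path: d x_t / dt = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def flow_matching_loss(predict_velocity, x0, noise, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, noise, t)
    pred = predict_velocity(xt, t)
    target = velocity_target(x0, noise)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, noise, steps=10):
    """Integrate from t=1 (noise) back to t=0 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = predict_velocity(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x


def guided_velocity(v_uncond, v_cond_by_modality, scales):
    """ASSUMPTION: additive guidance with one scale per input modality.

    v = v_uncond + sum_m scale_m * (v_cond_m - v_uncond)
    """
    out = list(v_uncond)
    for name, v_cond in v_cond_by_modality.items():
        w = scales[name]
        out = [o + w * (c - u) for o, c, u in zip(out, v_cond, v_uncond)]
    return out


# Toy check with an "oracle" model that already knows the true velocity.
x0 = [1.0, -2.0]
noise = [0.5, 0.25]
oracle = lambda xt, t: velocity_target(x0, noise)
loss = flow_matching_loss(oracle, x0, noise, t=0.3)
sample = euler_sample(oracle, noise, steps=10)

# Guidance example: text is weighted more strongly than audio.
v = guided_velocity(
    [0.0, 0.0],
    {"text": [1.0, 0.0], "audio": [0.0, 2.0]},
    {"text": 2.0, "audio": 0.5},
)
```

Because the target velocity is constant along a straight path, the oracle model has zero loss and Euler integration recovers the data sample up to floating-point error. Separate scales per modality are one simple way to keep a single input from dominating the result.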
Evaluation Highlights
  • Achieves a lower Fréchet Audio Distance (FAD; lower is better) for audio generation with the HiFiGen VAE (1.79) than with AudioMAE (2.03)
  • Matches SD3 performance on image generation quality according to ImageReward, outperforming base SDv1.5
  • Demonstrates that joint training boosts individual tasks: Image-to-Audio generation improves via high-quality Text-to-Audio data
Breakthrough Assessment
7/10
Strong empirical results on extending flow matching to multi-modal settings and practical insights on training stability (merging vs. SFT). Performance matches state-of-the-art specialist models like SD3.