OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

📝 Paper Summary

Multi-modal generation, Any-to-any generation

OmniFlow applies continuous rectified flow matching to multi-modal generation (text, image, audio), using a novel guidance scheme and model merging strategy to achieve high-quality any-to-any outputs.

Core Problem

Existing any-to-any generation models often struggle with balancing different modalities (e.g., audio vs. image) and determining the best modeling objectives for mixed-modal data.

Why it matters:

Balancing inputs is crucial; otherwise, one modality might dominate or degenerate
Directly fine-tuning base models on all tasks often leads to instability or poor performance due to unbalanced gradients
Aligning modeling choices (discrete vs. continuous) across modalities is non-trivial for unified systems

Concrete Example: When adding audio capabilities to an image model, neither fine-tuning with a lowered learning rate nor training from random initialization performs well. OmniFlow instead merges models to stabilize training.

Key Novelty

OmniFlow (Multi-Modal Rectified Flows)

Extends the rectified flow formulation (used in SD3 for images) to audio and text modeling, finding it superior to discrete diffusion
Uses a novel multi-modal guidance scheme to balance inputs from different modalities, departing from prior work like CoDi
Employs a model merging strategy rather than direct SFT to add new capabilities, ensuring training stability and efficiency
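
The summary does not spell out the guidance rule itself. As a rough illustration only, one way to balance several conditioning modalities is to extend classifier-free guidance with a separate weight per modality. The helper name `combine_guidance`, its arguments, and the additive form below are assumptions made for this sketch, not the paper's formulation.

```python
import numpy as np

def combine_guidance(v_uncond, v_cond_by_modality, weights):
    """Blend velocity predictions using one guidance weight per modality.

    Hypothetical rule (an assumption, not taken from the paper):
        v = v_uncond + sum_m w_m * (v_cond_m - v_uncond)
    """
    base = np.asarray(v_uncond, dtype=float)
    v = base.copy()
    for name, v_cond in v_cond_by_modality.items():
        # Each modality pushes the prediction away from the unconditional one.
        v += weights[name] * (np.asarray(v_cond, dtype=float) - base)
    return v

# Toy usage: text conditioning is weighted more heavily than audio.
v = combine_guidance(
    np.zeros(3),
    {"text": np.ones(3), "audio": 2.0 * np.ones(3)},
    {"text": 2.0, "audio": 0.5},
)
```

Setting every weight to zero recovers the unconditional prediction, which makes the per-modality weights an explicit knob for how strongly each input steers the output.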

Evaluation Highlights

Achieves a lower FAD for audio generation with the HiFiGen VAE (1.79) than with AudioMAE (2.03); lower is better
Matches SD3 performance on image generation quality according to ImageReward, outperforming base SDv1.5
Demonstrates that joint training boosts individual tasks: Image-to-Audio generation improves via high-quality Text-to-Audio data

Breakthrough Assessment

7/10

Strong empirical results on extending flow matching to multi-modal settings, plus practical insights on training stability (merging vs. SFT). Image generation quality matches state-of-the-art specialist models like SD3.

⚙️ Technical Details

Problem Definition

Setting: Any-to-any multi-modal generation

Inputs: Combinations of text, image, and audio

Outputs: Generated text, image, or audio

Pipeline Flow

Input Encoders (Image/Audio/Text) → QFormer (for text) → Joint DiT Backbone → Output Decoders (VAE/Vocoder)

System Modules

Audio VAE (Input/Output Processing)

Compress audio into latent space and reconstruct it

Model or implementation: HiFiGen (AudioLDM2 checkpoint)

Image Encoder (Input/Output Processing)

Compress images into latent space

Model or implementation: ConvNet (similar to the SDXL VAE but with 16 latent channels)

Text Encoder / Adapter

Convert text embeddings to fixed-length latents

Model or implementation: QFormer
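
To make the "variable-length in, fixed-length out" idea concrete, here is a minimal single-head cross-attention sketch in NumPy. It is only the core mechanism: a real QFormer stacks several transformer layers with learned projections, and the function name `query_pool`, the shapes, and the random "learned" queries are assumptions for illustration.

```python
import numpy as np

def query_pool(token_embeddings, queries):
    """Map (L, d) token embeddings to (K, d) latents via cross-attention.

    K is fixed by the number of query vectors, so the output shape does not
    depend on the input sequence length L.
    """
    d = queries.shape[1]
    scores = queries @ token_embeddings.T / np.sqrt(d)      # (K, L)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)           # softmax over tokens
    return attn @ token_embeddings                          # (K, d)

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))        # stand-in for 8 learned query vectors
short_text = rng.normal(size=(5, 16))     # 5 tokens
long_text = rng.normal(size=(40, 16))     # 40 tokens
short_latents = query_pool(short_text, queries)
long_latents = query_pool(long_text, queries)
```

Both calls return an array of shape (8, 16), which is what lets a downstream backbone treat text latents like any other fixed-size input.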

Novel Architectural Elements

Adoption of continuous rectified flow matching for audio and text in a joint multi-modal setting
QFormer usage specifically to adapt variable T5 text embeddings for continuous latent space modeling

Modeling

Base Model: Largely adopts the design of SD3 (Stable Diffusion 3)

Training Method: Model merging followed by joint training
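
The summary does not state which merging rule is used. A common and simple form of model merging is linear interpolation of parameters that share a name, keeping parameters that exist in only one model. The sketch below shows that generic scheme under those assumptions; `merge_state_dicts` and `alpha` are hypothetical names, and plain NumPy arrays stand in for framework tensors.

```python
import numpy as np

def merge_state_dicts(state_a, state_b, alpha=0.5):
    """Merge two parameter dictionaries.

    Shared parameters become (1 - alpha) * a + alpha * b.
    Parameters present in only one model are copied unchanged.
    """
    merged = {}
    for name in sorted(set(state_a) | set(state_b)):
        if name in state_a and name in state_b:
            merged[name] = (1.0 - alpha) * state_a[name] + alpha * state_b[name]
        elif name in state_a:
            merged[name] = state_a[name].copy()
        else:
            merged[name] = state_b[name].copy()
    return merged

# Toy usage: one shared weight plus one modality-specific weight per model.
image_model = {"shared.w": np.array([0.0, 2.0]), "image.w": np.array([1.0])}
audio_model = {"shared.w": np.array([2.0, 4.0]), "audio.w": np.array([5.0])}
merged = merge_state_dicts(image_model, audio_model, alpha=0.25)
```

Starting joint training from merged weights, rather than from a single base model or from scratch, is the kind of initialization the summary credits with stabilizing training.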

Objective Functions:

Purpose: Generate data from noise.

Formally: Continuous rectified flow matching objective (implied from context)
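
Since the summary only implies the objective, the usual single-modality form of the rectified flow matching loss is given here for reference. The notation is an assumption: $x_0$ is a data latent, $\epsilon$ is Gaussian noise, $t$ is the interpolation time, and $v_\theta$ is the learned velocity network. Any per-modality weighting or timestep scheme specific to OmniFlow is not shown.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0, 1] \\
\mathcal{L}(\theta) &= \mathbb{E}_{x_0,\,\epsilon,\,t}
  \left[ \left\lVert v_\theta(x_t, t) - (\epsilon - x_0) \right\rVert_2^2 \right]
\end{align}
```

The regression target $\epsilon - x_0$ is the constant velocity of the straight line from data to noise, which is what makes the learned paths "rectified".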

Training Data:

Uses a mix of datasets balanced by heuristics (CLIP, Aesthetic scores)
Uses generated (synthetic) text-image-audio triplets for multi-modal alignment

Compute: Not reported in the paper

Comparison to Prior Work

vs. CoDi: OmniFlow uses generated triplets for better multi-modal alignment and a novel guidance scheme
vs. AudioLDM2: OmniFlow integrates audio generation into a unified any-to-any framework rather than being audio-specific
vs. SD3: OmniFlow extends the rectified flow and architecture to audio and text modalities

Limitations

Underperforms specialist MLLMs fine-tuned on specific target datasets (e.g., SLAM-AAC)
Requires generated triplets for training, unlike methods using only paired data
Training stability requires model merging; direct fine-tuning is unstable

Reproducibility

Code and training data will be released. The paper uses public checkpoints for components such as the AudioLDM2 HiFiGen VAE. The image encoder follows the SD3 architecture.

📊 Experiments & Results

Evaluation Setup

Multi-modal generation evaluation across audio and image tasks

Benchmarks:

COCO (Image Generation)
SLAM-AAC (Audio Captioning / Generation)

Metrics:

FAD (Fréchet Audio Distance)
ImageReward
Aesthetic Score

Statistical methodology: Not explicitly reported in the paper

Key Results

Ablation on audio VAE choice demonstrates the superiority of HiFiGen over AudioMAE for this pipeline (lower FAD is better).

Benchmark	Metric	Baseline	This Paper	Δ
Audio Generation Benchmark	FAD	2.03 (AudioMAE)	1.79 (HiFiGen)	-0.24

Image generation quality comparisons show OmniFlow is competitive with state-of-the-art specialist models.

Benchmark	Metric	Baseline	This Paper	Δ
Image Generation (General)	ImageReward	Not reported in the paper	Not reported in the paper	0
Audio-to-Image (A2I)	Aesthetic Score	Not reported in the paper	Not reported in the paper	+1.22

Main Takeaways

Continuous rectified flow is effective for audio/text generation, not just images (matches SD3 design)
Model merging is needed for stable any-to-any training; direct SFT and training from random initialization both underperform
Joint training benefits individual tasks: T2I data improves A2I generation quality despite low-quality A2I training data
Multi-modal guidance and generated triplets are critical for high-quality generation and alignment

📚 Prerequisite Knowledge

Prerequisites

Rectified Flow Matching
Diffusion Models (Latent Diffusion)
VAE (Variational Autoencoder)
Multi-modal learning

Key Terms

Rectified Flows: A generative model class that learns straight paths between noise and data distributions, often allowing for fewer sampling steps than standard diffusion
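
As an illustration of why straight paths permit few sampling steps, the sketch below integrates a velocity field with plain Euler steps from noise (t = 1) to data (t = 0). It assumes the convention x_t = (1 - t) * x_0 + t * noise, so the ideal velocity is noise - x_0. The names `euler_sample` and `velocity_fn` are hypothetical, and the toy velocity field stands in for a trained network.

```python
import numpy as np

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) backwards from t = 1 (noise) to t = 0 (data)."""
    x = np.array(noise, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt                 # t runs 1, 1 - dt, ..., dt (never 0)
        x = x - dt * velocity_fn(x, t)   # one Euler step toward the data end
    return x

# Toy check: if all data sits at a single point, the exact velocity field is
# (x_t - target) / t, and the path from any noise sample is a straight line.
target = np.array([1.0, -2.0])
exact_velocity = lambda x, t: (x - target) / t
sample = euler_sample(exact_velocity, noise=np.array([0.3, 0.7]), num_steps=4)
```

Because the path in this toy case is exactly straight, four Euler steps already land on the target; curved diffusion trajectories generally need more steps for the same accuracy.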

VAE: Variational Autoencoder—a neural network that compresses data into a lower-dimensional latent space

FAD: Fréchet Audio Distance—a metric for evaluating the quality of generated audio by comparing its statistics to real audio
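
For intuition, the Fréchet distance between two Gaussians fitted to embedding statistics can be computed as in the sketch below. FAD applies this distance to embeddings of real and generated audio taken from a pretrained audio model; that embedding step is not shown, and the function name `frechet_distance` is an assumption for this example.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 * Tr((sigma1 sigma2)^(1/2))
    The last trace is computed from the eigenvalues of sigma1 @ sigma2, which
    are real and non-negative for positive semi-definite covariances.
    """
    mu1, mu2 = np.asarray(mu1, dtype=float), np.asarray(mu2, dtype=float)
    sigma1, sigma2 = np.asarray(sigma1, dtype=float), np.asarray(sigma2, dtype=float)
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    # Clip tiny negative values caused by floating-point error.
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

identity = np.eye(2)
same_stats = frechet_distance([0.0, 0.0], identity, [0.0, 0.0], identity)
shifted_mean = frechet_distance([0.0, 0.0], identity, [3.0, 4.0], identity)
```

Identical statistics give a distance of zero, and the score grows as the mean or covariance of the generated embeddings drifts from the reference set, which is why lower FAD is better.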

HiFiGen: A specific VAE and vocoder architecture used for high-fidelity audio generation

QFormer: Querying Transformer—a module that converts variable-length embeddings (like text from T5) into fixed-length latent representations

SFT: Supervised Fine-Tuning—training a model on labeled data

SD3: Stable Diffusion 3—a state-of-the-art image generation model using rectified flows

ImageReward: A metric trained on human preferences to evaluate image generation quality, considered more aligned with human judgment than FID