MoD-DPO: Towards Mitigating Cross-modal Hallucinations in Omni LLMs using Modality Decoupled Preference Optimization

📝 Paper Summary

Multimodal Large Language Models (Omni-LLMs) · Hallucination Mitigation · Direct Preference Optimization (DPO)

MoD-DPO reduces cross-modal hallucinations in omni-modal LLMs by explicitly decoupling modalities during preference optimization, enforcing invariance to irrelevant modality corruption and sensitivity to relevant modality corruption.

Core Problem

Omni-modal LLMs suffer from cross-modal hallucinations (e.g., hearing imaginary sounds from visual cues) due to spurious inter-modality correlations and over-reliance on strong language priors.

Why it matters:

Models hallucinate non-existent events (e.g., 'hearing' a dog because one is visible) when audio/video evidence is weak or asynchronous.
Existing multimodal DPO methods do not explicitly decouple modality pathways, allowing models to retain latent language-only shortcuts.
Reliable audiovisual understanding is critical for agents that must 'see and listen' accurately before reasoning.

Concrete Example: When an omni-LLM sees a video of a dog but the audio is silent, it might hallucinate a 'barking' sound solely because dogs usually bark (spurious correlation/language prior). MoD-DPO prevents this by ensuring the model's audio prediction doesn't change if the video is corrupted (invariance).

Key Novelty

Modality-Decoupled Direct Preference Optimization (MoD-DPO)

Adds regularization terms to the DPO objective that force the model's output to stay stable when the *irrelevant* modality is corrupted (invariance).
Forces the model's output to shift significantly when the *relevant* modality is corrupted (sensitivity), ensuring true grounding.
Incorporates a Language Prior Debiasing (LPD) penalty that specifically reduces rewards for responses that could be generated by text priors alone.
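
The invariance and sensitivity ideas above can be illustrated with a small diagnostic. This is a minimal sketch, not the paper's procedure: the use of KL divergence as the gap measure and all probability values are assumptions chosen for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same answer set."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical answer distributions over ["barking", "silence"] for an AUDIO question.
clean = [0.10, 0.90]            # clean audio + clean video
video_corrupted = [0.12, 0.88]  # irrelevant modality (video) corrupted
audio_corrupted = [0.55, 0.45]  # relevant modality (audio) corrupted

# A well-grounded model should show a small invariance gap and a large sensitivity gap.
invariance_gap = kl_divergence(clean, video_corrupted)
sensitivity_gap = kl_divergence(clean, audio_corrupted)
print(f"invariance gap:  {invariance_gap:.4f}")
print(f"sensitivity gap: {sensitivity_gap:.4f}")
```

In this toy case the answer barely moves when the video is corrupted and moves a lot when the audio is corrupted, which is the behavior the two regularizers are meant to encourage.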

Architecture

Comparison of Vanilla DPO vs. MoD-DPO objectives. It visualizes how MoD-DPO splits the input into relevant/irrelevant modalities and applies specific regularization.

Evaluation Highlights

+27.0 percentage points of accuracy (46.2 to 73.2) on the audiovisual matching task in AVHBench compared to the reference model (Qwen 2.5 Omni).
Outperforms baselines (Vanilla DPO, OmniDPO, V-DPO) on CMM benchmark for both perception accuracy and hallucination resistance.
Achieves these gains while maintaining general audiovisual capabilities on benchmarks like MVBench and MMAU.

Breakthrough Assessment

8/10

Strong methodological contribution: the paper derives a closed-form solution for modality-decoupled DPO. Large empirical gains on targeted hallucination tasks (+27.0 accuracy points on AVHBench AV-Matching) address a critical failure mode of multimodal models.

⚙️ Technical Details

Problem Definition

Setting: Preference optimization for multimodal LLMs handling Audio (a), Video (v), and Text (x) inputs.

Inputs: Multimodal prompt containing audio, video, and text instructions.

Outputs: Text response y grounded in the specific relevant modality (audio or video).

Pipeline Flow

Input Processing: Separate Audio/Video
Data Generation: Create corrupted/hard-negative samples
Training: MoD-DPO Optimization Loop

System Modules

Policy Model

The multimodal LLM being aligned

Model or implementation: Qwen 2.5 Omni (7B) or MiniCPM-O 2.6 (8B)

Corrupter

Generates perturbed inputs to enforce invariance/sensitivity

Model or implementation: Diffusion-based noise or random segment swapping
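
Of the two corruption strategies named above, random segment swapping is the easier one to sketch. The version below is an illustrative assumption, not the paper's implementation: the function name `swap_random_segment`, the choice to take the replacement segment from a different clip, and the frame representation are all hypothetical.

```python
import random

def swap_random_segment(target, donor, segment_len, rng):
    """Return a copy of `target` with one contiguous segment replaced by donor frames."""
    if segment_len > min(len(target), len(donor)):
        raise ValueError("segment_len exceeds sequence length")
    start_t = rng.randrange(len(target) - segment_len + 1)  # where to overwrite
    start_d = rng.randrange(len(donor) - segment_len + 1)   # where to read from
    corrupted = list(target)                                # leave the input untouched
    corrupted[start_t:start_t + segment_len] = donor[start_d:start_d + segment_len]
    return corrupted

rng = random.Random(0)
audio_a = [0.0] * 16  # stand-in for audio frames of the sample being corrupted
audio_b = [1.0] * 16  # stand-in for frames from an unrelated clip
corrupted_audio = swap_random_segment(audio_a, audio_b, segment_len=4, rng=rng)
print(corrupted_audio)
```

The same helper could be applied to video frames; only the modality being corrupted (relevant or irrelevant to the question) changes which regularizer the corrupted input feeds.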

Novel Architectural Elements

Loss Function Architecture: Incorporates closed-form optimal policy derivation including terms for irrelevant-modality invariance and relevant-modality sensitivity.
Language Prior Debiasing term specifically added to the reward formulation.

Modeling

Base Model: Qwen 2.5 Omni (7B) and MiniCPM-O 2.6 (8B)

Training Method: Modality-Decoupled Direct Preference Optimization (MoD-DPO)

Objective Functions:

Purpose: Optimize preferences while enforcing modality grounding.

Formally: Maximizes a reward r(x,y) derived from the optimal policy, which balances the standard DPO KL constraint with invariance (stability under corruption of the irrelevant modality) and sensitivity (a shift in output under corruption of the relevant modality).

Purpose: Penalize language-only hallucinations.

Formally: Subtracts γ_LPD * log(π_text(y|x)) from the reward to discourage text-only priors.
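
The two objectives above can be sketched in plain Python. The standard DPO preference loss and the subtraction of γ_LPD * log(π_text(y|x)) follow the description above; the way the invariance and sensitivity gaps are added to the loss is an assumption made for illustration, since the paper's closed-form objective is not reproduced in this summary. All log-probability values are made up.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def lpd_reward(logp_policy, logp_ref, logp_text_only, beta=0.1, gamma_lpd=0.05):
    """Implicit DPO reward minus the language-prior penalty gamma_LPD * log pi_text(y|x)."""
    return beta * (logp_policy - logp_ref) - gamma_lpd * logp_text_only

def mod_dpo_style_loss(chosen, rejected, inv_gap, sens_gap,
                       beta=0.1, gamma_lpd=0.05, beta_inv=0.02, beta_sens=0.05):
    """Preference loss plus hypothetical invariance/sensitivity regularizers.

    ASSUMPTION: inv_gap and sens_gap are non-negative divergences between the
    policy's outputs on clean vs. corrupted inputs, combined additively here.
    """
    margin = (lpd_reward(**chosen, beta=beta, gamma_lpd=gamma_lpd)
              - lpd_reward(**rejected, beta=beta, gamma_lpd=gamma_lpd))
    preference_loss = -log_sigmoid(margin)  # standard DPO form
    return preference_loss + beta_inv * inv_gap - beta_sens * sens_gap

# Sequence log-probabilities under the policy, the reference, and a text-only pass.
chosen = dict(logp_policy=-12.0, logp_ref=-14.0, logp_text_only=-30.0)
rejected = dict(logp_policy=-15.0, logp_ref=-13.0, logp_text_only=-8.0)
loss = mod_dpo_style_loss(chosen, rejected, inv_gap=0.002, sens_gap=0.45)
print(f"loss: {loss:.4f}")
```

Note that the rejected response is likely under the text-only prior (log-probability -8.0), so the penalty lowers its reward relative to what the plain DPO term alone would give. A real implementation would likely bound the sensitivity term so it cannot dominate the loss.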

Adaptation: Full fine-tuning (implied by DPO context)

Training Data:

18.1k automatically generated preference samples spanning 10.8k unique videos.
Sources: MSR-VTT, VALOR32K, AudioCaps.
Negatives generated by including spurious information from the irrelevant modality (e.g., audio details in a visual question).
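
One way to picture a preference sample built this way is the record below. The field names, the example id, and the example texts are hypothetical; only the idea that the rejected answer leaks details from the irrelevant modality comes from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferenceSample:
    video_id: str
    question: str
    relevant_modality: str  # "audio" or "video": the modality the question is about
    chosen: str             # answer grounded only in the relevant modality
    rejected: str           # answer polluted with irrelevant-modality details

    def __post_init__(self):
        if self.relevant_modality not in ("audio", "video"):
            raise ValueError("relevant_modality must be 'audio' or 'video'")

    @property
    def irrelevant_modality(self):
        return "audio" if self.relevant_modality == "video" else "video"

sample = PreferenceSample(
    video_id="example_0001",  # hypothetical identifier
    question="What is visible in the video?",
    relevant_modality="video",
    chosen="A dog is sitting on a couch.",
    rejected="A dog is sitting on a couch and barking loudly.",  # leaks an audio detail
)
print(sample.irrelevant_modality)
```

With 18.1k samples over 10.8k unique videos, the dataset averages roughly 1.7 preference samples per video.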

Key Hyperparameters:

learning_rate: 3e-7
batch_size: 1 per GPU
epochs: 1 (MoD-DPO) vs 4 (Baselines) to match compute
beta: 0.1
beta_sens: 0.05
beta_inv: 0.02
gamma_LPD: 0.05
gpu_config: 8 H100 GPUs
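
For reference, the reported hyperparameters can be collected in a single config object. The class name, field names, and the `grad_accum_steps` value are assumptions (gradient accumulation is not reported here), so the derived global batch size and step count are only indicative.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ModDpoConfig:
    learning_rate: float = 3e-7
    per_gpu_batch_size: int = 1
    num_gpus: int = 8            # 8 H100 GPUs
    epochs: int = 1              # baselines use 4 epochs to match compute
    beta: float = 0.1
    beta_sens: float = 0.05
    beta_inv: float = 0.02
    gamma_lpd: float = 0.05
    grad_accum_steps: int = 1    # ASSUMPTION: not reported in this summary

    @property
    def global_batch_size(self):
        return self.per_gpu_batch_size * self.num_gpus * self.grad_accum_steps

    def steps_per_epoch(self, num_samples):
        return math.ceil(num_samples / self.global_batch_size)

cfg = ModDpoConfig()
print(cfg.global_batch_size, cfg.steps_per_epoch(18_100))
```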

Compute: MoD-DPO trains for 1 epoch versus 4 epochs for the baselines so that the total compute budget is matched. The extra per-step cost comes from additional forward passes (gradient-free) on the corrupted inputs.

Comparison to Prior Work

vs. OmniDPO/V-DPO: MoD-DPO derives a closed-form solution that explicitly includes invariance/sensitivity terms in the objective, rather than just using data augmentation or simple preference pairs.
vs. All: Explicitly penalizes the language prior (LPD) within the reward function formulation.
vs. VCD (Visual Contrastive Decoding): MoD-DPO is a training-time alignment method that changes internal decision boundaries, whereas VCD is a decoding-time defense.

Limitations

Requires additional forward passes for corrupted inputs during training, increasing computational overhead per iteration.
Relies on synthetic data generation pipelines (GPT-4o, etc.) which may introduce their own biases.
Performance gains on general benchmarks (MVBench/MMAU) are modest compared to the large gains on hallucination benchmarks.

Reproducibility

Code: https://github.com/mod-dpo/mod-dpo.github.io

Publicly available project page. Detailed hyperparameters provided. Preference dataset generation pipeline fully described (using GPT-4o, AudioFlamingo 3, RAM++). Code modifications to LLaMA-Factory mentioned.

📊 Experiments & Results

Evaluation Setup

Post-training evaluation on hallucination and general perception benchmarks.

Benchmarks:

AVHBench (Audiovisual Hallucination Evaluation)
CMM (Curse of Multi-Modalities) (Hallucination & Prior Reliance)
DailyOmni/MVBench/MMAU (General Audio/Visual Understanding)

Metrics:

Accuracy
F1 Score
Perception Accuracy (PA)
Hallucination Resistance (HR)

Statistical methodology: Not explicitly reported in the paper
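
The metrics listed above can be computed from yes/no probing answers as sketched below. Accuracy, precision, recall, and F1 are standard. Treating PA as accuracy on items where the event is present and HR as accuracy on items where it is absent is an assumption about the CMM protocol, and the labels and predictions are made up.

```python
def binary_metrics(labels, preds):
    """labels/preds: lists of bools, True means 'the event or object is present'."""
    tp = sum(l and p for l, p in zip(labels, preds))
    fp = sum((not l) and p for l, p in zip(labels, preds))
    fn = sum(l and (not p) for l, p in zip(labels, preds))
    tn = sum((not l) and (not p) for l, p in zip(labels, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "pa": recall,                               # ASSUMPTION: correct on present items
        "hr": tn / (tn + fp) if tn + fp else 0.0,   # ASSUMPTION: correct on absent items
    }

labels = [True, True, True, False, False, False, False, True]
preds = [True, True, False, False, True, True, False, True]
metrics = binary_metrics(labels, preds)
print(metrics)
```

A model that answers "yes" too readily scores well on PA and poorly on HR, which is why the two are reported together.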

Key Results

MoD-DPO significantly outperforms baselines on the AVHBench hallucination benchmark, particularly in audiovisual matching.
On the CMM benchmark, MoD-DPO reduces reliance on priors and improves hallucination resistance.

Benchmark	Metric	Baseline	This Paper	Δ
AVHBench (AV-Matching)	Accuracy	46.2	73.2	+27.0
AVHBench (Overall)	F1	51.1	57.7	+6.6
CMM (Overall)	Hallucination Resistance (HR)	39.1	43.3	+4.2
AVHBench	Recall	50.5	55.8	+5.3
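
The Δ column can be re-derived from the Baseline and This Paper columns. The short check below uses only the numbers in the table and also reports the relative gain, since absolute points and relative percent are easy to confuse; the helper name `gains` is illustrative.

```python
rows = [
    ("AVHBench (AV-Matching)", "Accuracy", 46.2, 73.2),
    ("AVHBench (Overall)", "F1", 51.1, 57.7),
    ("CMM (Overall)", "HR", 39.1, 43.3),
    ("AVHBench", "Recall", 50.5, 55.8),
]

def gains(baseline, ours):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = round(ours - baseline, 1)
    relative = round(100 * (ours - baseline) / baseline, 1)
    return absolute, relative

for benchmark, metric, baseline, ours in rows:
    absolute, relative = gains(baseline, ours)
    print(f"{benchmark:24s} {metric:9s} +{absolute} points ({relative}% relative)")
```

For AV-Matching this gives 27.0 points absolute, which is about 58.4% relative to the 46.2 baseline.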

Experiment Figures

Radar charts comparing MoD-DPO against baselines (OmniDPO, V-DPO, etc.) across various hallucination metrics on AVHBench and CMM.

Main Takeaways

MoD-DPO consistently improves both perception accuracy and hallucination resistance across multiple benchmarks compared to standard DPO and multimodal DPO variants.
The Language Prior Debiasing (LPD) penalty is critical for reducing language dominance; removing it significantly drops performance on language-prior specific tasks.
Using diffusion-based noise for corruption yields better performance than random swapping or simple Gaussian noise.
Under a matched compute budget, MoD-DPO achieves better results in 1 training epoch than the baselines do in 4.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Multimodal LLM architectures (e.g., Qwen-Omni)
KL Divergence
Hallucination in VLM/MLLMs

Key Terms

DPO: Direct Preference Optimization, a method that aligns a model to preference data without training a separate reward model.

Omni-LLM: A large language model capable of processing and reasoning over text, audio, image, and video simultaneously.

Cross-modal hallucination: When a model perceives entities in one modality (e.g., audio) solely because they appear in another (e.g., video), without actual evidence.

Language Prior Debiasing (LPD): A penalty term introduced to suppress model responses that are driven purely by the language model's internal statistical priors rather than sensory input.

Invariance: The property where the model's prediction remains stable even if the irrelevant modality (e.g., audio for a visual question) is corrupted.

Sensitivity: The property where the model's prediction changes drastically if the relevant modality (e.g., video for a visual question) is corrupted.