Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning

📝 Paper Summary

Supervised Multi-modal Learning · Probabilistic Graphical Models · Model Ensembling / Fusion

I2M2 improves multi-modal learning by decomposing predictions into separate uni-modal and joint classifiers, ensuring the model captures both individual modality signals and cross-modality interactions.

Core Problem

Existing multi-modal methods typically focus on either fusing modalities (inter-dependency) or processing them independently (intra-dependency), often failing when the dataset's dominant signal type doesn't match the model's assumption.

Why it matters:

Multi-modal models sometimes paradoxically underperform simple uni-modal baselines when cross-modality interactions are weak or noisy.
Prior approaches lack a principled framework to explain performance discrepancies across different dataset types (e.g., healthcare vs. vision-language).
Real-world tasks vary unpredictably in their reliance on single-modality evidence versus complex cross-modal reasoning.

Concrete Example: In 'Tiger Detection', seeing a tiger shape (intra-modality) is sufficient for prediction regardless of texture, requiring strong intra-modality modeling. Conversely, in NLVR2, an image and text must be compared (inter-modality) to verify truthfulness. Models focusing on only one dependency type fail on the other task.

Key Novelty

Inter- & Intra-Modality Modeling (I2M2)

Views multi-modal data generation as a process where a label generates both individual modalities (intra) and a selection variable that governs their interaction (inter).
Decomposes the prediction into three explicit components: a classifier for Modality A, a classifier for Modality B, and a joint classifier for the (A, B) pair.
Combines these components via a Product of Experts (summing logits), allowing the system to dynamically leverage whichever dependency type is strongest for the specific datapoint.
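
Written out, the combination in the last point can be sketched as follows. The notation is assumed for illustration and may differ from the paper's: x and x' are the inputs from the two modalities, q_x, q_{x'} and q_{x,x'} are the predictive distributions of the two uni-modal classifiers and the joint classifier, and Z(x, x') is the normalizer obtained by summing the product over all labels.

```latex
\[
p(y \mid x, x') \;\propto\; q_{x}(y \mid x)\; q_{x'}(y \mid x')\; q_{x,x'}(y \mid x, x')
\]
\[
\log p(y \mid x, x') \;=\; \log q_{x}(y \mid x) + \log q_{x'}(y \mid x') + \log q_{x,x'}(y \mid x, x') - \log Z(x, x')
\]
```

An expert that is close to uniform over labels adds a near-constant term to the sum, so it has little effect on the final prediction; this is how the combination can lean on whichever component is most informative for a given datapoint.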

Architecture

A comparison of generative processes (graphical models) for multi-modal learning: (a) The proposed joint model with selection variable v, (b) Inter-modality only model, and (c) Intra-modality only model.

Breakthrough Assessment

7/10

Provides a theoretically grounded explanation for common multi-modal failure modes and a robust, unified framework (I2M2) to address them. The approach is architecture-agnostic and principled.

⚙️ Technical Details

Problem Definition

Setting: Supervised classification mapping multiple input modalities to a target label

Inputs: Data from modality 1 (x) and modality 2 (x')

Outputs: Target label y

Pipeline Flow

Uni-modal Classifier 1 (x -> y)
Uni-modal Classifier 2 (x' -> y)
Multi-modal Classifier (x, x' -> y)
Log-Probability Aggregator (Product of Experts)
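
The four stages above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the three classifiers are hypothetical stand-ins (random linear maps that emit logits), and the names `log_softmax`, `i2m2_predict`, `W_x`, `W_xp` and `W_joint` are invented for this example.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def i2m2_predict(logits_x, logits_xp, logits_joint):
    """Sum the three experts' log-probabilities, then renormalize."""
    combined = (
        log_softmax(logits_x)
        + log_softmax(logits_xp)
        + log_softmax(logits_joint)
    )
    return np.exp(log_softmax(combined))


# Hypothetical stand-in classifiers: random linear maps producing logits.
rng = np.random.default_rng(0)
n, d_x, d_xp, n_classes = 4, 8, 6, 3
x = rng.normal(size=(n, d_x))      # modality 1 features
xp = rng.normal(size=(n, d_xp))    # modality 2 features
W_x = rng.normal(size=(d_x, n_classes))
W_xp = rng.normal(size=(d_xp, n_classes))
W_joint = rng.normal(size=(d_x + d_xp, n_classes))

logits_x = x @ W_x                                       # uni-modal classifier 1
logits_xp = xp @ W_xp                                    # uni-modal classifier 2
logits_joint = np.concatenate([x, xp], axis=1) @ W_joint  # joint classifier

probs = i2m2_predict(logits_x, logits_xp, logits_joint)
print(probs.sum(axis=1))  # each row is a normalized distribution over labels
```

Because log-softmax only shifts each expert's logits by a per-example constant, summing the raw logits and applying a single softmax yields the same normalized prediction as summing log-probabilities.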

System Modules

Uni-modal Classifier A (Intra-modality Modeling)

Estimates the probability of the label given only the first modality

Model or implementation: Task-dependent (e.g., CNN or Transformer)

Uni-modal Classifier B (Intra-modality Modeling)

Estimates the probability of the label given only the second modality

Model or implementation: Task-dependent (e.g., CNN or Transformer)

Multi-modal Classifier

Estimates the probability of the label given the interaction of both modalities

Model or implementation: Task-dependent Fusion Model (Early or Intermediate Fusion)

Aggregator

Combines predictions from all experts

Model or implementation: Summation

Novel Architectural Elements

Triple-pathway inference: explicitly running isolated uni-modal models alongside a joint model and summing their log-probabilities
Generative-model-derived ensemble structure specifically targeting the 'selection variable' decomposition

Modeling

Base Model: Task-dependent. Not explicitly reported in the provided text snippet; setups of this kind typically use standard backbones such as ResNet, BERT, or ViT, depending on the dataset.

Comparison to Prior Work

vs. Inter-modality modeling: I2M2 adds explicit uni-modal pathways to prevent performance degradation when cross-modal signals are weak or absent.
vs. Intra-modality modeling: I2M2 includes a joint modeling pathway to capture complex interactions that simple ensembling misses.
vs. Uni-modal Learners: I2M2 includes the uni-modal predictors in the ensemble, so the combined system is designed to stay competitive with the best of them rather than fall below it.

Limitations

Computational cost increases as it requires training and running three separate classifiers (two uni-modal, one multi-modal).
Requires ground truth labels for supervision (cannot be applied zero-shot without training the components).

Reproducibility

Code: https://github.com/divyam3897/I2M2

Specific hyperparameters and model architectures are not detailed in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Supervised classification across healthcare and vision-language tasks

Benchmarks:

fastMRI (Automatic diagnosis using knee MRI exams)
MIMIC-III (Mortality and ICD-9 code prediction)
VQA (Visual Question Answering)
NLVR2 (Natural Language for Visual Reasoning)
AV-MNIST (Audio-visual digit classification)

Metrics:

Not reported in the provided text
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The strength of modality dependencies varies significantly by task: fastMRI relies heavily on intra-modality signals, while NLVR2 relies on inter-modality signals.
Tasks like AV-MNIST, MIMIC-III, and VQA require both inter- and intra-modality dependencies for optimal performance.
I2M2 performs well across these regimes because it does not commit to a single dependency assumption beforehand.

📚 Prerequisite Knowledge

Prerequisites

Probabilistic Graphical Models (conditional independence, generative processes)
Supervised Learning
Multi-modal Fusion (Early vs. Intermediate vs. Late)

Key Terms

Inter-modality dependencies: Statistical relationships between different modalities (e.g., image and text) and the label, capturing how they interact.

Intra-modality dependencies: Statistical relationships between a single modality (e.g., text only) and the label, independent of other modalities.

Selection variable (v): A binary random variable in the paper's generative model that induces the statistical dependency between modalities and the label (the 'interaction' mechanism).
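
A toy simulation can make the role of the selection variable concrete. Everything here is an assumption made for illustration and is not taken from the paper: the label and both modalities are independent fair coin flips (so the label-to-modality links are deliberately dropped to isolate the selection effect), and the selection rule `v = 1` when `y == x XOR x'` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Toy assumption: label and both modalities are independent fair coin flips,
# so neither modality alone carries any signal about y before selection.
y = rng.integers(0, 2, size=n)
x = rng.integers(0, 2, size=n)
xp = rng.integers(0, 2, size=n)

# Hypothetical selection rule: a sample enters the dataset (v = 1)
# only when the label agrees with the XOR of the two modalities.
v = (y == (x ^ xp)).astype(int)
keep = v == 1
y_s, x_s, xp_s = y[keep], x[keep], xp[keep]

# In the selected data, each modality alone is still uninformative,
# but the pair (x, x') determines the label exactly.
acc_x_only = np.mean(y_s == x_s)
acc_xp_only = np.mean(y_s == xp_s)
acc_joint = np.mean(y_s == (x_s ^ xp_s))

print(f"x only:  {acc_x_only:.3f}")
print(f"x' only: {acc_xp_only:.3f}")
print(f"joint:   {acc_joint:.3f}")
```

In this toy setting, conditioning on `v = 1` creates a purely inter-modality dependency: a uni-modal predictor stays near chance while a joint predictor can be perfect.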

Product of Experts: A technique to combine multiple probability distributions by multiplying them (or adding their logarithms) and normalizing.
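
A small numeric example of this definition, with made-up distributions over two labels. The helper name `product_of_experts` and the three input distributions are illustrative assumptions.

```python
import numpy as np


def product_of_experts(*dists):
    """Multiply distributions elementwise and renormalize over the last axis."""
    prod = np.prod(np.stack(dists), axis=0)
    return prod / prod.sum(axis=-1, keepdims=True)


p_a = np.array([0.6, 0.4])    # mildly favors label 0
p_b = np.array([0.5, 0.5])    # uninformative expert
p_ab = np.array([0.2, 0.8])   # confidently favors label 1

combined = product_of_experts(p_a, p_b, p_ab)
print(combined)  # approximately [0.273, 0.727]

# The same result in the log domain: add log-probabilities, then normalize.
log_sum = np.log(p_a) + np.log(p_b) + np.log(p_ab)
combined_log = np.exp(log_sum - np.log(np.exp(log_sum).sum()))
```

The uniform expert `p_b` leaves the result unchanged, while the more confident expert `p_ab` dominates the product.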

I2M2: Inter- & Intra-Modality Modeling—the proposed framework that ensembles uni-modal and multi-modal classifiers.

Explaining Away: A phenomenon in graphical models where, once a common effect is observed, learning that one cause is present changes (typically lowers) the probability of the other causes; used here to describe how modalities interact given the label.