DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency

📝 Paper Summary

Clinical Multi-Modal Learning, Missing Modality Imputation, Disentangled Representation Learning

DrFuse improves clinical prediction by disentangling modality-shared from modality-distinct features to handle missing data and using disease-aware attention to resolve conflicting modal information.

Core Problem

Clinical multi-modal learning faces two key challenges: frequent missing modalities (e.g., lack of X-rays) and inconsistent or contradictory signals between EHR and imaging data.

Why it matters:

In real-world datasets like MIMIC-IV, fewer than 20% of patients have X-ray images, so fusion methods that assume every modality is present cannot be applied to most patients.
EHR and images can provide contradictory risk signals (e.g., meningitis symptoms in EHR vs. clear X-ray), causing confusion for standard models.
The diagnostic importance of each modality varies significantly depending on the specific patient and disease target.

Concrete Example: In mortality prediction, a patient with meningitis might show high risk in EHR data due to symptoms, while their Chest X-ray (CXR) appears normal. A standard fusion model might average the two signals and dilute the EHR evidence, whereas DrFuse learns to weight the EHR more heavily for this specific disease context through its attention ranking loss.

Key Novelty

Disentangled Representation with Disease-Aware Attention

Decomposes inputs into 'shared' (common to both EHR/CXR) and 'distinct' (unique to one) representations to robustly handle missing views.
Aligns shared representations via Jensen-Shannon Divergence minimization so the shared component can be inferred even if one modality is missing.
Uses a margin ranking loss to force the model to pay more attention to the modality that is more accurate for the specific disease being predicted.

Architecture

Overview of DrFuse framework showing the parallel encoding of EHR and CXR, the extraction of shared/distinct features, and the fusion mechanism.

Breakthrough Assessment

7/10

Addresses the critical and under-explored issue of modal inconsistency in clinical data. The disentanglement approach for missing data is theoretically sound, though quantitative results are not provided in the snippet.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal clinical prediction (classification) using Time-Series EHR and Medical Images (CXR) with partially missing image data.

Inputs: Patient EHR time series X_EHR and optionally Chest X-Ray X_CXR (often missing).

Outputs: Prediction labels y (e.g., phenotypes/disease diagnosis).

Pipeline Flow

Encoders (extract features from EHR/CXR)
Disentanglement (separate into Shared/Distinct reps)
Alignment (JSD Loss + Logit Pooling)
Fusion (Disease-aware Attention)
Prediction (Classification heads)
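The routing implied by the steps above, in particular what happens when the CXR is absent, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `PatientSample`, `FusionInputs`, and `prepare_fusion_inputs` are hypothetical names, and the encoders and pooling function are passed in as plain callables so that any stand-in can be used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class PatientSample:
    ehr: List[Vector]        # EHR time series: one feature vector per time step
    cxr: Optional[Vector]    # flattened chest X-ray, or None when no image exists


@dataclass
class FusionInputs:
    ehr_distinct: Vector
    cxr_distinct: Optional[Vector]  # None when the CXR is missing
    shared: Vector
    available: List[bool]           # mask over [ehr_distinct, cxr_distinct, shared]


def prepare_fusion_inputs(
    sample: PatientSample,
    ehr_shared_enc: Callable[[List[Vector]], Vector],
    ehr_distinct_enc: Callable[[List[Vector]], Vector],
    cxr_shared_enc: Callable[[Vector], Vector],
    cxr_distinct_enc: Callable[[Vector], Vector],
    pool: Callable[[Vector, Vector], Vector],
) -> FusionInputs:
    """Build the three representations that the fusion step attends over."""
    ehr_shared = ehr_shared_enc(sample.ehr)
    ehr_distinct = ehr_distinct_enc(sample.ehr)

    if sample.cxr is None:
        # Missing image: the shared slot is filled from the EHR alone and the
        # CXR-distinct slot is masked out of the attention.
        return FusionInputs(ehr_distinct, None, ehr_shared, [True, False, True])

    # Paired sample: both shared views are combined by the pooling function.
    shared = pool(ehr_shared, cxr_shared_enc(sample.cxr))
    return FusionInputs(
        ehr_distinct, cxr_distinct_enc(sample.cxr), shared, [True, True, True]
    )
```

The point of the design is that the EHR-only path and the paired path produce inputs of the same shape, so the same fusion and prediction modules serve both cases.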

System Modules

EHR Encoders (Feature Extraction)

Extract features from EHR time series; splits into distinct and shared streams

Model or implementation: Transformer (2 separate encoders, sharing the first layer)

CXR Encoders (Feature Extraction)

Extract features from images; splits into distinct and shared streams

Model or implementation: ResNet50 (2 separate encoders)

Logit Pooling Layer

Aligns the distributions of shared representations from EHR and CXR so they contain common info

Model or implementation: Mathematical operation (Mixture of logits)
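One plausible reading of "mixture of logits" is sketched below, under the assumption that each coordinate of a shared representation is treated as the logit of a Bernoulli distribution: the two logits are mapped to probabilities, averaged, and mapped back. The function names are illustrative, and the summary does not confirm that this matches the paper's exact formula.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def logit(p: float) -> float:
    """Inverse of sigmoid; requires 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logit_pool(h_ehr: List[float], h_cxr: List[float]) -> List[float]:
    """Assumed pooling rule: logit of the equal-weight mixture of the two
    Bernoulli distributions whose logits are h_ehr[i] and h_cxr[i]."""
    return [
        logit(0.5 * (sigmoid(a) + sigmoid(b)))
        for a, b in zip(h_ehr, h_cxr)
    ]
```

Under this reading the pooled value always lies between the two input logits and does not depend on their order. Extremely large logits would saturate `sigmoid` and make `logit` fail, so a production version would work in log-space instead.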

Disease-Aware Attention

Weights the three representations (EHR-distinct, CXR-distinct, Shared) based on their utility for the specific disease target

Model or implementation: Scaled Dot-Product Attention with Masking
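A minimal sketch of scaled dot-product attention with masking over the three representations follows. It assumes the query is a learned per-disease vector and that keys and values are derived from the three representations; the summary does not specify either detail, so `masked_attention` and its arguments are illustrative only.

```python
import math
from typing import List, Tuple

Vector = List[float]


def masked_attention(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    available: List[bool],
) -> Tuple[List[float], Vector]:
    """Attend over a few slots, giving zero weight to unavailable ones.

    `values` must hold a vector for every slot; the content of a masked slot
    is irrelevant because its weight is exactly zero.
    """
    if not any(available):
        raise ValueError("at least one slot must be available")

    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

    # Softmax restricted to available slots, shifted by the max for stability.
    top = max(s for s, ok in zip(scores, available) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, available)]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, fused
```

With one query per disease label, each label gets its own weighting of EHR-distinct, CXR-distinct, and shared features, and a missing CXR simply sets `available[1]` to `False`.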

Novel Architectural Elements

Logit Pooling mechanism for aligning shared representation distributions
Disease-aware attention module driven by a margin ranking loss based on auxiliary classifier performance
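The ranking idea can be illustrated with the standard margin ranking loss, max(0, -y * (x1 - x2) + margin). The sketch below assumes the comparison is made between the EHR-distinct and CXR-distinct attention weights for one disease label, with the target order taken from the auxiliary classifiers' losses. The pairing, the margin value, and the tie handling are assumptions rather than details from the paper.

```python
def margin_ranking_loss(x1: float, x2: float, y: float, margin: float = 0.0) -> float:
    """Standard margin ranking loss: zero once x1 and x2 are ordered as y
    demands (y = +1 means x1 should be larger) by at least `margin`."""
    return max(0.0, -y * (x1 - x2) + margin)


def attention_ranking_loss(
    attn_ehr: float,
    attn_cxr: float,
    aux_loss_ehr: float,
    aux_loss_cxr: float,
    margin: float = 0.1,
) -> float:
    """Penalize attention that favors the less accurate modality.

    The modality whose auxiliary classifier has the lower loss is taken to be
    the more reliable one for this disease and should receive more attention.
    Ties are resolved in favor of the CXR here (an arbitrary choice).
    """
    y = 1.0 if aux_loss_ehr < aux_loss_cxr else -1.0
    return margin_ranking_loss(attn_ehr, attn_cxr, y, margin)
```

In the meningitis example, the EHR auxiliary head would have the lower loss, so the loss is zero only when the EHR attention weight exceeds the CXR weight by at least the margin.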

Modeling

Base Model: Transformer (EHR) and ResNet50 (CXR)

Training Method: End-to-end supervised learning with auxiliary alignment and ranking losses

Objective Functions:

Purpose: Align shared representations.
Formally: Jensen-Shannon Divergence (JSD) between EHR and CXR shared logits.

Purpose: Ensure distinct and shared features are different.
Formally: Orthogonality constraint (cosine similarity minimization) between distinct and shared reps.

Purpose: Enforce correct attention weighting.
Formally: Margin Ranking Loss comparing attention weights to the order of auxiliary classifier losses.

Purpose: Classification accuracy.
Formally: Cross-entropy loss on final prediction and auxiliary heads.
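The alignment and orthogonality terms can be sketched in a few lines. Two assumptions are made here that the summary does not settle: logits are turned into a distribution with a softmax before the JSD is computed, and the orthogonality penalty is the absolute cosine similarity (minimizing the raw cosine would push vectors toward opposite directions rather than perpendicular ones).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to a probability vector (shifted for stability)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p: List[float], q: List[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(distinct: List[float], shared: List[float]) -> float:
    """Zero when the vectors are perpendicular, one when they are collinear."""
    return abs(cosine_similarity(distinct, shared))
```

The JSD is symmetric and bounded above by log 2 (in nats), which makes it a convenient alignment term when neither modality should be treated as the reference distribution.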

Compute: Not reported in the paper

Comparison to Prior Work

vs. Late Fusion: DrFuse learns shared latent interactions explicitly rather than just averaging outputs.
vs. Generative Methods: DrFuse aligns latent representations instead of trying to generate high-dimensional missing pixels (images) from low-dimensional tabular data (EHR).

Limitations

No quantitative results provided in the text snippet to verify performance claims.
Training with multiple auxiliary losses (ranking, orthogonality, JSD) adds complexity and may require careful tuning of the loss weights.
Assumes that the modalities share an information component, which may be small in highly heterogeneous data.

Reproducibility

Code: https://github.com/dorothy-yao/drfuse

Code is publicly available on GitHub. Hyperparameters and specific training compute resources are not detailed in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Clinical phenotype classification using multi-modal ICU data.

Benchmarks:

MIMIC-IV (EHR-based clinical prediction)
MIMIC-CXR (Medical imaging analysis)

Metrics:

Not reported in the paper

Statistical methodology: Not explicitly reported in the paper

Main Takeaways

The method handles missing modalities (specifically missing CXR) by relying on the shared representation derived from the available EHR data.
Quantitative results were not included in the provided text snippet, but the authors claim significant improvement over state-of-the-art models on MIMIC-IV phenotype classification.
The authors argue that explicitly modeling modal inconsistency via attention ranking benefits heterogeneous clinical data.

📚 Prerequisite Knowledge

Prerequisites

Understanding of multi-modal fusion strategies (early vs. late fusion)
Basic knowledge of representation learning and disentanglement
Familiarity with clinical data types (EHR time series, X-ray images)

Key Terms

EHR: Electronic Health Records—digital records of patient health information, often time-series data like vitals and lab tests

CXR: Chest X-Ray—a medical imaging modality used to diagnose conditions affecting the chest

Disentangled Representation: A learning technique that separates feature variables into distinct, independent factors (here, shared vs. modality-specific)

JSD: Jensen-Shannon Divergence—a symmetrized and smoothed version of Kullback-Leibler divergence used to measure similarity between probability distributions

Logit Pooling: A proposed method to combine the logits of two distributions to find a mixture distribution, ensuring smooth alignment of shared representations

MIMIC-IV: A large-scale, publicly available dataset comprising de-identified health-related data for patients admitted to the emergency department or an intensive care unit of Beth Israel Deaconess Medical Center

Orthogonality Constraint: A loss term pushing two vector representations toward being perpendicular (zero cosine similarity), so that distinct features do not overlap with shared features