Multi-Modal Learning with Missing Modality via Shared-Specific Feature Modelling

📝 Paper Summary

Multi-modal learning
Missing modality adaptation

ShaSpec handles missing modalities in both segmentation and classification by disentangling features into shared (modality-consistent) and specific (modality-unique) components via auxiliary alignment and classification tasks.

Core Problem

Existing multi-modal models struggle when data modalities are missing during training or testing, often requiring complex, task-specific architectures that don't generalize well between classification and segmentation.

Why it matters:

Real-world medical data (e.g., MRI) often lacks specific sequences (modalities) due to cost or patient constraints, breaking models trained on full data.
Current solutions are either 'dedicated' (training separate models for every missing combination) or task-specific (hard to adapt from segmentation to classification).
Sophisticated generative approaches for missing data are often computationally heavy and unstable.

Concrete Example: In brain tumor segmentation (BraTS), a model might expect 4 MRI modalities (Flair, T1, T1ce, T2). If T1ce is missing for a patient, standard multi-modal models fail. Dedicated approaches would require training 15 different models for all possible missing combinations.
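The count of 15 follows from enumerating every non-empty subset of the four MRI modalities (2^4 - 1), which includes the complete set. A minimal sketch of that arithmetic, where the function name `available_subsets` is hypothetical and chosen only for illustration:

```python
from itertools import combinations

MODALITIES = ("Flair", "T1", "T1ce", "T2")


def available_subsets(modalities):
    """Return every non-empty subset of modalities that could be present."""
    subsets = []
    for size in range(1, len(modalities) + 1):
        subsets.extend(combinations(modalities, size))
    return subsets


subsets = available_subsets(MODALITIES)
print(len(subsets))  # 2**4 - 1 = 15
```

A dedicated regime would train one model per entry of `subsets`, whereas a non-dedicated regime would train a single model that accepts any of them.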

Key Novelty

Shared-Specific Feature Modelling (ShaSpec)

Decomposes input data into 'shared' features (consistent across all modalities) and 'specific' features (unique to each modality) using distinct encoders.
Enforces this separation via auxiliary tasks: 'Distribution Alignment' (making shared features indistinguishable by modality) and 'Domain Classification' (ensuring specific features predict their source modality).
Handles missing modalities by generating the missing 'shared' feature as an average of available shared features, while dropping the missing 'specific' feature.

Architecture

The ShaSpec architecture during full-modality training/evaluation.

Evaluation Highlights

On BraTS2018 (Brain Tumor Segmentation), improves on the state of the art by more than 3% for enhancing tumor, more than 5% for tumor core, and more than 3% for whole tumor Dice scores.
Outperforms competing methods like HeMIS and Robust-Mseg by a large margin on segmentation accuracy.
Demonstrates versatility by achieving state-of-the-art results on both medical image segmentation and standard computer vision classification tasks.

Breakthrough Assessment

8/10

Significant performance jumps (3-5%) on established benchmarks (BraTS) with a method that is notably simpler and more generalizable (handling both segmentation and classification) than prior complex generative approaches.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal learning with N modalities where any subset may be missing during training or evaluation.

Inputs: Set of available modalities M = {x^(i)} where x^(i) is the i-th modality input.

Outputs: Prediction y (segmentation map or classification label).

Pipeline Flow

Modality-specific Encoders (Shared & Specific streams)
Missing Feature Generation (if needed)
Residual Feature Fusion
Decoder / Prediction Head

System Modules

Specific Encoders (Encoding)

Extract features unique to each modality

Model or implementation: Modality-specific CNN encoders

Shared Encoder (Encoding)

Extract features consistent across modalities

Model or implementation: Weight-shared CNN encoder

Missing Feature Generator

Synthesize shared features for missing modalities

Model or implementation: Element-wise average of the available shared features

Projection & Fusion Layer

Combine shared and specific features into a comprehensive embedding

Model or implementation: Concatenation followed by convolution/projection

Decoder

Generate final task prediction

Model or implementation: Task-dependent decoder (CNN for segmentation, FC layers for classification)

Novel Architectural Elements

Dual-stream encoding per modality explicitly optimizing for shared vs. specific disentanglement via auxiliary losses
Simple averaging strategy for missing shared features combined with residual fusion of specific features
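The two elements above can be sketched on toy feature vectors. This is an illustrative reading of the summary, not the authors' implementation: the helper names, the 2-dimensional features, and the fixed `projection` matrix are all assumptions, and the choice to add the projected features back onto the shared feature is one plausible interpretation of "residual fusion". For a missing modality the specific feature is dropped, so the sketch passes the substituted shared feature through unchanged.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized feature vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def fill_missing_shared(shared, modalities):
    """Give each missing modality the mean of the available shared features."""
    if not shared:
        raise ValueError("at least one modality must be available")
    substitute = mean_vectors(list(shared.values()))
    return {name: shared.get(name, substitute) for name in modalities}


def residual_fuse(shared_vec, specific_vec, projection):
    """Project concat(shared, specific), then add the result to shared.

    Assumption: the residual connection targets the shared feature.
    """
    concatenated = shared_vec + specific_vec
    projected = [sum(w * x for w, x in zip(row, concatenated)) for row in projection]
    return [s + p for s, p in zip(shared_vec, projected)]


modalities = ["Flair", "T1", "T1ce", "T2"]
# T1ce is missing: it has neither a shared nor a specific feature.
shared = {"Flair": [1.0, 2.0], "T1": [3.0, 4.0], "T2": [5.0, 6.0]}
specific = {"Flair": [0.5, 0.5], "T1": [0.1, 0.2], "T2": [0.3, 0.4]}
# Hypothetical stand-in for a learned projection (2 outputs, 4 inputs).
projection = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

filled = fill_missing_shared(shared, modalities)
fused = {
    name: residual_fuse(filled[name], specific[name], projection)
    if name in specific
    else filled[name]
    for name in modalities
}
print(fused["T1ce"])
```

In a real model the features would be tensors and the projection would be learned; the point here is only that the missing-feature step needs no extra trainable component.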

Modeling

Base Model: Task-dependent backbones (e.g., U-Net style for segmentation)

Training Method: Joint optimization of task loss and auxiliary disentanglement losses

Objective Functions:

Domain Classification Objective (DCO)

Purpose: Ensure specific features identify the modality.

Formally: DCO minimizes Cross-Entropy between predicted modality and true modality label t^(i).

Distribution Alignment Objective (DAO)

Purpose: Ensure shared features are modality-agnostic.

Formally: DAO minimizes Cross-Entropy (or KL divergence) between shared feature predictions and uniform distribution u^(i) (confusing the classifier).

Task Loss

Purpose: Optimize main task performance.

Formally: Standard task loss (e.g., Dice loss for segmentation, CE for classification).
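One way to write these objectives down, as a sketch rather than the paper's exact notation: here s^(i) and r^(i) denote the shared and specific features of modality i, and g is a modality classifier. These three symbols, the summation over modalities, and the weighted-sum form of the total loss are assumptions for illustration; only t^(i), u^(i), alpha, and beta come from the text.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{dco}} &= \sum_{i=1}^{N} \mathrm{CE}\big(g(r^{(i)}),\, t^{(i)}\big), \\
\mathcal{L}_{\mathrm{dao}} &= \sum_{i=1}^{N} \mathrm{CE}\big(g(s^{(i)}),\, u^{(i)}\big), \\
\mathcal{L} &= \mathcal{L}_{\mathrm{task}} + \alpha\,\mathcal{L}_{\mathrm{dao}} + \beta\,\mathcal{L}_{\mathrm{dco}}.
\end{aligned}
```

With t^(i) one-hot and u^(i) uniform over the N modalities, the DCO term rewards specific features that reveal their source, while the DAO term rewards shared features from which the source cannot be told.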

Training Data:

BraTS2018 dataset (Medical Segmentation)
Standard computer vision classification datasets (specific datasets not named in the summarized text)

Key Hyperparameters:

alpha: Trade-off factor for DAO loss (value not given in the summarized text)
beta: Trade-off factor for DCO loss (value not given in the summarized text)
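To make the role of the two trade-off factors concrete, the sketch below computes a cross-entropy against a one-hot target (DCO-style) and against a uniform target (DAO-style), then combines them with a task loss. Everything numeric here is a placeholder: the logits, the task loss of 1.0, and the values passed for `alpha` and `beta` are illustrative assumptions, not settings from the paper, and the weighted-sum form is assumed from the description of alpha and beta as trade-off factors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def total_loss(task_loss, dao_loss, dco_loss, alpha, beta):
    """Assumed weighted sum: task + alpha * DAO + beta * DCO."""
    return task_loss + alpha * dao_loss + beta * dco_loss


N = 4                                # number of modalities
one_hot = [0.0, 0.0, 1.0, 0.0]       # t^(i): the true modality is index 2
uniform = [1.0 / N] * N              # u^(i): uniform target for shared features

# Specific features: classifier logits that point clearly at modality 2.
dco = cross_entropy([0.1, 0.2, 3.0, 0.1], one_hot)
# Shared features: a fully confused classifier attains the minimum, log(N).
dao = cross_entropy([0.0, 0.0, 0.0, 0.0], uniform)

# alpha and beta below are placeholders, not values from the paper.
loss = total_loss(task_loss=1.0, dao_loss=dao, dco_loss=dco, alpha=0.1, beta=0.02)
print(round(dao, 4), round(math.log(N), 4))
```

Note that with a uniform target the DAO term cannot fall below log(N), so its absolute value is less informative than how close it sits to that floor.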

Compute: Not reported in the paper

Comparison to Prior Work

vs. HeMIS: ShaSpec disentangles shared/specific features rather than just pooling statistical moments.
vs. Robust-Mseg: ShaSpec avoids complex generative reconstruction of input images, focusing only on latent feature alignment, leading to simpler training and better stability.
vs. SMIL: ShaSpec uses explicit distribution alignment rather than meta-learning for feature reconstruction.
vs. General SOTA: Adapts seamlessly to both segmentation and classification, whereas others are task-specific.

Limitations

Relies on the assumption that shared features can be approximated by averaging available ones.
Requires training auxiliary classifiers (DAO/DCO) alongside the main model.
Exact hyperparameters (alpha, beta) for loss balancing require tuning.

Reproducibility

Code: https://github.com/billhhh/ShaSpec/

📊 Experiments & Results

Evaluation Setup

Medical Image Segmentation with missing modalities

Benchmarks:

BraTS2018 (Brain Tumor Segmentation)
Computer Vision Classification (Multi-modal Classification)

Metrics:

Dice Score (Segmentation Accuracy)
Classification Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

ShaSpec achieves substantial improvements over SOTA on the BraTS2018 segmentation benchmark.

Benchmark	Metric	Baseline	This Paper	Δ
BraTS2018	Dice Score (Enhancing Tumor)	Not reported in the paper	Not reported in the paper	> +3%
BraTS2018	Dice Score (Tumor Core)	Not reported in the paper	Not reported in the paper	> +5%
BraTS2018	Dice Score (Whole Tumor)	Not reported in the paper	Not reported in the paper	> +3%

Experiment Figures

The ShaSpec architecture during missing-modality scenarios.

Main Takeaways

ShaSpec outperforms competing methods like HeMIS, HVED, and Robust-Mseg by significant margins (3-5% improvements) on BraTS2018.
The method is effective for both dedicated (one model per missing pattern) and non-dedicated (one model for all patterns) training regimes.
Simplicity of design allows adaptation to both segmentation and classification, unlike prior task-specific architectures.

📚 Prerequisite Knowledge

Prerequisites

Multi-modal learning basics
Feature disentanglement
Domain adaptation concepts (domain classification/alignment)
Medical image segmentation metrics (Dice score)

Key Terms

ShaSpec: Shared-Specific Feature Modelling—the proposed architecture that disentangles features into modality-shared and modality-specific components.

Modality-shared features: Representations that capture information consistent across all input types (e.g., shape of a tumor visible in all MRI scans).

Modality-specific features: Representations capturing information unique to a single input type (e.g., texture specific to a T1-weighted MRI).

Distribution Alignment Objective (DAO): An auxiliary loss function that forces shared features from different modalities to have similar distributions, often by confusing a discriminator.

Domain Classification Objective (DCO): An auxiliary loss function ensuring specific features retain enough information to identify which modality they came from.

Dedicated training: Training a specific separate model for every possible combination of missing modalities.

Non-dedicated training: Training a single unified model that can handle various missing modality combinations dynamically.

BraTS: Multimodal Brain Tumor Segmentation Challenge—a standard benchmark dataset for medical image segmentation.

Dice score: A standard metric for evaluating segmentation accuracy, measuring overlap between predicted and ground truth regions.
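For reference, the Dice score of a predicted mask P against a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions made for this example, and benchmark evaluation scripts may handle that edge case differently.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / denominator


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_score(pred, target))  # 2 * 1 / (2 + 1), about 0.667
```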