MM-OpenFGL: A Comprehensive Benchmark for Multimodal Federated Graph Learning

📝 Paper Summary

Federated Graph Learning, Multimodal Learning

MM-OpenFGL establishes the first comprehensive benchmark for Multimodal Federated Graph Learning (MMFGL), providing standardized datasets, simulation strategies for modality/topology heterogeneity, and a modular evaluation framework.

Core Problem

Existing Federated Graph Learning (FGL) frameworks focus on single-modality graphs and fail to address the specific challenges of multimodal data distribution, such as disjoint modalities and cross-client structural mismatches.

Why it matters:

Real-world multimodal graphs (e.g., social networks with text/images) are naturally distributed across platforms due to privacy and competition, preventing centralized training.
Naively applying standard FGL methods to multimodal data often performs worse than isolated training, because these methods cannot reconcile cross-modal semantic conflicts.
The lack of a unified benchmark and formal problem definition hinders progress, leaving researchers without tools to rigorously evaluate MMFGL algorithms.

Concrete Example: Different social media platforms maintain independent user graphs where posts contain both text and images. Cross-platform sharing is prohibited. A naive federated aggregation might average weights from a text-heavy platform with an image-heavy one without alignment, degrading performance compared to training locally on just one modality.

Key Novelty

Tri-dimensional Simulation Strategy for MMFGL

Formalizes MMFGL by introducing simulation strategies across three axes: Modality (IID vs. NonIID/missing modalities), Topology (Available vs. Unavailable/hidden structure), and Label (IID vs. NonIID).
Integrates a modular pipeline supporting both End-to-End training (task-specific) and Two-Stage training (federated graph foundation models), allowing evaluation of pre-training transferability.
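
The three simulation axes can be made concrete by enumerating their combinations. The sketch below assumes each axis takes exactly the two options named above and that settings are formed as a full cross product, which gives 2 x 2 x 2 = 8 settings. The dictionary layout and the function name are illustrative and are not taken from the benchmark's code.

```python
from itertools import product

# Axis names and options follow the summary; the data layout is hypothetical.
AXES = {
    "modality": ("IID", "NonIID"),
    "topology": ("Available", "Unavailable"),
    "label": ("IID", "NonIID"),
}

def enumerate_settings(axes=AXES):
    """Return every combination of axis options as a list of dicts."""
    names = list(axes)
    combos = product(*(axes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

settings = enumerate_settings()
for setting in settings:
    print(setting)
print(len(settings))  # 8
```

The hardest case in such a grid would be the one where all three axes take their heterogeneous option at once.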

Evaluation Highlights

Multimodal GNNs outperform single-modality baselines by significant margins (e.g., MM-GCN vs. GCN), confirming the need to fuse topology with multimodal semantics.
Specialized heterogeneous FGL methods like MH-pFLID consistently outperform standard FL baselines in challenging Modality-NonIID settings across 7 datasets.
Graph Foundation Models (GFMs) demonstrate superior scalability, successfully handling complex downstream tasks like modality generation where traditional MM-GNNs fail.

Breakthrough Assessment

9/10

Foundational work that defines a new sub-field (MMFGL). It provides the first standardized benchmark, datasets, and problem formalization, filling a critical gap in federated learning research.

⚙️ Technical Details

Problem Definition

Setting: Distributed learning on Multimodal-Attributed Graphs (MMAGs) where graphs are partitioned across clients.

Inputs: Set of local graphs G_k = (V_k, E_k, X_k^M) where nodes have heterogeneous modalities M (e.g., text, image).

Outputs: Global model parameters w (End-to-End) or pre-trained backbone (Two-Stage) for downstream tasks.
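
One way to picture the input G_k = (V_k, E_k, X_k^M) is as a small container per client. The sketch below is an assumption about how such a record could look: the class name, field names, and the convention of using None for a missing modality or a hidden edge list are hypothetical and are not the benchmark's data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

import numpy as np

@dataclass
class LocalGraph:
    """Client k's subgraph G_k = (V_k, E_k, X_k^M). All field names are hypothetical."""
    node_ids: List[int]                                   # V_k
    edges: Optional[List[Tuple[int, int]]]                # E_k, None if topology is unavailable
    features: Dict[str, Optional[np.ndarray]] = field(default_factory=dict)  # X_k^M per modality

    def available_modalities(self) -> List[str]:
        """Modalities for which this client actually holds a feature matrix."""
        return sorted(m for m, x in self.features.items() if x is not None)

# A client that holds text features only, with its edge list visible.
client_a = LocalGraph(
    node_ids=[0, 1, 2],
    edges=[(0, 1), (1, 2)],
    features={"text": np.zeros((3, 4)), "image": None},
)
print(client_a.available_modalities())  # ['text']
```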

Pipeline Flow

Modality Encoder (extracts features from raw data)
Local GNN/Model Training (Client-side optimization)
Federated Aggregation (Server-side model update)
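
The flow above can be sketched as a toy federated loop. This is a minimal illustration, not the benchmark's implementation: a linear least-squares model stands in for the local GNN, the feature matrices are assumed to be already produced by a frozen modality encoder, and the server step is plain FedAvg (size-weighted averaging). The function names, learning rate, and round counts are assumptions.

```python
import numpy as np

def local_update(w_global, x, y, lr=0.1, epochs=5):
    """Client side: a few gradient steps on a linear model (stand-in for a GNN)."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = x.T @ (x @ w - y) / len(y)   # gradient of 0.5 * mean squared error
        w -= lr * grad
    return w

def fedavg(client_weights, client_sizes):
    """Server side: average client parameters, weighted by local sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)      # shape (num_clients, dim)
    return (sizes[:, None] * stacked).sum(axis=0) / sizes.sum()

# Synthetic "pre-encoded" features for three clients of different sizes.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for n in (30, 50, 20):
    x = rng.normal(size=(n, 3))
    clients.append((x, x @ true_w))

w = np.zeros(3)
for _ in range(50):                         # communication rounds
    updates = [local_update(w, x, y) for x, y in clients]
    w = fedavg(updates, [len(y) for _, y in clients])
print(np.round(w, 2))
```

Because all clients here share one noise-free target, plain averaging recovers it; the negative-transfer cases discussed in this summary arise when client objectives conflict, which this toy setup does not model.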

System Modules

Modality Encoder

Extract features from raw multimodal data (text/images) to create node attributes

Model or implementation: Varied (e.g., Llama-3.2-1B, Qwen2-7B for text; ViG, DINOv2 for vision)

Local Client Model

Learn node representations using local graph structure and multimodal features

Model or implementation: MM-GNNs (e.g., MM-GCN, MGAT) or Foundation Models (e.g., OFA, GFT)

Server Aggregator

Aggregate client updates to update global model

Model or implementation: Aggregation algorithm (e.g., FedAvg, FedProx, FedSPA)

Novel Architectural Elements

Integration of extensive modality-specific simulation strategies (Modality-NonIID via Dirichlet distribution) directly into the FGL pipeline.
Support for Two-Stage Federated Graph Foundation Model pipeline (Pre-train then Fine-tune) within a standard FGL benchmark.
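
The summary does not spell out how the Dirichlet distribution is used for Modality-NonIID, so the following is only one plausible reading. It assumes each client draws a modality mix from a Dirichlet prior and then keeps each modality on each node with probability proportional to that mix, so a small concentration parameter leaves a client with mostly one modality. The function name, the keep-probability rule, and the boolean-mask output are all hypothetical.

```python
import numpy as np

def modality_masks(nodes_per_client, modalities, alpha, seed=0):
    """Return one boolean mask per client with shape (num_nodes, num_modalities).

    mask[i, j] is True when node i keeps modality j. Hypothetical scheme:
    the client's dominant modality is always kept, others are kept with
    probability mix[j] / max(mix).
    """
    rng = np.random.default_rng(seed)
    masks = []
    for n in nodes_per_client:
        mix = rng.dirichlet(alpha * np.ones(len(modalities)))
        keep_prob = mix / mix.max()
        masks.append(rng.random((n, len(modalities))) < keep_prob)
    return masks

masks = modality_masks([10, 20], ["text", "image"], alpha=0.5)
for k, mask in enumerate(masks):
    print(k, mask.mean(axis=0))  # fraction of nodes keeping each modality
```

Larger values of alpha push the mix toward uniform, which approaches the Modality-IID case where every client sees every modality.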

Modeling

Base Model: Varies (includes MM-GCN, MGAT, UniGraph, OFA, GFT)

Training Method: Federated Learning (iterative local update and global aggregation)

Objective Functions:

Purpose: Optimize task-specific performance (e.g., classification).

Formally: Standard Cross-Entropy or task-specific loss on local data.

Purpose: Align multimodal features (in Foundation Models).

Formally: Contrastive learning or masked feature reconstruction objectives.
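
The two objective families can be illustrated in a few lines. The source names only "cross-entropy" and "contrastive learning" without a specific form, so the symmetric InfoNCE loss below is a commonly used stand-in chosen for illustration, not necessarily what any evaluated model uses. The temperature value and function names are assumptions.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class (task-specific objective)."""
    return float(-log_softmax(logits)[np.arange(len(labels)), labels].mean())

def info_nce(text_emb, image_emb, temperature=0.1):
    """Symmetric contrastive loss: row i of each matrix is a matched text/image pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature           # cosine similarities, scaled
    targets = np.arange(len(t))              # the matching pair sits on the diagonal
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
print(info_nce(z, z))        # aligned pairs: low loss
print(info_nce(z, z[::-1]))  # mismatched pairs: higher loss
```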

Adaptation: Fine-tuning of pre-trained backbones or training from scratch

Trainable Parameters: GNN weights, projection layers (encoders often frozen to save compute)

Training Data:

19 datasets across 7 domains (e.g., Amazon-Books, Bili-Video)
Splits determined by simulation strategy (e.g., Dirichlet partition for NonIID)
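
A Dirichlet partition for the Label-NonIID case is commonly implemented by splitting each class across clients according to a sampled proportion vector. The sketch below shows that generic recipe at the node level; the function name and defaults are assumptions, and how the benchmark handles edges that cross client boundaries is not described in this summary.

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha, seed=0):
    """Assign node indices to clients so that class proportions follow Dir(alpha)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn proportions into split points over this class's nodes.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_indices[k].extend(part.tolist())
    return client_indices

labels = np.repeat(np.arange(4), 50)        # 4 classes, 50 nodes each
parts = dirichlet_label_partition(labels, num_clients=5, alpha=0.5)
for k, part in enumerate(parts):
    print(k, np.bincount(labels[part], minlength=4))  # per-client class histogram
```

Smaller alpha values concentrate each class on fewer clients; large alpha values approach an IID split.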

Key Hyperparameters:

communication_rounds: Not explicitly reported in the paper
local_epochs: Not explicitly reported in the paper
optimizer: Not explicitly reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. OpenFGL: MM-OpenFGL adds multimodal data support (raw text/image), multimodal-specific tasks, and modality heterogeneity simulations.
vs. FedAvg/FedProx: MM-OpenFGL evaluates these generic methods on graph-specific and multimodal-specific challenges where they often fail.
vs. FedSPA/FedIIH: MM-OpenFGL tests these structure-focused methods under conditions of semantic (modal) heterogeneity.

Limitations

Computational cost of large-scale multimodal encoders (e.g., DINOv2) on resource-constrained edge clients is not deeply optimized.
The privacy of transmitted multimodal representations (embeddings) is evaluated empirically, but the main text gives no theoretical guarantee against inversion attacks.
Specific hyperparameters for the baseline experiments (learning rates, batch sizes) are missing from the main text.

Reproducibility

Code: https://anonymous.4open.science/r/TEST-SA7D7A

Benchmark library, datasets, and leaderboards are publicly available at https://anonymous.4open.science/r/TEST-SA7D7A. Specific training hyperparameters (learning rates, batch sizes) for the baseline experiments are not detailed in the main text.

📊 Experiments & Results

Evaluation Setup

Federated training across simulated distributed clients using 19 multimodal datasets.

Benchmarks:

Bili series (Sequence recommendation / Link prediction)
PixelRec50K (Multimodal product graph tasks)
Social Media (Movies, RedditS) (Node classification)

Metrics:

Accuracy (Node Classification)
F1 Score (Node Classification)
Recall@K (Retrieval)
BLEU / ROUGE-L (Modality Generation)

Statistical methodology: Not explicitly reported in the paper
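
For reference, two of the listed metrics can be computed as follows. These are the textbook definitions; the paper's exact variants (for example, how Recall@K treats queries with several relevant items) are not stated in this summary, so the helper names and the per-query averaging are assumptions.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class matches the label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def recall_at_k(scores, relevant, k):
    """Mean over queries of (relevant items found in the top k) / (relevant items).

    scores: array of shape (num_queries, num_items); higher means more relevant.
    relevant: one set of relevant item ids per query.
    """
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [len(set(row.tolist()) & rel) / len(rel) for row, rel in zip(top_k, relevant)]
    return float(np.mean(hits))

scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.3]])
relevant = [{0, 2}, {2}]
print(recall_at_k(scores, relevant, k=1))  # 0.25
print(recall_at_k(scores, relevant, k=2))  # 1.0
print(accuracy([0, 1, 1], [0, 1, 0]))
```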

Key Results

Comparison of naive federated adaptations vs. isolated training shows that federated learning does not always guarantee improvement in multimodal settings.

Benchmark	Metric	Baseline	This Paper	Δ
Cloth (Node Classification)	Accuracy	82.45	78.32	-4.13

Evaluation of Graph Foundation Models (GFMs) against traditional architectures on complex tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Flickr30k	BLEU-4	Not reported in the paper	Not reported in the paper	Not reported in the paper

Experiment Figures

Performance comparison of single-modality (MLP, GCN) vs. multimodal models (MM-GCN, MGAT) across datasets.

Radar charts comparing robustness of heterogeneous FGL methods (MH-pFLID, FedMVP, etc.) across 8 simulation settings.

Main Takeaways

Necessity: Multimodal GNNs consistently outperform single-modality models, and graph structure is also essential (GCN > MLP), validating the MMFGL paradigm.
Effectiveness: Naive adaptations of FL to MM graphs often fail (negative transfer) due to semantic conflicts; specialized heterogeneous FGL methods are required.
Robustness: Modality-NonIID combined with Label-NonIID creates the most severe performance degradation; MH-pFLID is the most robust method in these settings.
Scalability: Graph Foundation Models (Two-Stage pipeline) generalize better to diverse downstream tasks (like generation) than task-specific MM-GNNs.

📚 Prerequisite Knowledge

Prerequisites

Federated Learning (FL) basics (FedAvg, client-server architecture)
Graph Neural Networks (GNNs)
Multimodal learning (feature fusion, encoders)

Key Terms

MMFGL: Multimodal Federated Graph Learning—training graph models across decentralized clients holding multimodal data (text, images, graph structure) without sharing raw data.

MMAG: Multimodal-Attributed Graph—a graph where nodes are associated with multiple types of data modalities (e.g., an image and a text description) and edges represent relationships.

Modality-NonIID: A simulation setting where different clients hold disjoint or partial sets of modalities (e.g., Client A has only text, Client B has only images).

Topology-Unavailable: A setting where the explicit graph structure (edges) is missing or hidden for privacy, requiring models to infer or reconstruct topology.

Graph Foundation Model (GFM): Large-scale pre-trained graph models designed to learn generic structural and semantic representations that can be fine-tuned for various downstream tasks.

MH-pFLID: A heterogeneous federated graph learning method designed to handle personalization and system heterogeneity.

FedAvg: Federated Averaging—the standard algorithm for aggregating local model updates into a global model by averaging weights.

Homophily: The tendency of nodes with similar labels or features to be connected in a graph.

End-to-End Pipeline: Standard federated learning approach where a task-specific model is trained from scratch via iterative communication.

Two-Stage Pipeline: A strategy involving federated pre-training of a foundation model followed by local fine-tuning on specific client tasks.