SleepFM: Multi-modal Representation Learning for Sleep Across Brain Activity, ECG and Respiratory Signals

📝 Paper Summary

Multi-modal Representation Learning | Healthcare/Medical AI | Time-series Analysis

SleepFM is the first multi-modal sleep foundation model, trained on 100,000+ hours of PSG data using a novel leave-one-out contrastive learning approach to integrate brain, cardiac, and respiratory signals.

Core Problem

Traditional sleep analysis relies on labor-intensive manual inspection or narrow supervised models that fail to leverage the full breadth of unlabeled physiological dynamics across diverse PSG sensors.

Why it matters:

Sleep monitoring is critical for diagnosing disorders and assessing overall brain, pulmonary, and cardiac health.
Existing supervised methods are limited by the availability of labeled data and do not exploit the unlabeled relationships between physiological modalities (brain, heart, lungs).

Concrete Example: A standard supervised CNN might classify sleep stages using only labeled EEG data, missing subtle correlations between heart rate variability (ECG) and breathing patterns (Respiratory) that indicate sleep-disordered breathing, leading to lower diagnostic accuracy.

Key Novelty

Leave-One-Out Contrastive Learning for Multi-modal Sleep Signals

Instead of aligning only pairs of signals (e.g., EEG vs ECG), the model trains each modality's embedding to match the average embedding of the remaining modalities for the same clip.
This encourages each physiological signal (brain, heart, or lung) to capture information shared with the patient's overall physiological state rather than with a single partner signal.

Architecture

Schematic of the Contrastive Learning frameworks (Pairwise vs. Leave-one-out) used to train SleepFM.

Evaluation Highlights

SleepFM (logistic regression on embeddings) outperforms end-to-end supervised CNNs on sleep stage classification (AUROC 0.88 vs 0.72).
Achieves superior Sleep Disordered Breathing (SDB) detection compared to supervised CNNs (AUROC 0.85 vs 0.69).
Retrieves correct corresponding recording clips across modalities with 48% top-1 accuracy from 90,000 candidates (vs ~0.001% random chance).

Breakthrough Assessment

8/10

First comprehensive multi-modal foundation model for sleep using a massive real-world dataset (100k hours). The novel leave-one-out contrastive approach shows significant empirical gains over standard pairwise methods.

⚙️ Technical Details

Problem Definition

Setting: Self-supervised representation learning on multi-channel time-series data followed by downstream classification

Inputs: 30-second clips of Polysomnography (PSG) data comprising Brain Activity Signals (BAS), ECG, and Respiratory signals

Outputs: Learned embeddings for downstream tasks (Sleep Staging classification, Sleep Disordered Breathing detection)

Pipeline Flow

Input Processing: Segment PSG into 30s clips across 3 modalities (BAS, ECG, Respiratory)
Encoders: 1D CNNs extract embeddings for each modality
Contrastive Training: Leave-one-out or Pairwise loss aligns embeddings
Downstream: Logistic Regression on frozen embeddings for classification
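
The input-processing step above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the toy sampling rates, and the choice to drop a trailing partial clip are all assumptions made for the example.

```python
def segment_into_clips(signal, sampling_rate_hz, clip_seconds=30):
    """Split a 1D signal into non-overlapping clips; a trailing partial clip is dropped."""
    clip_len = int(sampling_rate_hz * clip_seconds)
    n_clips = len(signal) // clip_len
    return [signal[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def align_modalities(recordings, clip_seconds=30):
    """recordings maps a modality name to (samples, sampling_rate_hz).

    Returns one dict per clip index so that every modality covers the same
    30-second window, which is what the contrastive loss relies on.
    """
    clips = {
        name: segment_into_clips(samples, rate, clip_seconds)
        for name, (samples, rate) in recordings.items()
    }
    n_common = min(len(c) for c in clips.values())
    return [{name: clips[name][i] for name in clips} for i in range(n_common)]


# Toy recording: 95 seconds per modality at hypothetical (unrealistically low) rates.
recordings = {
    "BAS": ([0.0] * (95 * 4), 4),
    "ECG": ([0.0] * (95 * 8), 8),
    "Respiratory": ([0.0] * (95 * 2), 2),
}
aligned = align_modalities(recordings)
print(len(aligned), len(aligned[0]["ECG"]))  # 3 clips, 240 ECG samples per clip
```

Each modality may be sampled at a different rate, so clips are matched by time window rather than by sample count.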

System Modules

Modality Encoders

Map raw time-series signals to latent embeddings

Model or implementation: 1D CNN based on EfficientNet architecture

Contrastive Loss Mechanism

Align representations of temporally localized clips across modalities

Model or implementation: Leave-one-out Contrastive Loss (or Pairwise)

Novel Architectural Elements

Leave-one-out contrastive topology: An architecture where N modality encoders feed into a loss function that contrasts single embeddings against the average of the N-1 other embeddings.

Modeling

Base Model: 1D EfficientNet-style CNNs

Training Method: Self-supervised Contrastive Learning

Objective Functions:

Purpose: Encourage an embedding from one modality to align with the average embedding of all other modalities for the same time segment.

Formally: L_LOO = - sum( log( exp(sim(x_i, x_avg_others) / tau) / sum( exp(sim(x_i, x_neg) / tau) ) ) )
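
A small pure-Python sketch of the objective above may help make it concrete. Several details are assumptions rather than facts from the paper: sim is taken to be cosine similarity, negatives are the other clips in the same batch, tau is set to 0.1, the loss runs in one direction only, and the result is averaged rather than summed.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0  # guard against zero vectors
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leave_one_out_loss(embeddings, tau=0.1):
    """embeddings maps modality name -> list of B vectors (one per clip in the batch).

    For each held-out modality, clip k's embedding is pulled toward the average
    of the other modalities' embeddings for clip k, and pushed away from the
    averages belonging to the other clips in the batch (assumed negatives).
    """
    names = list(embeddings)
    batch = len(embeddings[names[0]])
    total, count = 0.0, 0
    for held_out in names:
        others = [m for m in names if m != held_out]
        dim = len(embeddings[others[0]][0])
        targets = []
        for k in range(batch):
            avg = [sum(embeddings[m][k][d] for m in others) / len(others) for d in range(dim)]
            targets.append(l2_normalize(avg))
        anchors = [l2_normalize(v) for v in embeddings[held_out]]
        for k in range(batch):
            logits = [dot(anchors[k], targets[j]) / tau for j in range(batch)]
            peak = max(logits)  # subtract the max for a stable log-sum-exp
            log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
            total += log_denominator - logits[k]
            count += 1
    return total / count


# Toy batch of two clips with 2-dimensional embeddings.
matched = {"BAS": [[1, 0], [0, 1]], "ECG": [[1, 0], [0, 1]], "Respiratory": [[1, 0], [0, 1]]}
mismatched = {"BAS": [[1, 0], [0, 1]], "ECG": [[1, 0], [0, 1]], "Respiratory": [[0, 1], [1, 0]]}
print(leave_one_out_loss(matched) < leave_one_out_loss(mismatched))  # True
```

With three modalities each encoder is compared against the mean of two others; a pairwise variant would instead compute one such term per modality pair.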

Training Data:

Pretrain set: 11,261 participants (10.6M clips)
Train set: 1,265 participants (1.19M clips)
Test set: 1,401 participants (1.31M clips)

Key Hyperparameters:

learning_rate: 0.001
momentum: 0.9
batch_size: 32
epochs: 20
learning_rate_decay: Factor of 10 every 5 epochs
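
The decay rule listed above can be read as a step schedule. The sketch below assumes the learning rate is divided by 10 at epochs 5, 10, and 15 (counting from 0); the exact epoch at which each drop takes effect is not stated here, and the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.001, decay_factor=10.0, step_epochs=5):
    """Learning rate for a zero-indexed epoch under step decay."""
    return base_lr / (decay_factor ** (epoch // step_epochs))


# Print the rate at the start of each 5-epoch stage of a 20-epoch run.
for epoch in range(0, 20, 5):
    print(epoch, step_decay_lr(epoch))
```

Under this reading, the final five epochs run at roughly one thousandth of the initial rate.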

Comparison to Prior Work

vs. Supervised CNN: Uses self-supervised pretraining on massive unlabeled data vs. task-specific labeled training.
vs. Pairwise CL: Aligns one modality to the consensus (average) of others, rather than just pair-by-pair alignment.
vs. ConVIRT [not cited in paper]: ConVIRT aligns image-text pairs; SleepFM aligns three continuous time-series modalities (Brain, Heart, Lung).
vs. Oord et al. (InfoNCE): Adapts the contrastive loss to a multi-view setting (3+ modalities) via averaging mechanism.

Limitations

Respiratory modality retrieval performance is lower than BAS/ECG, likely due to signal variability and body motion artifacts.
Dataset is from a single center (Stanford Sleep Clinic), potentially limiting geographic generalization.
Requires synchronized multi-modal data which may not be available in consumer sleep trackers.
Clinical dataset privacy restrictions prevent public release of the training data.

Reproducibility

Code: https://github.com/rthapa84/sleepfm-codebase

The code is public, but the dataset is private clinical data from the Stanford Sleep Clinic, so full reproduction requires access to comparable PSG data.

📊 Experiments & Results

Evaluation Setup

Pretrain on large unlabeled corpus, freeze embeddings, train Logistic Regression for downstream tasks.
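
To illustrate the frozen-embedding protocol, here is a minimal sketch of binary logistic regression trained by gradient descent on fixed feature vectors. It is not the paper's evaluation code: the embeddings are synthetic two-dimensional points, the hyperparameters are arbitrary, and the binary setup only resembles the SDB task (sleep staging would need a multi-class variant).

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean log loss; the features are never updated."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for d in range(dim):
                grad_w[d] += error * x[d] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Synthetic stand-ins for frozen encoder outputs: two well-separated clusters.
rng = random.Random(0)
features = [[rng.gauss(-1, 0.3), rng.gauss(-1, 0.3)] for _ in range(50)]
features += [[rng.gauss(1, 0.3), rng.gauss(1, 0.3)] for _ in range(50)]
labels = [0] * 50 + [1] * 50

weights, bias = fit_logistic_regression(features, labels)
correct = sum((predict_proba(weights, bias, x) >= 0.5) == bool(y) for x, y in zip(features, labels))
print(correct / len(labels))
```

Because only the linear classifier is trained, downstream accuracy reflects how linearly separable the pretrained embeddings already are.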

Benchmarks:

Sleep Stage Classification: multi-class classification (Wake, Stage 1, Stage 2, Stage 3, REM)
Sleep Disordered Breathing (SDB) Detection: binary classification
Demographic Prediction: classification (age group, gender)

Metrics:

AUROC
AUPRC
Recall@10
Median Rank
Statistical methodology: Not explicitly reported in the paper
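
The two retrieval metrics listed above can be computed from a query-by-candidate similarity matrix, as in the sketch below. The matrix values are made up, ties are resolved in the query's favor, and the function names are hypothetical. The last lines reproduce the random-chance baselines for a pool of 90,000 candidates.

```python
def retrieval_ranks(similarity):
    """similarity[i][j] scores query clip i against candidate clip j.

    The correct candidate for query i is assumed to sit at index i.
    Returns the 1-based rank of the correct candidate for every query.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank is one plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def recall_at_k(ranks, k):
    return sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks):
    ordered = sorted(ranks)
    mid = len(ordered) // 2
    return ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2


similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
ranks = retrieval_ranks(similarity)
print(ranks, recall_at_k(ranks, 1), median_rank(ranks))  # [1, 2, 1], 2/3, 1

# Chance levels for uniformly random retrieval among 90,000 candidates.
n_candidates = 90_000
chance_top1 = 1 / n_candidates        # about 0.001%
chance_recall_at_10 = 10 / n_candidates  # about 0.0001 as a fraction
```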

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
SleepFM (Leave-one-out) consistently outperforms supervised baselines and pairwise contrastive learning on sleep stage classification.
Sleep Stage Classification	Macro AUROC	0.72	0.88	+0.16
Sleep Stage Classification	Macro AUPRC	0.48	0.72	+0.24
SleepFM demonstrates superior performance in detecting Sleep Disordered Breathing compared to baselines.
SDB Detection	AUROC	0.69	0.85	+0.16
SDB Detection	AUPRC	0.61	0.77	+0.16
The model learns representations that support accurate cross-modal retrieval, far above random chance.
Cross-modal Clip Retrieval	Recall@10	0.0001	0.80	+0.7999

Main Takeaways

Leave-one-out contrastive learning significantly improves downstream task performance compared to standard pairwise contrastive learning across sleep staging and SDB detection.
A simple logistic regression on SleepFM embeddings outperforms end-to-end supervised CNNs, suggesting that the pretrained features transfer well to downstream tasks without fine-tuning.
BAS (Brain Activity Signals) embeddings are strongest for demographics and sleep staging, while Respiratory embeddings excel at SDB detection, but fusing them via contrastive learning improves overall capability.
Retrieval performance is strong (up to 48% top-1 accuracy among 90,000 candidates), indicating that the model aligns distinct physiological modalities in a shared latent space.

📚 Prerequisite Knowledge

Prerequisites

Contrastive Learning (e.g., CLIP, SimCLR)
1D Convolutional Neural Networks
Physiological signal processing (EEG, ECG)

Key Terms

PSG: Polysomnography—a comprehensive sleep study recording brain waves, oxygen level, heart rate, breathing, and eye/leg movements

BAS: Brain Activity Signals—a collective term used in this paper for EEG (brain), EOG (eye), and EMG (muscle) channels

SDB: Sleep Disordered Breathing—a group of disorders characterized by abnormalities in respiratory pattern (e.g., sleep apnea)

Leave-one-out CL: A contrastive learning strategy where one modality is contrasted against the average representation of all other modalities

AUROC: Area Under the Receiver Operating Characteristic curve—a performance metric for classification problems at various threshold settings

AUPRC: Area Under the Precision-Recall Curve—a performance metric valuable for imbalanced datasets

EfficientNet: A convolutional neural network architecture that scales depth, width, and resolution uniformly for better efficiency

Recall@10: The proportion of times the correct item is found within the top 10 retrieved results
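
As a concrete reading of the AUROC definition above, the metric equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pair counting on made-up scores; it is quadratic in the number of examples and is meant only as an illustration, not as the evaluation code used in the paper.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison; ties between a positive and a negative count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_auroc(scores_per_class, true_classes):
    """One-vs-rest AUROC averaged over classes; scores_per_class[c][i] is the score of class c for example i."""
    values = []
    for c, scores in enumerate(scores_per_class):
        binary_labels = [1 if t == c else 0 for t in true_classes]
        values.append(auroc(scores, binary_labels))
    return sum(values) / len(values)


print(auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

A macro average of this kind weights every class equally, which matters for sleep staging because stages occur with very different frequencies.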