Data-Efficient Sleep Staging with Synthetic Time Series Pretraining

📝 Paper Summary

Sleep Staging · EEG Analysis · Synthetic Data Pretraining

FPT pretrains neural networks to predict the frequency content of synthetically generated random time series, enabling accurate sleep staging on small real-world datasets without requiring large empirical pretraining data.

Core Problem

Deep learning for sleep staging typically requires large, diverse datasets to generalize across subjects, but medical data is often scarce, expensive to label, and restricted by privacy concerns.

Why it matters:

Acquiring large-scale EEG datasets is logistically difficult due to varying clinical protocols and strict ethical guidelines.
Small datasets lead to overfitting, preventing deep neural networks from learning robust features necessary for reliable medical diagnostics.
Current self-supervised methods still rely on large amounts of unlabeled empirical data, which may not always be available.

Concrete Example: When fine-tuning on only 50 samples from a single subject, standard supervised models overfit and fail to generalize. FPT (Frequency Pretraining) leverages synthetic priors to achieve significantly higher accuracy in this low-data scenario.

Key Novelty

Frequency Pretraining (FPT)

Generates synthetic time series solely by summing sine waves with random frequencies drawn from predefined bins, completely eliminating the need for real EEG data during pretraining.
Trains a neural network to predict which frequency bins were used to generate the signal, forcing the model to learn frequency-discriminative features relevant for sleep staging.
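The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' generator: the bin edges, the 100 Hz sampling rate, the activation probability `p_active`, unit amplitudes, and random phases are all assumptions made for the example, and `make_fpt_sample` is a hypothetical name.

```python
import numpy as np

def make_fpt_sample(bin_edges, n_samples, fs, rng, p_active=0.5):
    """Return (signal, labels) for one synthetic pretraining example.

    bin_edges: increasing frequencies in Hz; bin k spans
               [bin_edges[k], bin_edges[k + 1]).
    labels[k] is 1.0 if bin k contributed a sine wave, else 0.0.
    All sampling choices here are illustrative assumptions.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    n_bins = len(bin_edges) - 1
    # Decide independently for each bin whether it is active.
    labels = (rng.random(n_bins) < p_active).astype(float)
    t = np.arange(n_samples) / fs
    signal = np.zeros(n_samples)
    for k in np.flatnonzero(labels):
        # One sine wave per active bin, with a random frequency inside the bin.
        freq = rng.uniform(bin_edges[k], bin_edges[k + 1])
        phase = rng.uniform(0.0, 2.0 * np.pi)
        signal += np.sin(2.0 * np.pi * freq * t + phase)
    return signal, labels

rng = np.random.default_rng(0)
edges = np.linspace(0.3, 35.0, 21)  # 20 hypothetical frequency bins
x, y = make_fpt_sample(edges, n_samples=30 * 100, fs=100, rng=rng)
```

Here `x` plays the role of one 30-second synthetic epoch and `y` is the multi-label target the network would be trained to predict.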

Architecture

Illustration of the two-phase training process: (1) Pretraining on synthetic time series for frequency prediction, and (2) Fine-tuning on clinical sleep data for sleep staging.

Evaluation Highlights

Outperforms fully supervised baselines by 0.06–0.07 Macro-F1 in low-data regimes (50 samples) across three datasets.
Matches performance of fully supervised methods (Macro-F1 0.76–0.81) when abundant data is available, confirming synthetic priors do not hinder capacity.
Achieves comparable results to standard self-supervised methods (SimCLR, VICReg) trained on real EEG data, but without using any real data for pretraining.

Breakthrough Assessment

7/10

Strong proof-of-concept that synthetic data alone can replace large empirical datasets for pretraining in specific domains like EEG, offering a valuable tool for data-scarce medical applications.

⚙️ Technical Details

Problem Definition

Setting: Sleep stage classification from single-channel EEG/EOG time series using transfer learning from synthetic data.

Inputs: Sequence of 11 consecutive sleep epochs (30s each), centered on the target epoch i.

Outputs: Predicted sleep stage for epoch i (Wake, N1, N2, N3, REM).
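The input format described above (11 consecutive epochs centered on epoch i) can be built from a recording as follows. This is a sketch under assumptions: `context_windows` is a hypothetical helper, and dropping epochs too close to the start or end of the recording is one possible border policy; the summary does not state how the paper handles borders.

```python
import numpy as np

def context_windows(epochs, context=11):
    """Stack each epoch with its neighbours into windows of `context` epochs.

    epochs: array of shape (n_epochs, samples_per_epoch).
    Returns (windows, centers), where windows has shape
    (n_epochs - context + 1, context, samples_per_epoch) and centers[j] is
    the index i of the epoch that windows[j] is centred on.
    Epochs without a full neighbourhood are skipped (an assumption).
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so that a central epoch exists")
    epochs = np.asarray(epochs)
    half = context // 2
    centers = np.arange(half, len(epochs) - half)
    windows = np.stack([epochs[i - half : i + half + 1] for i in centers])
    return windows, centers

# Toy recording: 20 epochs of 4 samples each (real epochs are 30 s long).
recording = np.arange(20 * 4).reshape(20, 4)
windows, centers = context_windows(recording)
```

Each `windows[j]` is one model input, and the label to predict is the sleep stage of epoch `centers[j]`.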

Pipeline Flow

Input Sequence (11 epochs) -> Feature Extractor (shared) -> Feature Aggregation -> Classifier -> Sleep Stage Prediction
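To make the data flow concrete, the following shape walk-through pushes one input sequence through the pipeline. It is not the paper's architecture: the feature extractor is a stand-in (a random linear map with a tanh), aggregation by concatenation is an assumption, and the sizes `n_samples=3000` and `n_features=16` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_context, n_samples, n_features, n_stages = 11, 3000, 16, 5

# Stand-in weights; a real feature extractor would be a trained deep network.
w_feat = rng.normal(scale=0.01, size=(n_samples, n_features))
w_cls = rng.normal(scale=0.1, size=(n_context * n_features, n_stages))

def feature_extractor(epoch):
    """Map one epoch of shape (n_samples,) to features of shape (n_features,)."""
    return np.tanh(epoch @ w_feat)

def predict_stage_probs(window):
    """window: (n_context, n_samples) -> probabilities over the 5 sleep stages."""
    # The same extractor (shared weights) is applied to every epoch independently.
    feats = np.stack([feature_extractor(e) for e in window])
    # Aggregation by concatenation is an assumption made for this sketch.
    aggregated = feats.reshape(-1)
    logits = aggregated @ w_cls
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

probs = predict_stage_probs(rng.normal(size=(n_context, n_samples)))
```

The point of the sketch is the shapes: (11, 3000) inputs become (11, 16) per-epoch features, then one 176-dimensional vector, then 5 stage probabilities for the central epoch.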

System Modules

Feature Extractor

Extracts latent features from each of the 11 input epochs independently.

Model or implementation: Not explicitly detailed (likely CNN-based given context of EEG processing)

Classifier

Aggregates features from the sequence and predicts the sleep stage of the central epoch.

Model or implementation: Dense/Linear layers (implied)

Novel Architectural Elements

Pretraining task designed specifically for frequency prediction on synthetic data, transferring frequency-discriminative priors to the sleep staging task.

Modeling

Base Model: Custom Deep Neural Network (feature extractor + classifier)

Training Method: Two-phase: (1) Pretraining on synthetic data (multi-label classification), (2) Fine-tuning on real sleep data (multi-class classification).

Objective Functions:

Purpose: Pretraining task (predict which frequency bins are present in the synthetic signal).

Formally: Multi-label classification loss (likely binary cross-entropy per bin).

Purpose: Fine-tuning task (classify the sleep stage).

Formally: Cross-entropy loss (implied; the standard choice for multi-class classification).
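The two objectives differ in how targets are encoded: pretraining has one independent yes/no target per frequency bin, while fine-tuning has exactly one correct class out of five. A minimal NumPy sketch of both losses follows, assuming (as the summary itself only infers) that they are binary cross-entropy and softmax cross-entropy; the function names are hypothetical.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over frequency bins (multi-label case)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Numerically stable form: max(z, 0) - z * y + log(1 + exp(-|z|)).
    per_bin = (
        np.maximum(logits, 0.0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_bin.mean()

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example with a single correct class."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()  # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_index]

# Uninformative predictions give log(2) per bin and log(5) over five stages.
pretrain_loss = bce_with_logits([0.0, 0.0], [1.0, 0.0])
finetune_loss = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], target_index=2)
```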

Adaptation: Fine-tuning either the full network or just the classifier head.

Training Data:

Synthetic data: Sum of sine waves with random frequencies.
Real data: DODO/H, Sleep-EDFx, ISRUC datasets (single-channel EEG).

Key Hyperparameters:

pretraining_samples: Up to 10^6 synthetic samples for diversity analysis
fine_tuning_samples: Varied (50, 130, 340, 900, all)
cross_validation: 5-fold
repetitions: 3

Compute: Not reported in the paper

Comparison to Prior Work

vs. SimCLR/VICReg: FPT uses synthetic data for pretraining instead of empirical data, avoiding the need for large unlabeled datasets.
vs. Fully Supervised: FPT leverages priors from synthetic frequency prediction to improve performance in low-data regimes.

Limitations

In the pretraining task, prediction accuracy for low frequencies (below 1 Hz) is lower than for high frequencies.
Relies on the assumption that frequency features are the primary driver for sleep staging performance.
Requires a highly diverse synthetic set: performance plateaus only at approximately 10,000 synthetic samples.

Reproducibility

Code: https://github.com/NiklasGrieger/synthetic-sleep-staging

Datasets (DODO/H, Sleep-EDFx, ISRUC) are publicly available benchmarks.

📊 Experiments & Results

Evaluation Setup

Sleep stage classification (5 classes: Wake, N1, N2, N3, REM) on single-channel EEG.

Benchmarks:

DODO/H (Sleep Staging)
Sleep-EDFx (Sleep Staging)
ISRUC (Sleep Staging)

Metrics:

Macro-F1 score
Statistical methodology: Paired TOST (Two One-Sided Tests) for equivalence; Bootstrap estimation for mean differences; Wilcoxon signed-rank test.
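Because Macro-F1 averages per-class scores with equal weight, rare stages such as N1 count as much as frequent ones such as N2. A minimal sketch of the computation is shown here; `macro_f1` is a hypothetical helper, and scoring a class that never appears in either the labels or the predictions as 0 is a convention assumed for this example.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes=5):
    """Unweighted mean of per-class F1 scores."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no
        # predictions is scored 0 here (an assumed convention).
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Three-class toy example: per-class F1 is 2/3, 4/5, and 1.
score = macro_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], n_classes=3)
```

In the toy example the result is the mean of 2/3, 4/5, and 1, i.e. 37/45 (about 0.822).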

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Low-data regime (50 samples) results showing superiority of pretraining over supervised baselines.
Average across datasets	Macro-F1 gain	Not reported in the paper	Not reported in the paper	+0.06–0.07
Average across datasets	Macro-F1 gain	Not reported in the paper	Not reported in the paper	+0.10–0.17
High-data regime results demonstrating that synthetic pretraining matches fully supervised performance.
Average across datasets	Macro-F1	0.76–0.80	0.76–0.81	0.00–0.01
DODO/H (High-data)	Macro-F1	0.80	0.81	+0.01

Experiment Figures

Comparison of Macro-F1 scores for different training configurations (Fully Supervised, Fine-Tuned Feature Extractor, Fixed Feature Extractor, Untrained) in low-data and high-data regimes.

Heatmaps and plots showing the impact of subject diversity and sample volume on model performance.

Analysis of the pretraining process: loss convergence, Hamming accuracy, frequency-specific accuracy, and impact of synthetic sample diversity.

Main Takeaways

FPT (Frequency Pretraining) on synthetic data consistently outperforms fully supervised training in low-data regimes (few samples or few subjects).
Synthetic pretraining achieves parity with fully supervised methods when large datasets are available, indicating no negative transfer.
Pretraining on synthetic data is competitive with self-supervised learning (SimCLR, VICReg) on real EEG data, eliminating the need for empirical pretraining data.
Diversity of synthetic samples is crucial; performance plateaus around 10,000 synthetic samples.

📚 Prerequisite Knowledge

Prerequisites

Time series analysis (Fourier transform, frequency domains)
Deep learning for sequence modeling
Transfer learning and pretraining paradigms

Key Terms


FPT: Frequency Pretraining—the proposed method of pretraining a model to predict frequency content of synthetic signals.

EEG: Electroencephalography—a method to record electrical activity of the brain.

Macro-F1: The F1 score (harmonic mean of precision and recall) computed separately for each class and then averaged with equal weight, so that all classes count equally regardless of their frequency.

SimCLR: A simple framework for contrastive learning of visual representations—a self-supervised learning method.

VICReg: Variance-Invariance-Covariance Regularization—a self-supervised learning method for learning image representations.

Hamming metric: A metric for evaluating multi-label classification, defined as the fraction of individual labels predicted correctly out of the total number of labels.

Kaiming normal initialization: A method for initializing neural network weights to maintain variance of activations, helping deep networks converge.

TOST: Two One-Sided Tests—a statistical equivalence test used to determine if two means are practically equivalent within a margin.

EOG: Electrooculography, a technique that records eye movements by measuring the corneo-retinal standing potential between the front and back of the eye.

Epoch: In sleep scoring, a 30-second segment of physiological data assigned a single sleep stage label.
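As a concrete reading of the Hamming metric defined in the list above, the following sketch scores multi-label frequency-bin predictions. `hamming_accuracy` is a hypothetical helper, and the inputs are assumed to be already thresholded to 0/1.

```python
import numpy as np

def hamming_accuracy(pred_labels, true_labels):
    """Fraction of individual bin labels predicted correctly."""
    pred = np.asarray(pred_labels).astype(bool)
    true = np.asarray(true_labels).astype(bool)
    return float(np.mean(pred == true))

# Two samples with four frequency bins each; 6 of the 8 labels match.
true_bins = [[1, 0, 1, 0], [0, 0, 1, 1]]
pred_bins = [[1, 0, 0, 0], [0, 1, 1, 1]]
acc = hamming_accuracy(pred_bins, true_bins)
```

Unlike exact-match accuracy, a sample with one wrong bin still earns credit for the bins it got right.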