A Supervised Contrastive Learning Pretrain-Finetune Approach for Time Series

📝 Paper Summary

Time Series Forecasting, Foundation Models for Time Series, Transfer Learning

The paper proposes a supervised contrastive learning framework that pretrains a model to distinguish between different time series datasets, then uses probability estimates of dataset similarity to guide finetuning on target data.

Core Problem

Training foundation models for time series is difficult because pretraining datasets often have vastly different dynamics from target finetuning datasets, leading to poor transfer performance.

Why it matters:

Standard foundation models struggle to adapt to heterogeneous collections of time series data where dynamics vary significantly between domains (e.g., electricity vs. traffic).
Existing approaches often fail to explicitly leverage the relationships between the pretraining source domains and the specific target domain during the finetuning phase.

Concrete Example: A model pretrained on electricity data might fail to predict traffic flow accurately because the underlying temporal dynamics are different. The paper shows that without guidance, a pretrained model can have high error on Electricity data (0.247 MSE) compared to specialized models.

Key Novelty

Similarity-Guided Supervised Contrastive Finetuning

Pretrains an encoder using supervised contrastive learning where 'labels' are the dataset identities, teaching the model to distinguish features from different source datasets.
Derives a probabilistic similarity metric that estimates how likely a target sample belongs to each pretraining dataset based on representation proximity.
Finetunes the model using this probability to weight positive/negative pairs: target samples are pulled closer to representations of similar pretraining datasets and pushed away from dissimilar ones.

Architecture

Diagram of the pretraining process using supervised contrastive learning. Shows the encoder-decoder structure and how contrastive loss is applied to representations using dataset labels.

Evaluation Highlights

Finetuned model achieves 0.269 MSE on Exchange-Rate dataset (average over 4 horizons), outperforming the TimesNet baseline (0.282 average) and all other supervised methods.
On ETTh1, the finetuned model achieves 0.446 average MSE, surpassing TimesNet (0.476 average).
Pretraining enables accurate dataset identification: the model correctly identifies Exchange-Rate samples with 87.79% probability, correlating with strong downstream performance.

Breakthrough Assessment

5/10

Proposes a logical extension of contrastive learning to time series domain adaptation. Results are mixed: outperforms baselines on some datasets (Exchange, ETTh1) but underperforms on others (Traffic, Weather, ETTh2). Evaluation uses simple linear backbones.

⚙️ Technical Details

Problem Definition

Setting: Multivariate time series forecasting where a model pretrained on a collection of P datasets adapts to a new target dataset

Inputs: Historical time series window x_{t:t+I} of length I

Outputs: Future time series window x_{t+I:t+I+O} of length O
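As a minimal sketch of how such input/output pairs could be cut from a single series (the helper name `make_windows` and the toy values are illustrative and not taken from the paper):

```python
def make_windows(series, input_len, output_len):
    """Split a 1-D sequence into (history, future) pairs.

    history covers x[t : t+I] and future covers x[t+I : t+I+O].
    """
    pairs = []
    last_start = len(series) - input_len - output_len
    for t in range(last_start + 1):
        history = series[t : t + input_len]
        future = series[t + input_len : t + input_len + output_len]
        pairs.append((history, future))
    return pairs


# Toy series with I = 3 and O = 2 (values are made up for illustration).
series = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
pairs = make_windows(series, input_len=3, output_len=2)
```

A series shorter than I + O yields an empty list, since no full window fits.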

Pipeline Flow

Data Sampling (Mixing P datasets)
Pretraining (Encoder-Decoder with SupCon loss on dataset labels)
Similarity Estimation (Calculate p_i for target data)
Finetuning (Encoder-Decoder with weighted SupCon loss based on p_i)

System Modules

Encoder

Maps input time series window to a latent representation z

Model or implementation: Simple Linear Layer

Decoder

Maps latent representation z to future forecast

Model or implementation: Simple Linear Layer
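The encoder and decoder described above can be sketched as two linear maps applied to each channel separately with shared weights. This is an illustrative sketch only: the latent size, the initialization range, and the helper names (`init_linear`, `linear`, `forecast`) are assumptions, not details reported in the paper.

```python
import random


def init_linear(out_dim, in_dim, rng):
    """Return (weights, bias) for a linear layer; the init range is an assumption."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(weights, bias, x):
    """Compute W x + b for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def forecast(window, enc, dec):
    """window: I time steps, each a list of C channel values.

    Every channel is encoded and decoded on its own with the same weights,
    which is what channel independence means here.
    """
    channels = list(zip(*window))          # C sequences of length I
    outputs = []
    for ch in channels:
        z = linear(*enc, ch)               # latent representation z
        outputs.append(linear(*dec, z))    # O future values for this channel
    return [list(step) for step in zip(*outputs)]  # O time steps x C channels


rng = random.Random(0)
I, LATENT, O = 4, 3, 2                     # hypothetical sizes
enc = init_linear(LATENT, I, rng)
dec = init_linear(O, LATENT, rng)
# Channels 0 and 1 carry identical values; channel 2 differs.
window = [[1.0, 1.0, 0.5], [2.0, 2.0, 0.1], [3.0, 3.0, 0.7], [4.0, 4.0, 0.2]]
y = forecast(window, enc, dec)
```

Because the weights are shared, two channels with identical histories receive identical forecasts.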

Similarity Estimator

Calculates probability p_i that a target sample is similar to pretraining dataset i

Model or implementation: Softmax over dot products with stored pretraining representations
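One way to read "softmax over dot products with stored pretraining representations" is sketched below. The aggregation step (summing softmax mass per source dataset) and the use of the temperature tau inside the softmax are assumptions made for illustration; the summary does not spell out these details.

```python
import math


def dataset_similarity(z, stored_reps, stored_labels, num_datasets, tau=0.1):
    """Estimate p_i, the probability that target representation z
    resembles pretraining dataset i.

    stored_reps: representations kept from pretraining samples.
    stored_labels: dataset index (0 .. num_datasets-1) of each stored representation.
    """
    logits = [sum(a * b for a, b in zip(z, r)) / tau for r in stored_reps]
    top = max(logits)                                   # subtract max for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    probs = [0.0] * num_datasets
    for w, label in zip(weights, stored_labels):
        probs[label] += w / total                       # pool softmax mass per dataset
    return probs


# Toy example: two stored representations, one per hypothetical source dataset.
stored_reps = [[1.0, 0.0], [0.0, 1.0]]
stored_labels = [0, 1]
probs = dataset_similarity([1.0, 0.0], stored_reps, stored_labels, num_datasets=2)
```

Here the target representation coincides with the stored representation of dataset 0, so nearly all probability mass lands on that dataset.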

Novel Architectural Elements

Finetune Contrastive Loss (FTCon): A modified contrastive loss where positive/negative pairs are dynamically defined based on the estimated probabilistic similarity to pretraining source domains.
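The selection of positive source domains can be sketched as follows, using the rule given in the Objective Functions entry that a pretraining dataset counts as positive when its probability exceeds 1/P. The function name is hypothetical, and apart from the 87.79% Exchange-Rate figure quoted in the Evaluation Highlights, the probabilities below are made up for illustration.

```python
def positive_datasets(probs):
    """Return the pretraining datasets whose estimated similarity
    exceeds the uniform level 1/P, where P is the number of datasets."""
    threshold = 1.0 / len(probs)
    return [name for name, p in probs.items() if p > threshold]


# Hypothetical similarity estimates for one Exchange-Rate target sample.
probs = {
    "ETTh1": 0.05,
    "ETTm1": 0.04,
    "Electricity": 0.0321,
    "Exchange-Rate": 0.8779,
}
positives = positive_datasets(probs)
print(positives)  # prints ['Exchange-Rate']
```

With a perfectly uniform estimate no dataset exceeds 1/P, so the positive set is empty; how that case is handled is not described in the summary.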

Modeling

Base Model: Linear Encoder-Decoder (Channel Independent)

Training Method: Supervised Contrastive Learning (Pretraining) followed by Similarity-Guided Contrastive Finetuning

Objective Functions:

Purpose: Pretraining loss combining forecasting error and dataset discrimination.

Formally: Loss = ||x_hat - x||^2 + lambda * SupCon(z), where SupCon uses dataset IDs as labels.

Purpose: Finetuning loss combining forecasting error and alignment with similar source domains.

Formally: Loss = ||x_hat - x||^2 + lambda' * FTCon(z), where FTCon defines positive set P(z) using datasets with probability p_i > 1/P.
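The pretraining objective above can be sketched in plain Python. This follows the standard supervised contrastive formulation (each anchor is contrasted against all other samples in the batch, with same-label samples as positives); whether the paper uses exactly this variant, L2-normalized representations, or a mean rather than summed squared error is an assumption. The defaults lambda = 0.1 and tau = 0.1 are the values listed under Key Hyperparameters.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def supcon_loss(reps, labels, tau=0.1):
    """Supervised contrastive loss with dataset IDs as labels."""
    z = [l2_normalize(r) for r in reps]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue                                  # anchor has no positive pair
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / tau
                  for j in range(n) if j != i}
        top = max(logits.values())
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def pretrain_loss(pred, target, reps, labels, lam=0.1, tau=0.1):
    """Forecasting error plus lambda-weighted contrastive term."""
    return mse(pred, target) + lam * supcon_loss(reps, labels, tau)


# Toy batch: two tight clusters in representation space.
reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supcon_loss(reps, labels=[0, 0, 1, 1])     # labels match the clusters
mismatched = supcon_loss(reps, labels=[0, 1, 0, 1])  # labels cut across clusters
```

The loss is small when representations cluster by dataset label and large when they do not, which is the signal that pushes the encoder to separate source datasets. The finetuning loss has the same shape, with the positive set chosen from the similarity estimates instead of ground-truth dataset IDs.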

Training Data:

Pretraining collection: Balanced sampling from ETTh1, ETTm1, Electricity, Exchange-Rate
Finetuning: 50% of the standard training set for target datasets
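Balanced sampling by repeating smaller datasets (as noted under Reproducibility) could look like the sketch below. Matching every dataset to the size of the largest one and truncating the final repetition are assumptions; the paper's exact construction is in its appendix and is not reproduced here.

```python
def balance_by_repetition(datasets):
    """Repeat the samples of smaller datasets until every dataset
    contributes as many samples as the largest one.

    datasets: dict mapping dataset name to a non-empty list of samples.
    """
    if any(len(samples) == 0 for samples in datasets.values()):
        raise ValueError("every dataset needs at least one sample")
    target = max(len(samples) for samples in datasets.values())
    balanced = {}
    for name, samples in datasets.items():
        repeats = -(-target // len(samples))          # ceiling division
        balanced[name] = (samples * repeats)[:target]
    return balanced


# Toy stand-ins for two source datasets of different sizes.
toy = {"large": [1, 2, 3, 4, 5], "small": [10, 20]}
balanced = balance_by_repetition(toy)
```

After balancing, each source dataset is equally likely to appear in a mixed pretraining batch, so the dataset-ID labels used by the contrastive loss are not dominated by the largest source.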

Key Hyperparameters:

learning_rate: Not reported in the paper
batch_size: 512 (pretrain), 32 (test)
epochs: 10 (pretrain), 10 (finetune)
lambda: 0.1
tau: 0.1
optimizer: Adam

Compute: Not reported in the paper

Comparison to Prior Work

vs. TimesNet/DLinear: Uses pretraining on external datasets rather than training from scratch on target data.
vs. TF-C [not cited in paper]: Uses supervised labels (dataset identity) for contrastive learning rather than augmentation-based self-supervision, and explicitly guides finetuning via dataset similarity.

Limitations

Simple linear backbone may limit capacity compared to Transformers for complex dependencies.
Performance is inconsistent; degrades on Traffic and Weather datasets compared to baselines.
Relies on the assumption that target dynamics are similar to at least one pretraining dataset.
Inaccurate probability estimation (e.g., on Electricity data) leads to poor finetuning guidance.

Reproducibility

No replication artifacts are mentioned in the paper, and no code URL is provided. Pretraining dataset construction details are in the Appendix (balanced sampling by repeating smaller datasets).

📊 Experiments & Results

Evaluation Setup

Multivariate time series forecasting with lookback window I and prediction horizon O.

Benchmarks:

ETTh1, ETTm1, ETTh2, ETTm2 (Electricity Transformer Temperature forecasting)
Electricity (Electricity consumption forecasting)
Exchange-Rate (Currency exchange rate forecasting)
Traffic (Traffic occupancy forecasting)
Weather (Weather indicator forecasting)

Metrics:

MSE
MAE
Statistical methodology: Experiments repeated 3 times. No statistical significance tests reported.
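For reference, the two metrics listed above can be computed as in this short sketch (the function names and toy values are illustrative):

```python
def mse(pred, true):
    """Mean squared error: average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def mae(pred, true):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


pred = [1.0, 2.0, 4.0]
true = [1.0, 3.0, 2.0]
mse_value = mse(pred, true)   # (0 + 1 + 4) / 3
mae_value = mae(pred, true)   # (0 + 1 + 2) / 3
```

Lower is better for both; MSE penalizes large errors more heavily than MAE.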

Key Results

Benchmark	Metric	Baseline	This Paper	Δ (This Paper - Baseline; lower MSE is better)
Finetuning results showing comparison against TimesNet (SOTA) on datasets where the method performs well.
Exchange-Rate	MSE	0.282	0.269	-0.013
ETTh1	MSE	0.386	0.385	-0.001
Results on datasets where the method underperforms compared to SOTA.
Traffic	MSE	0.650	0.664	+0.014
Weather	MSE	0.196	0.194	-0.002

Experiment Figures

Sensitivity analysis of the regularization parameter lambda across eight datasets.

Main Takeaways

The proposed pretrain-finetune approach generalizes well on Exchange-Rate and ETTh1, outperforming supervised baselines.
The method correctly identifies dataset similarity (e.g., matching Exchange-Rate target data to Exchange-Rate source data with ~88% probability).
Performance drops on datasets such as Traffic and ETTh2, which were not in the pretraining set, suggesting limited transferability when the target has no close counterpart among the pretraining domains.
Simple linear backbones can be competitive when guided by contrastive pretraining signals, as on Exchange-Rate and ETTh1, but the gains do not hold uniformly across datasets.

📚 Prerequisite Knowledge

Prerequisites

Supervised Contrastive Learning
Time Series Forecasting (Encoder-Decoder architectures)
Channel Independence in time series modeling

Key Terms

Supervised Contrastive Learning: A learning paradigm where the contrastive loss pulls together embeddings of samples with the same label and pushes apart those with different labels.

Channel Independence: A modeling strategy where multivariate time series are treated as multiple independent univariate series, sharing the same model weights.

MSE: Mean Squared Error, a standard metric for regression tasks measuring the average squared difference between predicted and actual values.

MAE: Mean Absolute Error, a standard metric measuring the average absolute difference between predicted and actual values.

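TimesNet: A deep time series model that reshapes 1D series into 2D tensors according to detected periods. It is trained from scratch on each target dataset rather than pretrained, and serves as a primary baseline in this paper.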

ETTh1/ETTm1: Standard datasets for time series forecasting containing electricity transformer data (hourly and 15-minute intervals).

SupCon: Supervised Contrastive loss function.

Foundation Models: Large-scale models trained on broad data to be adapted to downstream tasks; here applied to time series.