TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting

📝 Paper Summary

Time Series Forecasting · Foundation Models · Efficient Tokenization

TimeSqueeze is a hybrid tokenizer that adaptively patches time series based on local signal complexity, enabling efficient long-context pretraining without sacrificing fine-grained detail.

Core Problem

Existing tokenization methods for time series force a trade-off: point-wise embeddings are computationally expensive for long sequences, while fixed-size patching blurs local dynamics and struggles with heterogeneous signal complexity.

Why it matters:

Long-context pretraining is essential for high-performance foundation models but is bottlenecked by quadratic attention costs.
Real-world time series have varying information density (e.g., stable periods vs. rapid fluctuations), making uniform compression suboptimal.
Inefficient tokenization limits the scalability of foundation models to thousands of timesteps, restricting their applicability in domains like finance and healthcare.

Concrete Example: A financial time series might be stable for hours (low information) but extremely volatile for minutes (high information). A fixed patch size of 16 would waste tokens on the stable region and blur the volatile region. TimeSqueeze assigns large patches to the stable part and small patches to the volatile part.

Key Novelty

Content-Aware Dynamic Patching via SSM Encoder

Uses a lightweight State Space Model (Mamba) encoder to process the full-resolution signal first, capturing fine details before any compression occurs.
Dynamically determines patch boundaries based on signal volatility (relative deviation) rather than fixed intervals, allocating more tokens to complex regions and fewer to simple ones.
Preserves absolute positional information of the original signal to maintain temporal fidelity even after variable-length compression.

Architecture

The end-to-end architecture of TimeSqueeze, illustrating the hybrid tokenization process.

Evaluation Highlights

Achieves up to 20x faster convergence during pretraining compared to point-token baselines.
Demonstrates 8x higher data efficiency in pretraining relative to equivalent point-token models.
Consistently outperforms fixed-patching and point-tokenization baselines on long-horizon forecasting benchmarks in both zero-shot and full-shot settings.

Breakthrough Assessment

8/10

Addresses a critical bottleneck in time series foundation models (fixed patching vs. point-wise cost) with a theoretically grounded, adaptive solution that yields significant efficiency gains.

⚙️ Technical Details

Problem Definition

Setting: Multivariate time series forecasting via channel independence (univariate modeling)

Inputs: Sequence of T historical data points X_{1:T}

Outputs: Predicted future values for horizon H, X_{T+1:T+H}
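As a minimal sketch of the setting above, the snippet below splits a multivariate array into independent univariate samples, each with a context of length T and a target of length H. The function name `channel_independent_windows` and the `(time, channels)` array layout are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def channel_independent_windows(series, T, H):
    """Split a (time, channels) array into one (context, target) pair per channel.

    Hypothetical helper: each channel is treated as its own univariate series.
    """
    series = np.asarray(series, dtype=float)
    if series.shape[0] < T + H:
        raise ValueError("series is shorter than T + H")
    samples = []
    for c in range(series.shape[1]):
        channel = series[:, c]
        # X_{1:T} is the context, X_{T+1:T+H} is the forecasting target.
        samples.append((channel[:T], channel[T:T + H]))
    return samples

# Toy data: 6 timesteps, 2 channels.
data = np.arange(12).reshape(6, 2)
samples = channel_independent_windows(data, T=4, H=2)
```

Each element of `samples` can then be fed to the model as if it were a separate univariate series.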

Pipeline Flow

Encoder (SSM-based, full resolution)
Dynamic Patching Module (Adaptive Downsampling)
Transformer Backbone (Time-MoE)
Unpatching Module (Upsampling)
Decoder (SSM-based, full resolution)
Forecasting Head (Multi-horizon)

System Modules

Encoder

Extract fine-grained features at full resolution using linear-complexity layers

Model or implementation: Mamba layers (SSM)

Dynamic Patching Module

Compress embeddings by selecting patch boundaries based on local signal power and deviation

Model or implementation: Relative deviation-based thresholding algorithm
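The summary does not spell out the exact boundary rule, so the following is only an illustrative sketch of relative-deviation thresholding. It assumes a new patch starts when a point deviates from the running patch mean by more than `tau` times the patch's RMS (a stand-in for "local signal power"), or when the patch reaches `max_patch_size`. The function name and the precise rule are hypothetical; the default values mirror the hyperparameters listed for the paper (tau = 0.3, max patch size = 8).

```python
import numpy as np

def dynamic_patch_boundaries(x, tau=0.3, max_patch_size=8, eps=1e-8):
    """Return the start index of each patch in a 1-D series (illustrative rule)."""
    x = np.asarray(x, dtype=float)
    starts = [0]
    for t in range(1, len(x)):
        patch = x[starts[-1]:t]
        # Assumed normaliser: RMS of the current patch as its local signal power.
        scale = np.sqrt(np.mean(patch ** 2)) + eps
        rel_dev = abs(x[t] - patch.mean()) / scale
        # Open a new patch on a large relative deviation or a full patch.
        if rel_dev > tau or len(patch) >= max_patch_size:
            starts.append(t)
    return np.array(starts)

# A stable stretch followed by a volatile stretch.
stable = np.ones(16)
volatile = np.array([1.0, 3.0, 0.5, 4.0, 0.2, 5.0, 0.1, 6.0])
series = np.concatenate([stable, volatile])

starts = dynamic_patch_boundaries(series)
sizes = np.diff(np.append(starts, len(series)))
```

Under this assumed rule the stable region is covered by two patches of the maximum size, while every point in the volatile region becomes its own patch.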

Transformer Backbone

Model global causal dependencies on the compressed sequence

Model or implementation: Time-MoE (Decoder-only Transformer with Mixture-of-Experts)

Unpatching Module

Restore sequence to original length by repeating boundary embeddings

Model or implementation: Repetition operation
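A repetition-based unpatching step can be sketched with `np.repeat`: each patch embedding is copied once per original timestep it covers. The function name `unpatch` is hypothetical, and this sketch uses the simplest alignment (a patch's embedding fills its own span); the exact alignment the paper uses to preserve causality is not specified in this summary.

```python
import numpy as np

def unpatch(patch_embeddings, patch_sizes):
    """Expand (num_patches, dim) embeddings back to (num_timesteps, dim)."""
    # Row i is repeated patch_sizes[i] times along the time axis.
    return np.repeat(patch_embeddings, patch_sizes, axis=0)

emb = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
sizes = np.array([3, 1, 2])
full = unpatch(emb, sizes)
```

The restored sequence has one row per original timestep, so it can be combined with the full-resolution encoder features in the decoder.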

Decoder

Combine backbone outputs with encoder residuals to produce final representations

Model or implementation: Mamba layers (SSM)

Forecasting Head

Generate predictions for multiple future horizons

Model or implementation: Multi-head FFN

Novel Architectural Elements

Hybrid tokenizer combining SSM-based feature extraction with dynamic, signal-driven patching.
Unpatching mechanism that repeats boundary tokens to restore resolution while preserving causality.
Integration of variable-resolution tokens with absolute position IDs into a standard Transformer backbone.
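To illustrate what absolute position IDs mean for variable-length patches, the sketch below assigns each compressed token the original-timeline index of the last point in its patch, rather than its index in the compressed sequence. Using the last point of the patch is an assumed convention chosen for illustration, and `patch_position_ids` is a hypothetical name.

```python
import numpy as np

def patch_position_ids(patch_sizes):
    """Original-timeline index of the last point in each patch (assumed convention)."""
    return np.cumsum(patch_sizes) - 1

sizes = np.array([8, 8, 1, 1, 2])
absolute_ids = patch_position_ids(sizes)
# Indices in the compressed sequence, which discard the true time gaps.
compressed_ids = np.arange(len(sizes))
```

With absolute IDs, the backbone can tell that the first two tokens are eight steps apart while the third and fourth are adjacent, information that the compressed indices alone would lose.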

Modeling

Base Model: Time-MoE (Transformer with MoE)

Training Method: Pretraining on large-scale time series corpus

Objective Functions:

Purpose: Minimize forecasting error while remaining robust to outliers.
Formally: Huber loss (L_ar) between predicted and actual values.

Purpose: Ensure balanced utilization of MoE experts.
Formally: Load-balancing auxiliary loss (L_aux).
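The two objectives can be sketched as follows. The Huber loss is the standard definition (quadratic below a threshold `delta`, linear above it). Combining the terms as a weighted sum with weight `alpha` is an assumption, and the values of `delta` and `alpha` are hypothetical placeholders rather than settings reported in the paper.

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Standard Huber loss averaged over all points."""
    err = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))

def total_loss(l_ar, l_aux, alpha=0.02):
    """Assumed combination: forecasting loss plus a weighted auxiliary term."""
    return l_ar + alpha * l_aux

pred = np.zeros(3)
target = np.array([0.5, 1.0, 3.0])
l_ar = huber_loss(pred, target)
```

For the error of 3.0 the Huber penalty is 2.5, compared with 4.5 for half the squared error, which is how the loss limits the influence of outliers.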

Training Data:

Time-300B dataset (300 billion time points)
Diverse domains: weather, transportation, finance, synthetic data

Key Hyperparameters:

batch_size: 256
max_context_length: 2048
training_steps: 100,000
patch_threshold_tau: 0.3
max_patch_size: 8
target_compression_ratio: 4
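A rough back-of-the-envelope calculation shows why the compression ratio matters for a quadratic-cost backbone. It counts only pairwise attention scores, ignores the linear-cost SSM encoder and decoder, and assumes the target compression ratio is met exactly. The helper `attention_pairs` is introduced here purely for illustration.

```python
def attention_pairs(n_tokens):
    """Number of query-key pairs in full self-attention over n_tokens."""
    return n_tokens * n_tokens

context = 2048   # max_context_length
ratio = 4        # target_compression_ratio

compressed = context // ratio
saving = attention_pairs(context) / attention_pairs(compressed)
```

With these settings the backbone sees 512 tokens instead of 2048, so the number of attention pairs drops by a factor of 16.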

Compute: Not reported in the paper

Comparison to Prior Work

vs. PatchTST: Dynamic patching vs. fixed patching; adapts to local signal complexity.
vs. Time-MoE: Uses patching for efficiency vs. point-wise encoding; significantly faster training.
vs. HDMixer/LightGTS: Performs within-sequence dynamic patching (varying sizes inside one series) vs. sequence-level adjustment.
vs. EntroPE [not cited in paper]: TimeSqueeze uses signal deviation/power in continuous space vs. entropy-based discretization.

Limitations

Depends on a tunable threshold parameter (tau) for patching sensitivity.
Requires an additional SSM encoder/decoder, adding architectural complexity compared to pure Transformers.
The benefits are most pronounced for long contexts; gains may be smaller for very short series.

Reproducibility

No code is provided. The Time-300B dataset is described as open-access. Hyperparameters for the patching threshold are given, along with model sizes (Base: 117M parameters; Large: 469M parameters).

📊 Experiments & Results

Evaluation Setup

Long-horizon forecasting on univariate and multivariate tasks

Benchmarks:

Long-horizon forecasting benchmarks (Forecasting)

Metrics:

MSE (Mean Squared Error)
MAE (Mean Absolute Error)
Statistical methodology: Not explicitly reported in the paper
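For reference, the two metrics can be computed as in the short sketch below. These are the standard definitions; the function names are chosen here for illustration.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))

def mae(pred, target):
    """Mean absolute error."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.abs(diff)))

pred = [1.0, 2.0, 3.0]
target = [1.0, 4.0, 7.0]
mse_value = mse(pred, target)
mae_value = mae(pred, target)
```

MSE penalizes large errors more heavily than MAE, so reporting both gives a fuller picture of forecast quality.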

Key Results

TimeSqueeze demonstrates superior training efficiency compared to point-token baselines.

Benchmark	Metric	Baseline	This Paper	Δ
Pretraining Convergence	Convergence Speed	1x	20x	+19x
Pretraining Data Efficiency	Data Efficiency	1x	8x	+7x

Main Takeaways

TimeSqueeze consistently outperforms architectures using point-wise tokenization or fixed-size patching across long-horizon benchmarks.
The dynamic patching strategy effectively balances information preservation and computational cost.
The method scales well, showing consistent gains in both zero-shot and full-shot settings.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Attention mechanisms)
State Space Models (SSM) / Mamba
Time series forecasting fundamentals (horizon, lookback)
Tokenization strategies (patching vs. point-wise)

Key Terms

SSM: State Space Model, a class of sequence models (Mamba is one example) whose cost scales linearly with sequence length, used here for efficient pre-encoding.

Mamba: A selective state space model architecture that processes sequences in linear time, providing efficient long-context processing.

MoE: Mixture-of-Experts, a Transformer design in which only a subset of expert sub-networks is active for each token, improving efficiency.

RoPE: Rotary Positional Embeddings, a method that encodes position by rotating query and key vectors so that attention scores depend on relative offsets, which helps generalization to varying sequence lengths.

Channel Independence: A modeling strategy where multivariate time series are decomposed into independent univariate series for processing.

Patching: Grouping consecutive time points into a single vector (token) to reduce sequence length for the Transformer.

Huber Loss: A loss function that is less sensitive to outliers than squared error: it is quadratic (L2-like) for small errors and linear (L1-like) for large ones.