Sravan Kumar Ankireddy, Nikita Seleznev, Nam H. Nguyen, Yulun Wu, Senthil Kumar, Furong Huang, C. Bayan Bruss
arXiv
(2026)
Pretraining, Benchmark
📝 Paper Summary
Time Series Forecasting, Foundation Models, Efficient Tokenization
TimeSqueeze is a hybrid tokenizer that adaptively patches time series based on local signal complexity, enabling efficient long-context pretraining without sacrificing fine-grained detail.
Core Problem
Existing tokenization methods for time series force a trade-off: point-wise embeddings are computationally expensive for long sequences, while fixed-size patching blurs local dynamics and struggles with heterogeneous signal complexity.
Why it matters:
Long-context pretraining is essential for high-performance foundation models but is bottlenecked by quadratic attention costs.
Real-world time series have varying information density (e.g., stable periods vs. rapid fluctuations), making uniform compression suboptimal.
Inefficient tokenization limits the scalability of foundation models to thousands of timesteps, restricting their applicability in domains like finance and healthcare.
Concrete Example: A financial time series might be stable for hours (low information) but extremely volatile for minutes (high information). A fixed patch size of 16 would spend as many tokens on the stable region as on the volatile one, wasting tokens on the former and blurring the latter. TimeSqueeze assigns large patches to the stable part and small patches to the volatile part.
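To make the quadratic bottleneck concrete, the short calculation below counts query-key pairs in full self-attention for point-wise tokens versus fixed patches of 16. The context length of 4096 timesteps is an illustrative assumption, not a figure from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in full self-attention (quadratic in tokens)."""
    return num_tokens * num_tokens


T = 4096          # hypothetical context length in timesteps
patch_size = 16   # fixed patch size from the example above

point_tokens = T                 # one token per timestep
patch_tokens = T // patch_size   # one token per fixed patch

print(attention_pairs(point_tokens))   # 16777216
print(attention_pairs(patch_tokens))   # 65536
print(attention_pairs(point_tokens) // attention_pairs(patch_tokens))  # 256
```

Compressing the sequence by a factor of k cuts the attention pair count by roughly k squared, which is why the choice of tokenizer dominates long-context cost.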
Key Novelty
Content-Aware Dynamic Patching via SSM Encoder
Uses a lightweight State Space Model (Mamba) encoder to process the full-resolution signal first, capturing fine details before any compression occurs.
Dynamically determines patch boundaries based on signal volatility (relative deviation) rather than fixed intervals, allocating more tokens to complex regions and fewer to simple ones.
Preserves absolute positional information of the original signal to maintain temporal fidelity even after variable-length compression.
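The boundary rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name `dynamic_patch_starts`, the exact definition of relative deviation (distance of the new point from the open patch's mean, divided by the patch's root-mean-square value), the `max_patch` cap, and the default `tau` are all hypothetical. The returned start indices are positions in the original signal, so they could plausibly serve as the absolute position IDs mentioned above.

```python
import math


def dynamic_patch_starts(x, tau=0.2, max_patch=16, eps=1e-8):
    """Return the start index of each patch (hypothetical rule, not the paper's)."""
    starts = [0]
    for t in range(1, len(x)):
        patch = x[starts[-1]:t]                      # points in the open patch
        mean = sum(patch) / len(patch)
        rms = math.sqrt(sum(v * v for v in patch) / len(patch))  # root signal power
        rel_dev = abs(x[t] - mean) / (rms + eps)     # relative deviation of new point
        if rel_dev > tau or len(patch) >= max_patch:
            starts.append(t)                         # close the patch, open a new one
    return starts


# Stable stretch followed by a volatile stretch.
signal = [1.0] * 64 + [3.0, -1.0] * 8
starts = dynamic_patch_starts(signal)
lengths = [b - a for a, b in zip(starts, starts[1:] + [len(signal)])]
print(lengths)  # four patches of 16 points, then sixteen single-point patches
```

On this toy signal the stable region is covered by 4 tokens and the volatile region by 16, whereas a fixed patch size of 16 would give 4 and 1 respectively.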
Architecture
The end-to-end architecture of TimeSqueeze, illustrating the hybrid tokenization process.
Evaluation Highlights
Achieves up to 20x faster convergence during pretraining compared to point-token baselines.
Demonstrates 8x higher data efficiency in pretraining relative to equivalent point-token models.
Consistently outperforms fixed-patching and point-tokenization baselines on long-horizon forecasting benchmarks in both zero-shot and full-shot settings.
Breakthrough Assessment
8/10
Addresses a critical bottleneck in time series foundation models (fixed patching vs. point-wise cost) with a theoretically grounded, adaptive solution that yields significant efficiency gains.
⚙️ Technical Details
Problem Definition
Setting: Multivariate time series forecasting via channel independence (univariate modeling)
Inputs: Sequence of T historical data points X_{1:T}
Outputs: Predicted future values for horizon H, X_{T+1:T+H}
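As a rough sketch of this setting, the snippet below splits a multivariate series into independent univariate channels and cuts each into (context, target) pairs, with the context playing the role of X_{1:T} and the target that of X_{T+1:T+H}. The helper name `univariate_windows` and the sliding-window sampling scheme are illustrative assumptions, not details from the paper.

```python
def univariate_windows(series, context_len, horizon):
    """Build (channel, context, target) samples under channel independence.

    series: list of rows, each row holding one value per channel (T x C).
    """
    num_channels = len(series[0])
    samples = []
    for c in range(num_channels):
        channel = [row[c] for row in series]          # treat channel c on its own
        last_start = len(channel) - context_len - horizon
        for start in range(last_start + 1):
            split = start + context_len
            context = channel[start:split]            # historical points
            target = channel[split:split + horizon]   # future points to predict
            samples.append((c, context, target))
    return samples


# Two channels, six timesteps.
series = [[t, 10 * t] for t in range(6)]
samples = univariate_windows(series, context_len=3, horizon=2)
print(samples[0])  # (0, [0, 1, 2], [3, 4])
```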
Pipeline Flow
Encoder (SSM-based, full resolution)
Dynamic Patching Module (Adaptive Downsampling)
Transformer Backbone (Time-MoE)
Unpatching Module (Upsampling)
Decoder (SSM-based, full resolution)
Forecasting Head (Multi-horizon)
System Modules
Encoder
Extract fine-grained features at full resolution using linear-complexity layers
Model or implementation: Mamba layers (SSM)
Dynamic Patching Module
Compress embeddings by selecting patch boundaries based on local signal power and deviation
Model or implementation: Relative deviation-based thresholding algorithm
Transformer Backbone
Model global causal dependencies on the compressed sequence
Model or implementation: Time-MoE (Decoder-only Transformer with Mixture-of-Experts)
Unpatching Module
Restore sequence to original length by repeating boundary embeddings
Model or implementation: Repetition operation
Decoder
Combine backbone outputs with encoder residuals to produce final representations
Model or implementation: Mamba layers (SSM)
Forecasting Head
Generate predictions for multiple future horizons
Model or implementation: Multi-head FFN
Novel Architectural Elements
Hybrid tokenizer combining SSM-based feature extraction with dynamic, signal-driven patching.
Unpatching mechanism that repeats boundary tokens to restore resolution while preserving causality.
Integration of variable-resolution tokens with absolute position IDs into a standard Transformer backbone.
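The repetition-based unpatching can be illustrated with a few lines of Python. This is a sketch under assumptions: the helper name `unpatch` is hypothetical, embeddings are plain lists, and the exact alignment the paper uses to preserve causality is not specified in this summary, so none is modeled here.

```python
def unpatch(patch_embeddings, patch_lengths):
    """Repeat each patch embedding once per original timestep it covers."""
    restored = []
    for emb, length in zip(patch_embeddings, patch_lengths):
        restored.extend([emb] * length)
    return restored


# Two patches covering 3 and 1 timesteps respectively.
patch_embeddings = [[0.1, 0.2], [0.9, 0.8]]
patch_lengths = [3, 1]
full_resolution = unpatch(patch_embeddings, patch_lengths)
print(len(full_resolution))  # 4
```

The restored sequence has one embedding per original timestep, which is what lets a full-resolution decoder combine it with the encoder's residual features.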
Modeling
Base Model: Time-MoE (Transformer with MoE)
Training Method: Pretraining on large-scale time series corpus
Objective Functions:
Purpose: Minimize forecasting error while being robust to outliers.
Formally: Huber Loss (L_ar) between predicted and actual values.
Purpose: Ensure balanced utilization of MoE experts.
Formally: Load balancing auxiliary loss (L_aux).
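The two objectives can be sketched as below. The Huber loss follows its standard definition; the load-balancing term uses one common formulation (number of experts times the sum over experts of routed-token fraction times mean router probability), which is an assumption here since this summary does not give the exact form. The weight `alpha` and all function names are hypothetical.

```python
def huber_loss(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, y in zip(pred, target):
        err = abs(p - y)
        if err <= delta:
            total += 0.5 * err * err
        else:
            total += delta * (err - 0.5 * delta)
    return total / len(pred)


def load_balance_loss(router_probs, assignments, num_experts):
    """Assumed auxiliary loss: num_experts * sum_i(fraction_i * mean_prob_i)."""
    n = len(assignments)
    fraction = [assignments.count(e) / n for e in range(num_experts)]
    mean_prob = [sum(p[e] for p in router_probs) / n for e in range(num_experts)]
    return num_experts * sum(f * m for f, m in zip(fraction, mean_prob))


l_ar = huber_loss([0.0, 0.0], [0.5, 3.0])                          # 1.3125
l_aux = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [0, 1], 2)     # 1.0
alpha = 0.01                                                       # hypothetical weight
print(l_ar + alpha * l_aux)
```

With perfectly balanced routing the auxiliary term equals 1.0 in this formulation, and it grows as tokens and router probability concentrate on fewer experts.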
Training Data:
Time-300B dataset (300 billion time points)
Diverse domains: weather, transportation, finance, synthetic data
vs. EntroPE [not cited in paper]: TimeSqueeze sets patch boundaries from signal deviation/power in continuous space, whereas EntroPE relies on entropy-based discretization.
Limitations
Depends on a tunable threshold parameter (tau) for patching sensitivity.
Requires an additional SSM encoder/decoder, adding architectural complexity compared to pure Transformers.
The benefits are most pronounced for long contexts; gains may be smaller for very short series.
Reproducibility
No code release is mentioned. The Time-300B dataset is described as open-access. The paper reports hyperparameters for the patching threshold and the model sizes (Base: 117M params, Large: 469M params).
📊 Experiments & Results
Evaluation Setup
Long-horizon forecasting on univariate and multivariate tasks
Benchmarks:
Long-horizon forecasting benchmarks (Forecasting)
Metrics:
MSE (Mean Squared Error)
MAE (Mean Absolute Error)
Statistical methodology: Not explicitly reported in the paper
Key Results
TimeSqueeze demonstrates superior training efficiency compared to point-token baselines.
Benchmark | Metric | Baseline | This Paper | Δ
Pretraining Convergence | Convergence Speed | 1x | 20x | +19x
Pretraining Data Efficiency | Data Efficiency | 1x | 8x | +7x
Main Takeaways
TimeSqueeze consistently outperforms architectures using point-wise tokenization or fixed-size patching across long-horizon benchmarks.
The dynamic patching strategy effectively balances information preservation and computational cost.
The method scales well, showing consistent gains in both zero-shot and full-shot settings.
📚 Prerequisite Knowledge
Prerequisites
Transformer architecture (Attention mechanisms)
State Space Models (SSM) / Mamba
Time series forecasting fundamentals (horizon, lookback)
Tokenization strategies (patching vs. point-wise)
Key Terms
SSM: State Space Model—a class of sequence models like Mamba that scale linearly with sequence length, used here for efficient pre-encoding.
Mamba: A selective SSM architecture that processes sequences recurrently in linear time, providing efficient long-context processing.
MoE: Mixture-of-Experts—a Transformer architecture where only a subset of parameters (experts) are active for each token, improving efficiency.
RoPE: Rotary Positional Embeddings—a method for encoding position information that generalizes better to varying sequence lengths.
Channel Independence: A modeling strategy where multivariate time series are decomposed into independent univariate series for processing.
Patching: Grouping consecutive time points into a single vector (token) to reduce sequence length for the Transformer.
Huber Loss: A loss function that is less sensitive to outliers than squared error, combining L1 and L2 penalties.