Time-MMD: Multi-Domain Multimodal Dataset for Time Series Analysis

📝 Paper Summary

Multimodal Time Series Analysis, Time Series Forecasting, Dataset Construction

Time-MMD is the first multi-domain, multimodal time series dataset. It is released alongside a new forecasting library (MM-TSFlib), which is used to show that integrating textual data improves forecasting accuracy, with MSE reductions of over 15% on average.

Core Problem

Existing multimodal time series datasets are narrow (mostly financial), poorly aligned (the text is often irrelevant to the series), and contaminated (the text contains predictions or leaked future information), which prevents reliable multimodal analysis.

Why it matters:

Real-world experts (e.g., epidemiologists) use text/policies alongside numbers, but current models are largely unimodal (numerical only).
Current datasets focus largely on stock prediction, failing to capture diverse patterns like periodicity or sparsity found in other domains.
Data contamination in existing sets (e.g., text containing future predictions) leads to biased evaluations of Large Language Model (LLM) based forecasters.

Concrete Example: In epidemiology, a 'weekly influenza report' might contain a section explicitly predicting next week's outlook. If a model trains on this raw text, it cheats by seeing the answer. Time-MMD uses LLMs to disentangle facts from predictions to prevent this leakage.

Key Novelty

Diverse Domain Coverage with LLM-Curated Alignment

Expands beyond finance to 9 domains (Health, Economics, Energy, etc.) with diverse temporal patterns.
Uses an LLM-based pipeline to filter irrelevant text and crucially separate 'facts' from 'predictions' to prevent data leakage.
Introduces standardized binary timestamps (a start date and an end date on each record) to align asynchronous textual reports (e.g., monthly) with numerical data (e.g., weekly).
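
A minimal sketch of the alignment idea, not the dataset's actual loader. It assumes each text record carries a start and an end date and becomes usable only once its end date has passed. The names `TextRecord` and `texts_available_at`, the lookback rule, and the example dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical text record with a start/end ("binary") timestamp."""
    start: date
    end: date
    fact: str


def texts_available_at(step: date, records: list, lookback_days: int) -> list:
    """Return records fully published by `step` and not older than the lookback.

    A record is kept only if its end date is on or before the numerical
    time step, so text covering a later period cannot leak in.
    """
    earliest = step - timedelta(days=lookback_days)
    return [r for r in records if earliest <= r.end <= step]


# Monthly reports aligned to weekly numerical steps (made-up dates).
reports = [
    TextRecord(date(2024, 1, 1), date(2024, 1, 31), "January report"),
    TextRecord(date(2024, 2, 1), date(2024, 2, 29), "February report"),
]
weekly_steps = [date(2024, 2, 5), date(2024, 2, 26), date(2024, 3, 4)]

for step in weekly_steps:
    visible = [r.fact for r in texts_available_at(step, reports, lookback_days=60)]
    print(step.isoformat(), visible)
```

In this toy run the February report only becomes visible at the March step, because its end date falls after both February steps.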

Architecture

The Multimodal Integration Framework used in MM-TSFlib.

Evaluation Highlights

Multimodal models outperformed unimodal baselines in 95% of over 1,000 experiments.
Reduced Mean Squared Error (MSE) by over 15% on average across domains.
Up to 40% MSE reduction in domains with rich textual data, validating the quality of the aligned text.

Breakthrough Assessment

8/10

Significant infrastructure contribution. Moving multimodal time series beyond just stock prediction is a major step. The rigorous decontamination pipeline addresses a critical flaw in previous datasets.

⚙️ Technical Details

Problem Definition

Setting: Multimodal Time Series Forecasting (TSF)

Inputs: Numerical series X (length l) and Textual series S (length k)

Outputs: Forecasted numerical values Y (horizon h)
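
A sketch of how one (X, S, Y) sample could be cut from a series under this definition. It assumes, for illustration only, that the text has already been aligned to one entry per time step. The function name `make_sample` and the toy data are hypothetical.

```python
import numpy as np


def make_sample(series, texts, t, l, k, h):
    """Build one (X, S, Y) sample whose forecast origin is index t.

    series: 1-D array of numerical values, one per time step.
    texts:  list of strings, one per time step (assumed already aligned).
    X holds the last l values before t, S the last k texts before t,
    and Y the h values starting at t.
    """
    if t - max(l, k) < 0 or t + h > len(series):
        raise ValueError("window does not fit inside the series")
    X = np.asarray(series[t - l:t], dtype=float)
    S = list(texts[t - k:t])
    Y = np.asarray(series[t:t + h], dtype=float)
    return X, S, Y


series = np.arange(10.0)                      # toy values 0.0 .. 9.0
texts = [f"report {i}" for i in range(10)]    # toy text, one per step
X, S, Y = make_sample(series, texts, t=6, l=4, k=2, h=3)
print(X, S, Y)
```

Both X and S stop strictly before the forecast origin, so neither input overlaps the target window Y.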

Pipeline Flow

Data Collection (Numerical & Textual)
Text Preprocessing (LLM-based filtering/disentangling)
Alignment (Binary Timestamps)
Forecasting (MM-TSFlib Integration)

System Modules

Data Collection (Data Construction)

Gather numerical data from 9 domains and text from reports/web search

Model or implementation: Google API (Search), Manual Selection (Reports)

Text Preprocessing (Data Construction)

Clean text, remove irrelevance, and separate facts from predictions

Model or implementation: Llama-3-70B
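
A rough sketch of the fact/prediction separation step. The prompt wording, the `FACTS:` / `PREDICTIONS:` response layout, and the helper names are assumptions made for illustration; the paper's actual prompts and output format may differ. No LLM is called here, and the example response is hand-written.

```python
PROMPT_TEMPLATE = (
    "You are given a report about {domain}.\n"
    "List only statements about what has already happened under 'FACTS:'.\n"
    "List any forecasts or outlooks under 'PREDICTIONS:'.\n"
    "If the report is not relevant to {domain}, answer 'NOT AVAILABLE'.\n\n"
    "Report:\n{report}"
)


def build_prompt(domain: str, report: str) -> str:
    """Fill the hypothetical prompt template for one report."""
    return PROMPT_TEMPLATE.format(domain=domain, report=report)


def parse_response(response: str) -> dict:
    """Split a response in the hypothetical layout above into two lists."""
    sections = {"facts": [], "predictions": []}
    if response.strip().upper().startswith("NOT AVAILABLE"):
        return sections  # irrelevant text is discarded entirely
    current = None
    for line in response.splitlines():
        text = line.strip()
        if text.upper() == "FACTS:":
            current = "facts"
        elif text.upper() == "PREDICTIONS:":
            current = "predictions"
        elif text and current is not None:
            sections[current].append(text.lstrip("- ").strip())
    return sections


example_response = """FACTS:
- Influenza activity rose for the third week in a row.
PREDICTIONS:
- Activity is expected to peak next week."""

parsed = parse_response(example_response)
model_input_text = parsed["facts"]  # predictions are stored apart from forecaster input
print(model_input_text)
```

Keeping the two lists separate lets the facts be used as model input while the predictions remain available as metadata without leaking into training.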

Forecasting Model

Predict future numerical values using both modalities

Model or implementation: Various TSF models + LLMs (via MM-TSFlib)

Novel Architectural Elements

End-to-end pipeline integrating any open-source LLM with arbitrary TSF models via learnable projection layers and linear weighting
Specific preprocessing architecture using LLMs to disentangle 'facts' from 'predictions' in historical text to prevent look-ahead bias
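
A toy NumPy sketch of the first element: a learnable projection maps a frozen LLM embedding to the forecast horizon, and a linear weight blends it with the output of a unimodal TSF model. The convex-combination form, the function name `fuse`, and all sizes are assumptions for illustration, not the MM-TSFlib implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_text, horizon = 8, 3  # toy sizes; real LLM embeddings are much larger


def fuse(ts_forecast, text_embedding, W, b, alpha):
    """Blend a unimodal forecast with a text-derived forecast.

    W, b:  projection from the frozen LLM embedding to the horizon (trainable).
    alpha: scalar mixing weight (trainable in this sketch).
    """
    text_forecast = text_embedding @ W + b
    return (1.0 - alpha) * ts_forecast + alpha * text_forecast


ts_forecast = np.array([1.0, 2.0, 3.0])       # output of any TSF model
text_embedding = rng.normal(size=d_text)      # stand-in for a pooled LLM embedding
W = rng.normal(scale=0.1, size=(d_text, horizon))
b = np.zeros(horizon)

fused = fuse(ts_forecast, text_embedding, W, b, alpha=0.25)
print(fused.shape)
```

With `alpha=0` the sketch falls back to the unimodal forecast, which makes the text branch an add-on that any TSF backbone can accept.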

Modeling

Base Model: Llama-3-70B (for data construction), various LLMs (BERT, GPT-2, Llama-2/3) for forecasting

Training Method: Projection layer training (LLM frozen)

Objective Functions:

Purpose: Minimize prediction error.

Formally: Mean Squared Error (MSE) between the forecast Ŷ and the ground truth Y, averaged over the horizon h

Trainable Parameters: Projection layers and linear weighting mechanism only (LLM backbones are frozen)
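
A small sketch of what "train the projection, keep the LLM frozen" means under an MSE objective. The embedding is treated as a fixed input and only the projection matrix is updated. The helper names, the toy numbers, and the use of plain gradient descent are assumptions for illustration.

```python
import numpy as np


def mse(y_pred, y_true):
    """Mean squared error over the forecast horizon."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))


def projection_step(W, embedding, y_true, lr):
    """One gradient-descent step on the projection matrix W only.

    `embedding` plays the role of a frozen LLM output: it is an input
    here and is never updated.
    """
    residual = embedding @ W - y_true
    grad_W = np.outer(embedding, 2.0 * residual / residual.size)
    return W - lr * grad_W


embedding = np.array([1.0, 2.0])   # toy stand-in for a frozen LLM embedding
y_true = np.array([1.0, -1.0])     # toy ground truth, horizon h = 2
W = np.zeros((2, 2))               # trainable projection

loss_before = mse(embedding @ W, y_true)
W = projection_step(W, embedding, y_true, lr=0.1)
loss_after = mse(embedding @ W, y_true)
print(loss_before, loss_after)
```

The loss drops after a single step even though the embedding never changes, which is the sense in which only the lightweight layers are trained.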

Training Data:

9 Domains: Agriculture, Climate, Economy, Energy, Environment, Health, Security, Social Good, Traffic
Cutoff dates up to May 2024

Key Hyperparameters:

lookback_window: Varies by domain (decided by experts)
horizon_window: Varies by domain

Compute: Not reported in the paper

Comparison to Prior Work

vs. Stock Datasets: Time-MMD covers 9 diverse domains (not just finance) and aligns granularities beyond just daily stock prices.
vs. Existing Multimodal Datasets: Time-MMD explicitly disentangles facts/predictions to prevent contamination.
vs. Unimodal TSF: MM-TSFlib integrates exogenous text (news/events) rather than just endogenous stats [not cited in paper]

Limitations

In the Health domain, the African region has significantly fewer reports than the US, which raises a fairness concern.
Reliance on Llama-3-70B for preprocessing may introduce its own biases or hallucinations (mitigated by referencing constraints).
Computational cost of processing vast textual archives is high.
The forecasting evaluation keeps the LLM frozen; full fine-tuning might yield different results.

Reproducibility

Code: https://github.com/AdityaLab/Time-MMD

The dataset and code are publicly available at the repository linked above. The dataset includes metadata for both numerical and text data (start/end times, fact/prediction text). The library (MM-TSFlib) supports 20+ TSF algorithms and 7 LLMs.

📊 Experiments & Results

Evaluation Setup

Multimodal Time Series Forecasting across 9 domains

Benchmarks:

Time-MMD (Multimodal Time Series Forecasting) [New]

Metrics:

Mean Squared Error (MSE)
Statistical methodology: Not explicitly reported in the paper

Key Results

General performance aggregated across all experiments, showing the benefit of adding the text modality.
Benchmark	Metric	Baseline	This Paper	Δ
Time-MMD (9 domains)	Win Rate (%)	0	95	+95
Time-MMD (Average)	MSE Reduction (%)	0	20	+20
Rich Textual Domains	MSE Reduction (%)	0	40	+40
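
For reference, the two aggregate metrics in the table can be computed as sketched below. This uses the usual definitions (relative MSE reduction against the unimodal baseline, and the share of experiments where the multimodal model has lower MSE). The paper's exact aggregation is not described in this summary, so treat it as an assumption; the numbers are made up.

```python
def mse_reduction_pct(mse_unimodal, mse_multimodal):
    """Relative MSE reduction of the multimodal model, in percent."""
    if mse_unimodal <= 0:
        raise ValueError("baseline MSE must be positive")
    return 100.0 * (mse_unimodal - mse_multimodal) / mse_unimodal


def win_rate_pct(pairs):
    """Share of (unimodal, multimodal) MSE pairs where multimodal is lower."""
    wins = sum(1 for uni, multi in pairs if multi < uni)
    return 100.0 * wins / len(pairs)


# Made-up (unimodal MSE, multimodal MSE) pairs, one per experiment.
results = [(0.50, 0.40), (1.20, 0.90), (0.80, 0.85), (2.00, 1.20)]

reductions = [mse_reduction_pct(uni, multi) for uni, multi in results]
print([round(r, 1) for r in reductions])
print(win_rate_pct(results))
```

Under these definitions the baseline column is 0 by construction, since both metrics are measured relative to the unimodal model.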

Experiment Figures

Extracted fact count per month over time by domain.

Word clouds for extracted facts, predictions, and discarded text in the Health domain.

Main Takeaways

Multimodal integration consistently improves forecasting accuracy across diverse domains, not just finance.
High-profile fields (like Health) have richer text data, correlating with higher performance gains (up to 40% MSE reduction).
The rigorous text filtering and alignment pipeline is effective, as evidenced by the performance jumps over unimodal baselines.
The volume of search-derived text grows over time (reflecting internet growth), while the volume of report text stays stable, suggesting the two sources are complementary.

📚 Prerequisite Knowledge

Prerequisites

Time Series Forecasting (TSF) basics
Multimodality (integrating text and numbers)
Large Language Models (LLMs) for data processing

Key Terms

TSF: Time Series Forecasting, i.e., predicting future values based on historical data

MM-TSFlib: The multimodal time-series forecasting library introduced in this paper

Lookback window: The window of historical data used as input for the model

Horizon window: The future window of time steps the model attempts to predict

Binary timestamps: A method used in this paper to mark start and end dates for aligning text availability with numerical time steps

Endogenous text: Text derived directly from the numerical series (e.g., statistical descriptions)

Exogenous text: Auxiliary text from external sources (news, reports) providing context to the time series

Data contamination: When test data (or future information) leaks into the training set, giving the model an unfair advantage