On the Value of Tokeniser Pretraining in Physics Foundation Models

📝 Paper Summary

Physics Foundation Models, Physics Emulation, Representation Learning

Pretraining the tokeniser (encoder-decoder) before training the dynamics model significantly improves efficiency and accuracy for physics emulation, provided the pretraining data aligns physically with the downstream task.

Core Problem

Training physics foundation models from scratch is computationally expensive because learning compact data representations (tokenisation) and complex dynamics simultaneously impedes the effectiveness of both processes.

Why it matters:

High-resolution scientific simulations produce vast data volumes that are computationally prohibitive to model directly in pixel space with transformers
Current approaches often train tokenisers and dynamics models jointly from scratch, contrasting with computer vision where pretrained tokenisers are standard
Practitioners with limited compute resources struggle to train effective emulators without efficient initialization strategies

Concrete Example: When training a rollout model for fluid dynamics (Euler equations) from scratch, the model struggles to capture low-frequency structures early on. In contrast, using a tokeniser pretrained on the same Euler data reduces error by 64% after just 10,500 steps.

Key Novelty

Systematic evaluation of Tokeniser Pretraining for Physics

Decouples the learning of spatial representations (tokeniser) from temporal dynamics (processor), allowing each to be optimized more effectively
Demonstrates that 'in-domain' pretraining (same physics) yields large gains (64% lower VRMSE after 10.5k steps), while 'out-of-domain' pretraining (different physics) offers smaller benefits
Introduces flexible spatiotemporal compression that allows runtime adjustment of token coarseness to handle different physical regimes without retraining

Architecture

Schematic of the training setup comparing 'From Scratch' vs 'Pretrained' approaches.

Evaluation Highlights

In-domain pretraining reduces spatial error (VRMSE) by 64% (0.439 → 0.158) compared to training from scratch after 10.5k steps
Freezing a pretrained tokeniser (updating only 2% of parameters) matches the performance of a fully trainable one for short horizons
For long rollouts (7-18 steps), the frozen tokeniser strategy outperforms the fully trainable approach, acting as a regularizer against error accumulation

Breakthrough Assessment

7/10

Provides the first systematic empirical evidence justifying separate tokeniser pretraining for physics models, a practice that is standard in vision but had previously been assumed rather than tested in physics. The flexible compression mechanism adds practical utility.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive prediction of physical fields on 2D grids

Inputs: Sequence of past frames x_{0:t-1}

Outputs: Predicted next frame x_hat_t

Pipeline Flow

Tokeniser Encoder (compress pixels to latents)
Projection Layer (match dimensions)
Transformer Processor (predict next latent state)
Projection Layer (match dimensions back)
Tokeniser Decoder (reconstruct pixels from latents)
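
The five stages above can be sketched as a chain of functions. This is a toy illustration of the data flow only: every function body here (average pooling, scalar rescaling, linear extrapolation) is a hypothetical stand-in, not the MAGVIT-2 tokeniser or the Walrus processor, and the names `encode`, `project`, `processor`, `decode` and `predict_next_frame` are assumptions made for this sketch.

```python
def encode(frame, factor=2):
    """Toy tokeniser encoder: average-pool a 1D field by `factor`."""
    return [sum(frame[i:i + factor]) / factor for i in range(0, len(frame), factor)]


def decode(latent, factor=2):
    """Toy tokeniser decoder: nearest-neighbour upsampling by `factor`."""
    return [value for value in latent for _ in range(factor)]


def project(latent, scale):
    """Toy projection layer: a scalar rescaling standing in for a linear map."""
    return [scale * value for value in latent]


def processor(latents):
    """Toy processor: linear extrapolation from the last two latent states."""
    prev, last = latents[-2], latents[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def predict_next_frame(context_frames, factor=2, scale=4.0):
    # 1. Tokeniser encoder: compress each context frame to latents.
    latents = [encode(frame, factor) for frame in context_frames]
    # 2. Projection in: map latents to the processor's working dimension.
    embedded = [project(z, scale) for z in latents]
    # 3. Processor: predict the next latent state.
    next_embedded = processor(embedded)
    # 4. Projection out: map back to the tokeniser's latent dimension.
    next_latent = project(next_embedded, 1.0 / scale)
    # 5. Tokeniser decoder: reconstruct the field on the original grid.
    return decode(next_latent, factor)


frames = [[0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0]]
print(predict_next_frame(frames))  # [2.0, 2.0, 4.0, 4.0]
```

Because the processor only ever sees the projected latents, the tokeniser can be pretrained and then frozen without changing the interface between the stages.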

System Modules

Tokeniser Encoder (Tokenisation)

Compresses spatiotemporal input data into latent representations

Model or implementation: Simplified MAGVIT-2 (Causal CNN)

Processor

Models global dependencies and predicts future states in latent space

Model or implementation: Walrus Transformer (Factorised spatial/temporal attention)

Tokeniser Decoder (Tokenisation)

Reconstructs physical fields from latent representations

Model or implementation: Simplified MAGVIT-2 (Causal CNN)

Novel Architectural Elements

Flexible spatiotemporal compression operations extending causal convolutions to support runtime-adjustable compression ratios

Modeling

Base Model: Walrus-based Transformer Processor + MAGVIT-2-based Tokeniser

Training Method: Two-stage training: (1) Tokeniser Autoencoding, (2) Dynamics Modeling (Processor)

Objective Functions:

Purpose: Train tokeniser to compress and reconstruct data.

Formally: MSE between reconstruction x_hat and input x.

Purpose: Train processor to predict next frame.

Formally: Mean Absolute Error (MAE) between predicted frame x_hat_t and ground truth x_t.
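
Written out, the two objectives could take the following form. The index i and the count N (the number of grid points and channels averaged over) are assumptions of this sketch. The summary does not state how the losses are normalised or whether extra terms are added.

```latex
\mathcal{L}_{\text{tok}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{x}_i - x_i \right)^2,
\qquad
\mathcal{L}_{\text{proc}} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{x}_{t,i} - x_{t,i} \right|
```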

Training Data:

The Well dataset collection
Euler multiquadrants (target domain)
Rayleigh-Bénard, Shear flow, Active matter (pretraining domains)
10-frame sequences: 9 for context/autoencoding, 10th for prediction target
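
A minimal sketch of the 9 + 1 split described in the last item, assuming each sample is an ordered list of 10 frames. The helper name `split_sequence` and the string placeholders for frames are hypothetical.

```python
def split_sequence(frames, context_len=9):
    """Split a sequence into context frames and a single prediction target."""
    if len(frames) != context_len + 1:
        raise ValueError(f"expected {context_len + 1} frames, got {len(frames)}")
    return frames[:context_len], frames[context_len]


sequence = [f"frame_{i}" for i in range(10)]
context, target = split_sequence(sequence)
print(len(context), target)  # 9 frame_9
```

In the tokeniser stage only the 9 context frames are autoencoded; in the rollout stage the same 9 frames are the input and the 10th is the prediction target.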

Key Hyperparameters:

tokeniser_batch_size: 16 (effective)
rollout_batch_size: 16 (effective)
tokeniser_steps: 168,000
rollout_steps: 29,400
processor_embedding_dim: 1088
processor_layers: 6 blocks

Compute: 8x H100 GPUs per run. Training time for 2100 steps: ~14-15 minutes (varies by config).

Comparison to Prior Work

vs. End-to-end training: This paper shows that staged training is more efficient for physics, reducing VRMSE by 64% at the same downstream training budget (10.5k steps)
vs. Vision models: Adapts the pretraining paradigm to physics, highlighting that 'domain alignment' is more critical here than in general vision

Limitations

High-frequency details remain difficult to model (NEPS near 1.0) regardless of pretraining strategy
Out-of-domain pretraining offers significantly lower benefits than in-domain, limiting the promise of a 'universal' physics tokeniser
Experiments focused on early-stage training efficiency (first 30k steps); convergence behavior at very long training schedules is less explored

Reproducibility

Code: https://github.com/PolymathicAI/the_well

Publicly available dataset (The Well). Code URL provided in paper metadata. Detailed architecture and hyperparameters in Appendix. Uses standard optimizers (AdamW, SOAP).

📊 Experiments & Results

Evaluation Setup

Autoregressive next-frame prediction on 2D physics simulations

Benchmarks:

Euler Multiquadrants (Fluid dynamics simulation)

Metrics:

VRMSE (Variance-Normalised Root Mean Squared Error)
NEPS (Normalised Error Power Spectrum)
Wall-clock training time
Statistical methodology: Validation metrics averaged over 10 subsets of 4096 examples. No explicit confidence intervals reported.

Key Results

Impact of pretraining strategies on next-frame prediction error (VRMSE) after 10,500 training steps.

Benchmark	Metric	Baseline	This Paper	Δ
Euler Multiquadrants	VRMSE	0.439	0.158	-0.281
Euler Multiquadrants	VRMSE	0.439	0.162	-0.277
Euler Multiquadrants	VRMSE	0.439	0.354	-0.085
Euler Multiquadrants	VRMSE	0.439	0.559	+0.120
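
The absolute differences in the table can be converted to relative changes against the 0.439 baseline, which is where the 64% figure comes from. The rows are not labelled with their pretraining strategy in this summary, so the sketch below only reproduces the arithmetic; the function name `relative_change` is an assumption.

```python
baseline = 0.439
reported = [0.158, 0.162, 0.354, 0.559]  # "This Paper" column, in table order


def relative_change(value, reference):
    """Signed change of `value` relative to `reference`."""
    return (value - reference) / reference


for value in reported:
    print(f"{value:.3f}: {100 * relative_change(value, baseline):+.1f}%")
# 0.158: -64.0%
# 0.162: -63.1%
# 0.354: -19.4%
# 0.559: +27.3%
```

The last row is worse than training from scratch, so the reported configurations do not all improve on the baseline.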

Experiment Figures

Evolution of validation metrics (VRMSE and NEPS at different scales) over 30k training steps.

VRMSE accumulated over rollout steps (short, medium, long horizons).

Main Takeaways

In-domain pretraining is far superior to out-of-domain pretraining, suggesting physics representations are less transferable than generic image features
Freezing the tokeniser (Mostly Frozen) acts as a regularizer, preventing error accumulation in long autoregressive rollouts (steps 7-18)
Low-frequency errors are reduced by orders of magnitude with in-domain pretraining; high-frequency errors remain challenging for all models
Training costs are similar across methods (e.g., ~15 mins for 2100 steps), meaning performance gains come effectively 'free' in terms of downstream training time

📚 Prerequisite Knowledge

Prerequisites

Transformer architectures (specifically for sequence modeling)
Autoencoders (Encoder-Decoder architectures)
Fourier analysis (Power spectrum)
Basic fluid dynamics concepts (Euler equations, turbulence)

Key Terms

Tokeniser: A neural network (usually an autoencoder) that compresses high-dimensional data into compact latent representations (tokens) for a downstream model to process

VRMSE: Variance-Normalised Root Mean Squared Error—a metric measuring reconstruction error relative to the natural variability of the target field
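
A minimal sketch of VRMSE for a single flattened field, assuming the common form sqrt(MSE / (Var(target) + eps)). The stabiliser `eps` and its value are assumptions, and the paper may average over channels or samples differently.

```python
import math


def vrmse(pred, target, eps=1e-7):
    """Root mean squared error normalised by the variance of the target field."""
    n = len(target)
    mean_target = sum(target) / n
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    variance = sum((t - mean_target) ** 2 for t in target) / n
    return math.sqrt(mse / (variance + eps))


target = [0.0, 1.0, 2.0, 3.0]
print(vrmse(target, target))     # 0.0 for a perfect prediction
print(vrmse([1.5] * 4, target))  # close to 1.0 when predicting the mean
```

Under this definition a score near 1 means the model does no better than predicting the mean of the target field, which gives a scale for reading values such as 0.439 and 0.158.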

NEPS: Normalised Error Power Spectrum—a frequency-domain metric measuring the ratio of error power to signal power at specific spatial scales (wavenumbers)
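
A sketch of the idea behind NEPS on a 1D periodic signal, assuming a plain per-wavenumber ratio of error power to target power. The paper works on 2D fields and reports NEPS at different scales, so its exact binning is not reproduced here. The naive DFT, the `eps` stabiliser and the function names are assumptions chosen to keep the example dependency-free.

```python
import cmath
import math


def power_spectrum(signal):
    """Naive O(n^2) DFT power for wavenumbers 0 .. n // 2."""
    n = len(signal)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * j / n) for j, x in enumerate(signal))
        powers.append(abs(coeff) ** 2)
    return powers


def neps(pred, target, eps=1e-12):
    """Ratio of error power to target power at each wavenumber."""
    error = [p - t for p, t in zip(pred, target)]
    return [e / (s + eps) for e, s in zip(power_spectrum(error), power_spectrum(target))]


n = 8
# Target has a low-frequency (k=1) and a high-frequency (k=3) component.
target = [math.sin(2 * math.pi * j / n) + 0.5 * math.sin(2 * math.pi * 3 * j / n) for j in range(n)]
# Hypothetical prediction that captures only the low-frequency component.
pred = [math.sin(2 * math.pi * j / n) for j in range(n)]

spectrum = neps(pred, target)
print([round(value, 3) for value in spectrum])
```

In this toy case the ratio is near 0 at wavenumber 1 and near 1 at wavenumber 3: a value near 1.0 means the prediction carries essentially none of the signal at that scale.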

Rollout: The process of generating a sequence of future predictions autoregressively, where each prediction is fed back as input for the next step
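
A minimal sketch of an autoregressive rollout with a fixed-length context window. The one-step model `toy_step` is a hypothetical stand-in (linear extrapolation on scalars) for the full encode, process, decode pipeline.

```python
def rollout(step_fn, context, num_steps):
    """Autoregressively apply `step_fn`, feeding each prediction back as input."""
    window = list(context)
    predictions = []
    for _ in range(num_steps):
        next_frame = step_fn(window)
        predictions.append(next_frame)
        # Slide the window: drop the oldest frame, append the new prediction.
        window = window[1:] + [next_frame]
    return predictions


def toy_step(window):
    """Hypothetical one-step model: linear extrapolation from the last two frames."""
    return 2 * window[-1] - window[-2]


print(rollout(toy_step, context=[0.0, 1.0, 2.0], num_steps=3))  # [3.0, 4.0, 5.0]
```

After enough steps the window contains only the model's own predictions, which is why errors can compound over long horizons.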

FSDP: Fully Sharded Data Parallel—a memory-optimization technique for distributed training that shards model parameters across GPUs

DDP: Distributed Data Parallel—a parallel training technique where each process has a model copy and gradients are synchronized

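SOAP: ShampoO with Adam in the Preconditioner's eigenbasis, an optimizer that runs Adam-style updates in the eigenbasis of a Shampoo preconditioner; used for pretraining the tokeniser in this paper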

Causal convolution: Convolution operations that only use information from past and present time steps, preserving the temporal order required for autoregressive tasks
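
A sketch of a causal 1D convolution along the time axis, using left zero-padding so that the output at step t never depends on later steps. The `stride` argument is a hypothetical illustration of choosing the temporal downsampling factor at call time; it is not claimed to be the paper's flexible-compression mechanism.

```python
def causal_conv1d(signal, kernel, stride=1):
    """Causal 1D convolution with left zero-padding.

    The output for time step t uses only signal[t - len(kernel) + 1 : t + 1];
    kernel[-1] multiplies the current step.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    outputs = []
    for t in range(0, len(signal), stride):
        window = padded[t:t + len(kernel)]
        outputs.append(sum(w * x for w, x in zip(kernel, window)))
    return outputs


kernel = [0.5, 0.5]
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel))            # [0.5, 1.5, 2.5, 3.5]
print(causal_conv1d([1.0, 2.0, 9.0, 9.0], kernel))            # [0.5, 1.5, 5.5, 9.0]
print(causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel, stride=2))  # [0.5, 2.5]
```

The first two outputs agree at steps 0 and 1 even though the inputs differ from step 2 onward, which is the causality property that autoregressive prediction relies on.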