Multiple Physics Pretraining for Physical Surrogate Models

📝 Paper Summary

Scientific Machine Learning (SciML), Spatiotemporal Surrogate Modeling, Foundation Models for Physics

MPP trains a single transformer backbone on multiple heterogeneous physical systems simultaneously, using shared embeddings and normalization to enable zero-shot prediction and efficient transfer to unseen physics.

Core Problem

Deep learning surrogates for physics are typically trained on single, specific systems, making them data-hungry and unable to transfer knowledge to new physical regimes or equations.

Why it matters:

Training surrogates from scratch is impractical for low-data settings common in simulation-driven exploration
Current methods fail to leverage the shared underlying principles (conservation laws, advection, diffusion) common across different PDEs
Existing 'foundation models' in vision/language leverage massive data, but this scale has not yet been successfully applied to nonlinear spatiotemporal physics

Concrete Example: A model trained solely on advection cannot predict diffusion, and vice-versa. To model a combined advection-diffusion system, standard approaches require training a new model from scratch, whereas MPP leverages features learned from observing advection and diffusion separately.

Key Novelty

Multiple Physics Pretraining (MPP)

Projects diverse physical fields (pressure, velocity) from different systems into a shared embedding space using 1x1 convolutions
Normalizes varying scales using Reversible Instance Normalization (RevIN) to allow a single backbone to process heterogeneous magnitudes
Uses an Axial Attention backbone to efficiently process high-dimensional spatiotemporal data by attending to time and space axes independently

Architecture

The architecture of the Multiple Physics Pretraining (MPP) transformer backbone.

Breakthrough Assessment

8/10

Proposes a viable architecture for a 'Physics Foundation Model' that handles heterogeneous inputs and scales, addressing a major bottleneck in Scientific ML. Methodologically sound, though full experimental results were not in the provided snippet.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive prediction of discretized spatiotemporal dynamical systems described by PDEs

Inputs: Sequence of uniformly spaced snapshots U_t = [u_{t-T*dt}, ..., u_t] from system S

Outputs: Next snapshot u_{t+dt}
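
The setting above can be sketched in a few lines. This is a minimal illustration, assuming a trajectory stored as a NumPy array with time as the first axis; `make_windows`, `rollout`, and `persistence_model` are hypothetical names, and the stand-in model simply repeats the latest snapshot rather than representing MPP.

```python
import numpy as np

def make_windows(trajectory, history):
    """Split a (time, ...) trajectory into (input window, next snapshot) pairs.

    Each input holds history + 1 consecutive snapshots [u_{t-T*dt}, ..., u_t];
    the target is the following snapshot u_{t+dt}.
    """
    inputs, targets = [], []
    for t in range(history, trajectory.shape[0] - 1):
        inputs.append(trajectory[t - history : t + 1])
        targets.append(trajectory[t + 1])
    return np.stack(inputs), np.stack(targets)

def rollout(model, window, n_steps):
    """Predict n_steps ahead, feeding each prediction back as input."""
    window = window.copy()
    predictions = []
    for _ in range(n_steps):
        nxt = model(window)
        predictions.append(nxt)
        # Drop the oldest snapshot and append the new prediction.
        window = np.concatenate([window[1:], nxt[None]], axis=0)
    return np.stack(predictions)

def persistence_model(window):
    """Hypothetical stand-in model: repeat the most recent snapshot."""
    return window[-1]

trajectory = np.arange(10 * 4, dtype=float).reshape(10, 4)  # 10 steps, 4 grid points
X, y = make_windows(trajectory, history=3)
preds = rollout(persistence_model, X[0], n_steps=5)
```

Training uses the `(X, y)` pairs for one-step prediction, while `rollout` shows how the same one-step model is applied repeatedly at inference time.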

Pipeline Flow

Input Sequence -> Normalization (RevIN)
Field Embedding (Projection to shared space)
Axial Transformer Backbone (Processing)
Output Projection (Decoding)
Denormalization (RevIN)

System Modules

Reversible Instance Normalization (RevIN) (Input Processing)

Standardize input fields by computing the mean and standard deviation over space-time to handle varying scales across different physical systems

Model or implementation: Statistical Normalization
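
As a rough illustration of the normalize/denormalize round trip, the sketch below standardizes each field over time and space and reuses the stored statistics on the output. It assumes a `(time, channels, space)` layout and omits any learnable affine parameters; the function names and the `eps` value are illustrative choices, not taken from the paper.

```python
import numpy as np

def revin_normalize(window, eps=1e-5):
    """Standardize each field of a (time, channels, space) window.

    Statistics are taken over time and space, separately per channel,
    and returned so the prediction can be mapped back afterwards.
    """
    mean = window.mean(axis=(0, 2), keepdims=True)
    std = window.std(axis=(0, 2), keepdims=True)
    return (window - mean) / (std + eps), (mean, std)

def revin_denormalize(prediction, stats, eps=1e-5):
    """Undo the normalization on a (channels, space) prediction."""
    mean, std = stats
    return prediction * (std[0] + eps) + mean[0]

rng = np.random.default_rng(0)
# Two fields with very different magnitudes (e.g. pressure vs. velocity).
scales = np.array([1e3, 1e-2])[None, :, None]
window = rng.normal(size=(4, 2, 16)) * scales + 5.0 * scales

normed, stats = revin_normalize(window)
restored = revin_denormalize(normed[-1], stats)
```

After normalization both fields have roughly zero mean and unit variance, so a shared backbone sees comparable magnitudes even though the raw fields differ by five orders of magnitude.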

Field Embedding (Input Processing)

Project system-specific fields (e.g., velocity, pressure) into a uniform latent dimension

Model or implementation: 1x1 Convolution (Learnable per system)
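
A 1x1 convolution is the same linear map applied independently at every grid point, which the sketch below expresses with `einsum`. The system names, field counts, and `LATENT_DIM` are hypothetical, and the weights are random rather than learned; how the paper parameterizes or shares these weights is not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8  # hypothetical shared embedding width

# Hypothetical systems with different numbers of fields.
field_counts = {"system_a": 1, "system_b": 3}

# One (latent_dim, n_fields) weight matrix and bias per system.
embed_weights = {name: rng.normal(size=(LATENT_DIM, c)) for name, c in field_counts.items()}
embed_bias = {name: np.zeros(LATENT_DIM) for name in field_counts}

def embed_fields(x, system):
    """Apply a 1x1 convolution: the same linear map at every grid point.

    x has shape (n_fields, height, width); the result has shape
    (LATENT_DIM, height, width) regardless of the source system.
    """
    w, b = embed_weights[system], embed_bias[system]
    return np.einsum("dc,chw->dhw", w, x) + b[:, None, None]

x_b = rng.normal(size=(3, 16, 16))
z_a = embed_fields(rng.normal(size=(1, 16, 16)), "system_a")
z_b = embed_fields(x_b, "system_b")
```

Both outputs have the same channel width, so the downstream backbone does not need to know how many fields the original system had.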

Axial ViT Backbone

Learn spatiotemporal dynamics via attention mechanisms

Model or implementation: Transformer with Axial Attention
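
To make the axial idea concrete, here is a single-head sketch in NumPy that attends along one axis at a time, treating all other positions as batch entries. The axis order (time, height, width), the residual placement, and the random weights are assumptions for illustration, not the paper's exact block.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_along_axis(x, axis, wq, wk, wv):
    """Single-head self-attention restricted to one axis of x.

    x has shape (T, H, W, D). The chosen axis is moved next to the
    feature axis so every other position is treated as a batch entry.
    """
    x_moved = np.moveaxis(x, axis, -2)                          # (..., L, D)
    q, k, v = x_moved @ wq, x_moved @ wk, x_moved @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])  # (..., L, L)
    out = softmax(scores) @ v
    return np.moveaxis(out, -2, axis)

def axial_attention_block(x, params):
    """Attend over time, then height, then width, with residual connections."""
    for axis in (0, 1, 2):
        x = x + attend_along_axis(x, axis, *params[axis])
    return x

rng = np.random.default_rng(0)
T, H, W, D = 4, 8, 8, 16
params = [tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)) for _ in range(3)]
x = rng.normal(size=(T, H, W, D))
y = axial_attention_block(x, params)
```

Each attention call only builds score matrices of size T x T, H x H, or W x W, so the cost grows with T·H·W·(T + H + W) rather than (T·H·W)² as in dense attention over all positions.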

Output Projection

Project latent representations back to physical field dimensions

Model or implementation: 1x1 Convolution (Transposed/Inverse)

Novel Architectural Elements

Shared embedding strategy combining RevIN with learnable 1x1 convolutions to map heterogeneous physical fields to a common latent space
Fully axial attention backbone applied to multi-physics surrogate modeling
Modified relative position encodings (RPE) to handle both periodic and non-periodic boundary conditions
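
One way to picture the boundary-aware position encoding is through the relative offsets that would index a learned bias table. The sketch below is an assumption about the general idea, not the paper's exact scheme: with periodic boundaries, offsets wrap around so that the first and last grid points count as neighbours.

```python
import numpy as np

def relative_offsets(n, periodic):
    """Signed offset j - i between grid indices, as an (n, n) integer matrix.

    With periodic=True offsets wrap around the domain, so the first and
    last grid points are treated as neighbours.
    """
    idx = np.arange(n)
    offsets = idx[None, :] - idx[:, None]
    if periodic:
        # Replace each offset by its shortest signed wrap-around equivalent.
        offsets = (offsets + n // 2) % n - n // 2
    return offsets

open_offsets = relative_offsets(8, periodic=False)
ring_offsets = relative_offsets(8, periodic=True)
```

In the non-periodic case the two ends of an 8-point grid are 7 steps apart; in the periodic case they are 1 step apart, so an attention bias indexed by these offsets treats them as adjacent.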

Modeling

Base Model: Axial Vision Transformer (AViT)

Training Method: Autoregressive Pretraining

Objective Functions:

Purpose: Minimize prediction error while balancing gradients across systems with different scales.

Formally: Normalized MSE (NMSE) = || M(U_t) - u_{t+dt} ||^2 / (Var(u_{t+dt}) + epsilon)
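
A small sketch of this loss, assuming the squared norm is averaged over grid points and the variance is taken over the target snapshot; the `eps` value is an illustrative choice. The example shows why the normalization matters: rescaling the whole system leaves the loss essentially unchanged.

```python
import numpy as np

def nmse(prediction, target, eps=1e-7):
    """Mean squared error divided by the variance of the target."""
    mse = np.mean((prediction - target) ** 2)
    return mse / (np.var(target) + eps)

x = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
target = np.sin(x)
prediction = target + 0.1 * np.cos(3.0 * x)

loss_small = nmse(prediction, target)
loss_big = nmse(1e3 * prediction, 1e3 * target)  # same system, 1000x larger units
```

With a plain MSE the second loss would be a million times larger than the first, so the large-magnitude system would dominate the gradient.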

Training Data:

Trajectories generated from Advection and Diffusion equations
100,000 trajectories per system type
Uniformly sampled coefficients: velocity v in [-3, -0.1] U [0.1, 3], diffusion delta in [10^-3, 1]
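
The coefficient ranges above could be sampled as follows. This sketch reads "uniformly sampled" as uniform on a linear scale (not log-uniform), which is an assumption; the function names are hypothetical.

```python
import numpy as np

def sample_velocity(rng, size):
    """Uniform on [-3, -0.1] U [0.1, 3]: sample a magnitude, then a sign."""
    magnitude = rng.uniform(0.1, 3.0, size=size)
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

def sample_diffusion(rng, size):
    """Uniform on [1e-3, 1]."""
    return rng.uniform(1e-3, 1.0, size=size)

rng = np.random.default_rng(0)
velocities = sample_velocity(rng, 1000)
diffusions = sample_diffusion(rng, 1000)
```

Because the two velocity intervals have equal length, drawing a magnitude and an independent sign gives a uniform distribution over their union while excluding near-zero velocities.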

Key Hyperparameters:

micro_batches: Variable (sampled per step)
loss_stabilization: epsilon added to variance
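
The interaction between micro-batch sampling, gradient accumulation, and the stabilized loss can be sketched on a toy problem. Everything here is hypothetical: two scalar "systems" that differ only in scale, a one-parameter model, a fixed number of micro-batches per step, and the assumption that each micro-batch is drawn from a single system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy datasets: two "systems" sharing the same true weight w = 2.
datasets = {}
for name, scale in {"system_a": 1.0, "system_b": 10.0}.items():
    x = rng.normal(size=256) * scale
    datasets[name] = (x, 2.0 * x)

def micro_batch_grad(w, x, y, eps=1e-7):
    """Gradient w.r.t. w of the variance-normalized squared error of w * x."""
    residual = w * x - y
    return np.mean(2.0 * residual * x) / (np.var(y) + eps)

def training_step(w, n_micro, batch_size=16, lr=0.1):
    """Accumulate gradients over n_micro micro-batches, then update once."""
    grad = 0.0
    names = list(datasets)
    for _ in range(n_micro):
        # Each micro-batch is drawn from a single, randomly chosen system.
        x, y = datasets[names[rng.integers(len(names))]]
        idx = rng.integers(len(x), size=batch_size)
        grad += micro_batch_grad(w, x[idx], y[idx])
    return w - lr * grad / n_micro

w = 0.0
for step in range(200):
    w = training_step(w, n_micro=4)
```

Which systems contribute to a given update is random, which is the source of the sampling noise noted under Limitations; the variance normalization keeps the larger-scale system from dominating the accumulated gradient.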

Compute: Not reported in the paper

Comparison to Prior Work

vs. Neural Operators: MPP trains on multiple systems simultaneously rather than a single specific PDE
vs. Video Foundation Models: MPP incorporates physics-specific inductive biases like periodic boundaries and handles continuous field values rather than discrete pixels
vs. Subramanian et al. (2023): MPP handles nonlinear systems and varying resolutions/fields, whereas Subramanian et al. focused on linear steady-state systems [cited in paper]

Limitations

Assumes physical systems can be discretized onto a grid (incompatible with some mesh-based approaches)
Periodic boundary handling requires specific position bias modifications
Gradient accumulation over randomly sampled micro-batches introduces sampling noise; this may act as implicit regularization, but its effect can vary between steps

Reproducibility

Code and model weights are stated to be open-sourced, but the URL is not included in the provided text snippet.

📊 Experiments & Results

Evaluation Setup

Pretraining on Advection and Diffusion equations; finetuning/evaluating on Advection-Diffusion
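
For reference, the standard constant-coefficient 1D forms of the three equations are written out below. This is a sketch under the assumption that the experiment uses these textbook forms, with u the transported field, v the velocity, and delta the diffusion coefficient; the paper's exact formulation may differ.

```latex
\begin{align}
  \text{Advection:} \quad
    & \frac{\partial u}{\partial t} + v \frac{\partial u}{\partial x} = 0 \\
  \text{Diffusion:} \quad
    & \frac{\partial u}{\partial t} = \delta \frac{\partial^2 u}{\partial x^2} \\
  \text{Advection-Diffusion:} \quad
    & \frac{\partial u}{\partial t} + v \frac{\partial u}{\partial x}
      = \delta \frac{\partial^2 u}{\partial x^2}
\end{align}
```

The combined equation contains exactly the transport term of the first and the smoothing term of the second, which is why features learned on the two pretraining systems are expected to transfer.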

Benchmarks:

1D Advection-Diffusion (spatiotemporal prediction; toy problem for hypothesis testing) [New]

Metrics:

Normalized Mean Squared Error (NMSE)
Statistical methodology: Not explicitly reported in the paper

Experiment Figures

Comparison of test error vs. number of training samples for a model pretrained on Advection/Diffusion vs. a model trained from scratch on Advection-Diffusion.

Main Takeaways

Qualitative finding: Pretraining on separate physical components (Advection and Diffusion independently) enables better transfer learning to the combined system (Advection-Diffusion) compared to training from scratch.
The shared embedding and normalization strategy successfully handles systems with different parameter scales (velocities and diffusion coefficients varying by orders of magnitude).
The axial attention mechanism allows scaling to higher-dimensional spatiotemporal inputs than standard dense attention.

📚 Prerequisite Knowledge

Prerequisites

Partial Differential Equations (PDEs)
Transformer architectures (Attention mechanisms)
Basic fluid dynamics (Advection, Diffusion)

Key Terms

MPP: Multiple Physics Pretraining—the proposed framework for training a single model on diverse physical systems

RevIN: Reversible Instance Normalization—a technique to normalize inputs by their mean/variance and denormalize outputs using the same statistics

Axial Attention: An attention mechanism that computes attention along specific axes (e.g., time, height, width) sequentially rather than all at once, reducing computational complexity

PDE: Partial Differential Equation, an equation describing how physical quantities change over space and time

Surrogate Model: A fast approximation model (often a neural network) used to predict system behavior instead of running a slow, exact numerical simulation

Autoregressive: A prediction setup where the model predicts the next step in a sequence and feeds that prediction back as input for the following step

Gradient Accumulation: A training technique where gradients are calculated over multiple micro-batches before updating model weights, allowing for larger effective batch sizes

Spatiotemporal: Relating to both space and time