Strategies for Pretraining Neural Operators

📝 Paper Summary

Neural Operators, Scientific Machine Learning (SciML), Pretraining Strategies

This work conducts a systematic, model-agnostic comparison of vision-inspired and physics-based pretraining strategies for neural operators, finding that data augmentation and physics-aware tasks consistently improve generalization.

Core Problem

Neural operators often struggle to generalize to unseen physics and require large datasets, yet existing pretraining studies use tailored, incompatible architectures that prevent fair comparison of pretraining methods.

Why it matters:

Current PDE pretraining is fragmented; tailored architectures make it impossible to isolate the effect of the pretraining strategy itself from the model design
Training neural operators from scratch is slow and data-hungry; effective pretraining could enable few-shot learning in engineering contexts where data is scarce

Concrete Example: A neural operator trained only on simple diffusion equations fails when tested on the Burgers equation with different coefficients. Current approaches fix this by designing a specific 'Burgers-Architecture,' whereas this paper seeks a general pretraining task (like 'Jigsaw' or 'Masking') that improves performance on *any* operator backbone.

Key Novelty

Systematic Benchmarking of Vision-Adapted Pretraining for PDEs

Adapts computer vision pretraining tasks (e.g., Jigsaw puzzles, Masked Autoencoding) to the physics domain by treating PDE solutions as spatio-temporal videos
Proposes physics-specific pretraining tasks like 'Derivative' (predicting spatial/temporal derivatives) and 'Coefficient' (regressing equation parameters) to learn dynamics
Evaluates these strategies across multiple standard backbones (FNO, UNet, Transformer) to decouple strategy effectiveness from architecture scaling

Architecture

A conceptual diagram of the proposed pretraining strategies. It illustrates the different self-supervised tasks used to train the neural operator before fine-tuning.

Evaluation Highlights

Pretraining with Jigsaw or Masked strategies improves FNO performance by ~15-20% on downstream tasks compared to training from scratch
Physics-agnostic data augmentations (Noise, Scale) consistently improve pretraining performance across all tested models and datasets
In low-data regimes (10% training data), pretrained models significantly outperform scratch models, showing strong few-shot generalization capabilities

Breakthrough Assessment

7/10

Provides a much-needed rigorous empirical comparison of pretraining methods for PDEs, moving the field away from ad-hoc architectural solutions toward generalizable learning strategies.

⚙️ Technical Details

Problem Definition

Setting: Pretraining a neural operator G_theta on a source distribution of PDE solutions, then fine-tuning on a target distribution

Inputs: Input function a(x) (e.g., initial condition or coefficients)

Outputs: Solution function u(x, t) evolving over space and time

Pipeline Flow

Input PDE Data (u)
Data Augmentation (Shift, Noise, Scale)
Pretraining Task Generation (e.g., Masking, Shuffling, Derivative calculation)
Neural Operator Backbone (FNO/UNet/Transformer)
Task-Specific Head (Decoder/Regressor)
Fine-tuning on Downstream PDE

System Modules

Data Augmentor

Apply physics-preserving or physics-agnostic transformations to increase data diversity

Model or implementation: Algorithmic transformations
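
As a rough illustration of the three augmentations named in the pipeline (Shift, Noise, Scale), the sketch below applies them to a single solution array with NumPy. The `(T, H, W)` array layout, the periodic shift via `np.roll`, the function name `augment`, and all default ranges are assumptions made for illustration; they are not the paper's settings.

```python
import numpy as np

def augment(u, rng, max_shift=8, noise_std=0.01, scale_range=(0.5, 2.0)):
    """Apply shift, scale and noise to a solution u of shape (T, H, W).

    All default ranges are illustrative assumptions.
    """
    # Shift: roll both spatial axes; this only preserves the physics
    # on periodic domains.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(u, shift=(int(dy), int(dx)), axis=(1, 2))
    # Scale: multiply the whole trajectory by one random amplitude.
    out = out * rng.uniform(*scale_range)
    # Noise: add i.i.d. Gaussian noise to every grid point.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return out

rng = np.random.default_rng(0)
u = rng.standard_normal((10, 32, 32))   # synthetic stand-in for a PDE solution
u_aug = augment(u, rng)
```

Whether a given transformation preserves the physics depends on the PDE: a spatial shift does for periodic boundaries, while additive noise generally does not, which is why the summary distinguishes physics-preserving from physics-agnostic augmentations.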

Pretraining Task Generator

Format data for the specific self-supervised objective

Model or implementation: Algorithmic sorting/masking
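
A minimal sketch of how the sorting-style samples (Binary and TimeSort) could be generated from a trajectory. The helper names, the equal-length chunking, and the choice to label a permutation by its index in `itertools.permutations` order are assumptions, not details confirmed by the paper.

```python
import itertools
import random

def make_timesort_sample(frames, n_chunks, rng):
    """Shuffle temporal chunks and return (shuffled_frames, class_label).

    The label is the index of the applied permutation in the list of all
    n_chunks! permutations. Trailing frames that do not fill a whole
    chunk are dropped.
    """
    chunk_len = len(frames) // n_chunks
    chunks = [frames[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
    perms = list(itertools.permutations(range(n_chunks)))
    label = rng.randrange(len(perms))
    shuffled = [f for idx in perms[label] for f in chunks[idx]]
    return shuffled, label

def make_binary_sample(frames, rng):
    """Return (reversed frames, 1) or (original frames, 0) with equal probability."""
    if rng.random() < 0.5:
        return list(reversed(frames)), 1
    return list(frames), 0

rng = random.Random(0)
frames = list(range(12))   # stand-in for 12 time steps of a solution
shuffled, label = make_timesort_sample(frames, n_chunks=4, rng=rng)
flipped, is_reversed = make_binary_sample(frames, rng)
```

The number of classes grows factorially with the number of chunks, so in practice the chunk count (or a fixed subset of permutations) has to stay small for the cross-entropy head to remain tractable.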

Neural Operator Backbone

Learn latent representations of physics dynamics

Model or implementation: Variable (FNO, UNet, OFormer)

Novel Architectural Elements

Adaptation of the 'Jigsaw' vision task to spatio-temporal physics data by patching PDE solutions in space and time
Introduction of 'Derivative' pretraining: a regression head predicting (u_x, u_y, u_t, etc.) from u to enforce learning of local dynamics
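
To make the spatio-temporal patching concrete, here is a hedged sketch of a Jigsaw sample generator. The patch size, the `(T, H, W)` layout, the helper names, and the random choice of the K candidate permutations are all assumptions for illustration; the paper may select its permutation set differently.

```python
import numpy as np

def to_patches(u, pt, ph, pw):
    """Split u of shape (T, H, W) into non-overlapping (pt, ph, pw) patches.

    Returns shape (n_patches, pt, ph, pw). T, H, W are assumed to be
    divisible by the patch sizes.
    """
    T, H, W = u.shape
    x = u.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(-1, pt, ph, pw)

def from_patches(patches, shape):
    """Inverse of to_patches for a target shape (T, H, W)."""
    T, H, W = shape
    _, pt, ph, pw = patches.shape
    x = patches.reshape(T // pt, H // ph, W // pw, pt, ph, pw)
    x = x.transpose(0, 3, 1, 4, 2, 5)
    return x.reshape(T, H, W)

def make_jigsaw_sample(u, perms, rng, patch=(2, 8, 8)):
    """Shuffle patches with one of K fixed permutations; the label is its index."""
    patches = to_patches(u, *patch)
    label = int(rng.integers(len(perms)))
    shuffled = from_patches(patches[perms[label]], u.shape)
    return shuffled, label

rng = np.random.default_rng(0)
u = rng.standard_normal((4, 16, 16))           # synthetic solution
n_patches = (4 // 2) * (16 // 8) * (16 // 8)   # 8 patches
perms = np.stack([rng.permutation(n_patches) for _ in range(10)])  # K = 10
shuffled, label = make_jigsaw_sample(u, perms, rng)
```

The classifier head then predicts `label` from `shuffled`, so the backbone has to pick up on how neighbouring patches fit together in both space and time.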

Modeling

Base Model: Evaluated on FNO (Fourier Neural Operator), UNet, and OFormer (Operator Transformer)

Training Method: Self-supervised pretraining followed by supervised fine-tuning

Objective Functions:

Binary
Purpose: Verify temporal order.
Formally: Binary Cross Entropy on labels indicating if video is forward or reversed.

TimeSort
Purpose: Reorder shuffled temporal frames.
Formally: Cross Entropy over N permutations of temporal chunks.

Jigsaw
Purpose: Solve spatio-temporal puzzles.
Formally: Cross Entropy over K permutations of spatio-temporal patches.

Coefficient
Purpose: Regress equation parameters.
Formally: MSE between predicted coefficients and ground truth.

Derivative
Purpose: Learn local dynamics.
Formally: MSE between predicted derivatives (u_x, u_t...) and calculated finite-difference derivatives.

Masked
Purpose: Reconstruct missing data.
Formally: MSE between reconstructed patches and masked patches.
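
The Derivative objective needs finite-difference targets computed from the solution itself. The sketch below shows one way to build them with `np.gradient`; the function names, the `(T, H, W)` layout, and the restriction to first derivatives are assumptions, since the summary does not list the exact stencil or derivative orders used.

```python
import numpy as np

def derivative_targets(u, dt, dy, dx):
    """Finite-difference targets (u_t, u_y, u_x) for u of shape (T, H, W).

    np.gradient uses central differences in the interior and one-sided
    differences at the boundaries.
    """
    u_t, u_y, u_x = np.gradient(u, dt, dy, dx)
    return np.stack([u_t, u_y, u_x], axis=0)

def mse(pred, target):
    """Mean squared error between predicted and reference derivatives."""
    return float(np.mean((pred - target) ** 2))

# Linear test field u = 2t + 3y - x, whose derivatives are known exactly.
t = np.arange(5) * 0.1
y = np.arange(6) * 0.5
x = np.arange(7) * 0.25
T, Y, X = np.meshgrid(t, y, x, indexing="ij")
u = 2 * T + 3 * Y - X
targets = derivative_targets(u, dt=0.1, dy=0.5, dx=0.25)   # shape (3, 5, 6, 7)
```

On real data the targets inherit the discretization error of the stencil, so the regression head is learning an approximation of the derivatives rather than their exact values.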

Adaptation: Full fine-tuning of the pretrained backbone on the target PDE dataset

Trainable Parameters: All parameters of the backbone model
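
One common way to realise this hand-off is to keep the pretrained backbone weights, discard the pretraining head, and attach a freshly initialised head for the downstream task. The sketch below uses plain dictionaries; the `backbone.` / `head.` naming scheme and the function `init_finetune_state` are hypothetical and not taken from the paper's code.

```python
def init_finetune_state(pretrained, new_head, backbone_prefix="backbone."):
    """Keep pretrained backbone weights and attach a freshly initialised head.

    `pretrained` and `new_head` are plain name -> parameter mappings; the
    prefix-based naming is a hypothetical convention.
    """
    state = {k: v for k, v in pretrained.items() if k.startswith(backbone_prefix)}
    state.update({f"head.{k}": v for k, v in new_head.items()})
    return state

pretrained = {
    "backbone.layer0.weight": [0.1, 0.2],
    "backbone.layer1.weight": [0.3],
    "head.jigsaw_classifier.weight": [0.9],   # pretraining head, discarded
}
new_head = {"regressor.weight": [0.0]}
state = init_finetune_state(pretrained, new_head)
trainable = sorted(state)   # full fine-tuning: every entry is updated
```

Because nothing is frozen, the pretrained weights act purely as an initialisation; the fine-tuning optimizer is free to move all of them.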

Training Data:

Pretraining: Large datasets of 2D PDEs (e.g., Burgers, Navier-Stokes)
Fine-tuning: Smaller subsets or different coefficient regimes of the target PDE

Key Hyperparameters:

learning_rate: 1e-3 (FNO), 1e-4 (OFormer)
batch_size: 32
epochs_pretrain: 50
epochs_finetune: 100
optimizer: AdamW
weight_decay: 1e-4
scheduler: CosineAnnealing
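
For reference, the listed values can be collected in a small configuration, together with the standard closed form of cosine annealing. The dictionary layout, a minimum learning rate of 0, per-epoch stepping, and an annealing period equal to the number of pretraining epochs are assumptions; the summary only names the scheduler.

```python
import math

config = {
    "learning_rate": {"FNO": 1e-3, "OFormer": 1e-4},
    "batch_size": 32,
    "epochs_pretrain": 50,
    "epochs_finetune": 100,
    "optimizer": "AdamW",
    "weight_decay": 1e-4,
    "scheduler": "CosineAnnealing",
}

def cosine_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (no warmup, no restarts)."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

# Learning rate per pretraining epoch for the FNO setting.
schedule = [
    cosine_lr(e, config["epochs_pretrain"], config["learning_rate"]["FNO"])
    for e in range(config["epochs_pretrain"] + 1)
]
```

Under these assumptions the rate starts at 1e-3, passes through half that value at the midpoint, and decays to 0 at the final epoch.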

Compute: Single NVIDIA A6000 GPU used for experiments

Comparison to Prior Work

vs. Lie Point Symmetry: This work evaluates augmentations directly on prediction/reconstruction tasks rather than just contrastive losses
vs. MP-PDE: This work focuses on model-agnostic pretraining tasks (Masking, Jigsaw) rather than designing a unified large-scale architecture
vs. Contrastive Methods (PICL): The authors find simple transfer/reconstruction baselines often outperform complex contrastive setups for regression tasks

Limitations

Study is limited to 2D PDE datasets; 3D or real-world geometry not tested
Comparison focuses on pretraining strategies, not identifying the absolute SOTA architecture
Some vision-based tasks (e.g., SpaceSort) failed completely and were omitted from final results

Reproducibility

Code: https://github.com/anthonyzhou-1/pretraining_pdes

Datasets are available at https://zenodo.org/records/13355846. Full hyperparameter configurations are provided in Appendix D.

📊 Experiments & Results

Evaluation Setup

Pretrain on a source PDE/distribution, then fine-tune on a target PDE/distribution to evaluate transfer efficiency and generalization

Benchmarks:

Navier-Stokes (NS) (Fluid dynamics prediction with varying viscosity)
Shallow Water Equations (SWE) (Geophysical fluid dynamics)
Compressible Euler (Gas dynamics)

Metrics:

Relative L2 Error (Test Error)
Training Efficiency (Convergence speed)
Statistical methodology: Not explicitly reported in the paper
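
The relative L2 error is commonly computed as the norm of the prediction error divided by the norm of the reference solution, averaged over samples. The sketch below follows that convention; the batch-first layout, the `eps` guard, and averaging per sample are assumptions, since the summary does not spell out the paper's exact normalization.

```python
import numpy as np

def relative_l2_error(pred, target, eps=1e-12):
    """Mean over the batch of ||pred - target||_2 / ||target||_2.

    pred and target have shape (batch, ...); norms are taken over all
    non-batch axes. eps guards against division by zero.
    """
    diff = (pred - target).reshape(pred.shape[0], -1)
    ref = target.reshape(target.shape[0], -1)
    num = np.linalg.norm(diff, axis=1)
    den = np.linalg.norm(ref, axis=1)
    return float(np.mean(num / (den + eps)))

target = np.ones((2, 4, 4))
pred = 1.1 * target                       # uniform 10% overshoot
err = relative_l2_error(pred, target)     # approximately 0.1
```

Normalizing by the reference norm makes errors comparable across PDEs whose solutions differ in magnitude.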

Key Results

Comparison of different pretraining strategies on the FNO backbone, fine-tuned on Navier-Stokes (Re=500). Lower L2 error is better.

Benchmark	Metric	Baseline	This Paper	Δ
Navier-Stokes (Re=500)	Relative L2 Error	0.0156	0.0134	-0.0022
Navier-Stokes (Re=500)	Relative L2 Error	0.0156	0.0142	-0.0014
Navier-Stokes	Relative L2 Error	0.0156	0.0138	-0.0018

Results demonstrating performance in low-data regimes (few-shot learning).

Benchmark	Metric	Baseline	This Paper	Δ
Shallow Water Equations (10% Data)	Relative L2 Error	0.045	0.038	-0.007
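
The absolute differences in the Δ column can also be read as relative error reductions. The short calculation below uses only the Baseline and This Paper values from the rows above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, ours):
    """Fractional error reduction relative to the baseline."""
    return (baseline - ours) / baseline

rows = [
    ("Navier-Stokes (Re=500)", 0.0156, 0.0134),
    ("Navier-Stokes (Re=500)", 0.0156, 0.0142),
    ("Navier-Stokes", 0.0156, 0.0138),
    ("Shallow Water Equations (10% Data)", 0.045, 0.038),
]
for name, baseline, ours in rows:
    print(f"{name}: {100 * relative_reduction(baseline, ours):.1f}% lower error")
```

For these four rows the reductions come out at roughly 14.1%, 9.0%, 11.5%, and 15.6% respectively.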

Experiment Figures

Bar charts comparing Relative L2 Error across different pretraining strategies (x-axis) for FNO, UNet, and OFormer models (subplots) on Navier-Stokes data.

Main Takeaways

Physics-based strategies (Derivative, Coefficient) and dense vision tasks (Masked) generally outperform simple sorting tasks (Binary, TimeSort).
Data augmentation (Scale, Noise) is a highly effective, low-cost way to improve neural operator performance, often matching more complex pretraining.
Transformer-based backbones (OFormer) benefit more from large-scale pretraining than CNN-based or FNO backbones, aligning with trends in NLP/Vision.
Pretraining is most beneficial when the downstream task has limited data or is distributionally similar to the pretraining data.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Partial Differential Equations (PDEs)
Basics of Neural Operators (FNO, DeepONet)
Self-supervised learning concepts from Computer Vision (Masked Autoencoders, Contrastive Learning)

Key Terms

Neural Operator: A deep learning model designed to learn mappings between infinite-dimensional function spaces (e.g., mapping initial conditions to PDE solutions)

FNO: Fourier Neural Operator, an architecture that uses Fast Fourier Transforms to perform global convolution operations in the frequency domain

DeepONet: Deep Operator Network, an architecture using separate branch and trunk networks to learn operator mappings

Lie Point Symmetry: Geometric transformations (like scaling or shifting) that leave the solution set of a differential equation invariant

Jigsaw Pretraining: A self-supervised task where the model must unscramble shuffled patches of the input data (adapted here for spatio-temporal PDE grids)

Masked Pretraining: A self-supervised task where random portions of the input are hidden, and the model must reconstruct the missing parts (similar to BERT or MAE)
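
A minimal sketch of masked pretraining on a PDE grid: random patches are zeroed out and the reconstruction loss is evaluated only where data was hidden. The patch size, the 50% mask ratio, zero-filling, and masking the same spatial patches at every time step are assumptions made for brevity; the paper may mask spatio-temporal patches instead.

```python
import numpy as np

def make_masked_sample(u, rng, patch=(8, 8), mask_ratio=0.5):
    """Zero out random spatial patches of u with shape (T, H, W).

    Returns (masked_input, mask), where mask is True on hidden grid
    points. H and W are assumed to be divisible by the patch size.
    """
    T, H, W = u.shape
    ph, pw = patch
    nh, nw = H // ph, W // pw
    n_masked = int(round(mask_ratio * nh * nw))
    chosen = rng.permutation(nh * nw)[:n_masked]
    patch_mask = np.zeros(nh * nw, dtype=bool)
    patch_mask[chosen] = True
    # Expand the patch-level mask to grid resolution.
    grid_mask = np.repeat(np.repeat(patch_mask.reshape(nh, nw), ph, axis=0), pw, axis=1)
    mask = np.broadcast_to(grid_mask, (T, H, W))
    return np.where(mask, 0.0, u), mask

def masked_mse(pred, target, mask):
    """MSE evaluated only on the hidden grid points."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
u = rng.standard_normal((3, 16, 16))      # synthetic solution
masked_u, mask = make_masked_sample(u, rng)
loss = masked_mse(np.zeros_like(u), u, mask)   # loss of a trivial all-zero prediction
```

Restricting the loss to masked points prevents the model from being rewarded for simply copying the visible input.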