A Practitioner's Guide to Continual Multimodal Pretraining

📝 Paper Summary

Continual Learning, Multimodal Pretraining, Foundation Models

The paper introduces FoMo-in-Flux, a benchmark for continual multimodal pretraining, demonstrating that simple finetuning coupled with model merging outperforms specialized continual learning methods under realistic compute constraints.

Core Problem

Multimodal foundation models become outdated as new domains and concepts emerge, but current continual pretraining research focuses on extreme cases (large-scale updates or tiny edits) rather than practical, minor updates.

Why it matters:

Real-world applications require adapting to specific subdomains (e.g., medical, synthetic) over a model's lifecycle without retraining from scratch
Existing benchmarks like TiC-RedCaps are too noisy and monolithic, while traditional Continual Learning benchmarks (Split-ImageNet) lack the scale and modality of foundation models
Practitioners lack guidance on how compute budgets, data stream ordering, and method choices affect the trade-off between learning new tasks and retaining zero-shot capabilities

Concrete Example: A deployed vision-language model might need to adapt sequentially to 'ruins', 'industrial areas', and 'bird species'. Naive finetuning forgets previous concepts (catastrophic forgetting), while parameter-efficient methods (LoRA) often struggle to learn the new data effectively (plasticity issues) under strict compute limits.

Key Novelty

FoMo-in-Flux Benchmark & Memory-Adjusted FLOPs (MAFs)

Constructs a controlled data stream from 63 diverse datasets (natural, synthetic, fine-grained) with high-quality captions, enabling precise study of concept ordering effects
Introduces Memory-Adjusted FLOPs (MAFs) to enforce realistic compute budgets that account for both operation count and peak memory usage, leveling the playing field between methods
Demonstrates that model merging (averaging weights of old and new models) is a superior strategy for 'minor' continual updates compared to standard continual learning or LoRA

Architecture

Figure: the FoMo-in-Flux pipeline (Pretraining → Sequential Updates → Evaluation), including the data mixing strategy and the iterative update process.

Evaluation Highlights

Model Merging (averaging finetuned and zero-shot weights) achieves the best trade-off, maintaining high Zero-Shot Retention (>68%) while significantly improving Knowledge Accumulation (>55%) compared to LoRA or full finetuning
Parameter-efficient methods like LoRA and VeRA suffer from plasticity issues, failing to learn new tasks as effectively as full finetuning under the same compute budget
Replaying previous adaptation data is more critical than replaying pretraining data; the specific choice of pretraining data pool (e.g., LAION vs CC-3M) significantly impacts zero-shot retention

Breakthrough Assessment

8/10

Provides a comprehensive, practically grounded benchmark that challenges prevailing wisdom about parameter-efficient methods in continual learning, offering a clear 'practitioner's guide' and a new metric (MAFs).

⚙️ Technical Details

Problem Definition

Setting: Continual Multimodal Pretraining (CMP) with sequentially arriving tasks under fixed compute constraints

Inputs: Sequence of image-text pairs D_t for task t, Pretraining data pool P, Buffer B

Outputs: Updated model parameters θ_t maximizing performance on D_t while retaining zero-shot capabilities on hold-out sets

Pipeline Flow

Sample Task Data D_t → Mix with Buffer B and Pretraining P
Update Model θ_{t-1} to θ_t using Method M (constrained by MAFs)
Update Buffer B with D_t → Repeat for next task
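
The three steps above can be sketched as a simple loop. This is a minimal illustration rather than the benchmark's code: `continual_update`, `update_fn`, and `samples_per_task` are hypothetical names, and picking a source uniformly at random is a simplification that stands in for the real mixing ratios.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, str]  # placeholder for an (image, caption) pair


def continual_update(
    theta,
    tasks: Sequence[List[Sample]],
    pretrain_pool: List[Sample],
    update_fn: Callable,
    samples_per_task: int = 6,
    seed: int = 0,
):
    """Run the mix -> update -> store loop over a stream of tasks."""
    rng = random.Random(seed)
    buffer: List[Sample] = []
    for task_data in tasks:
        # Step 1: mix current task data D_t with buffer B and pretraining pool P.
        # (Assumption: uniform choice over the non-empty sources.)
        sources = [s for s in (task_data, buffer, pretrain_pool) if s]
        mixed = [rng.choice(rng.choice(sources)) for _ in range(samples_per_task)]
        # Step 2: move theta_{t-1} to theta_t with the chosen method M.
        theta = update_fn(theta, mixed)
        # Step 3: add D_t to the buffer before the next task arrives.
        buffer.extend(task_data)
    return theta, buffer


# Toy usage: the "model" is just a counter of samples seen so far.
tasks = [[("img_a", "a ruin")], [("img_b", "an industrial area")]]
pool = [("img_p", "a photo")]
theta, buf = continual_update(0, tasks, pool, lambda th, batch: th + len(batch))
```

In a real run, `update_fn` would be the place where the compute budget limits how many optimizer steps are taken.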

System Modules

Data Mixer

Create training batch by sampling from current task, memory buffer, and pretraining pool

Model or implementation: N/A

Updater

Update model parameters using specified CL or PEFT method

Model or implementation: CLIP (ViT-B/16 or similar)

Novel Architectural Elements

Memory-Adjusted FLOPs (MAFs) constraint mechanism: explicitly limits training steps based on both compute and memory footprint of the chosen method

Modeling

Base Model: CLIP ViT-B/16 pretrained on LAION-2B (default)

Training Method: Continual Pretraining via various strategies (Finetuning, LoRA, Model Merging, EWC, etc.)

Objective Functions:

Purpose: Align image and text representations.

Formally: InfoNCE / Contrastive Loss
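
For readers who want the objective spelled out, the sketch below computes a symmetric InfoNCE loss over a batch of paired embeddings in plain Python. It assumes the standard CLIP formulation (image-to-text and text-to-image cross-entropy, averaged); the summary only names the loss, so details such as the symmetric averaging are assumptions. The default temperature of 0.01 is taken from the hyperparameter list, and `clip_contrastive_loss` is a hypothetical helper name.

```python
import math
from typing import List


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean of -log softmax(row_i)[i]: the i-th item should match the i-th pair."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def _normalise(v: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_contrastive_loss(image_emb, text_emb, temperature: float = 0.01) -> float:
    """Symmetric InfoNCE over L2-normalised image and text embeddings."""
    img = [_normalise(v) for v in image_emb]
    txt = [_normalise(v) for v in text_emb]
    # Cosine similarities scaled by 1 / temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt] for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings give a loss near zero, while swapping the captions gives a large loss, which is the behaviour the alignment objective is meant to reward.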

Adaptation: LoRA (rank=4), Full Finetuning, Model Merging, EWC, SI, GaLore

Trainable Parameters: Varies by method (Full: all params; LoRA: adapter weights; BitFit: biases only)
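
To make the gap between full finetuning and LoRA concrete, the count below compares trainable parameters for a single weight matrix. It is a sketch under assumptions: the 768×768 shape corresponds to one square projection at the ViT-B vision-tower width, the summary does not say which layers receive adapters, and `lora_trainable_params` is a hypothetical helper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in); W itself stays frozen."""
    return rank * (d_out + d_in)


def full_trainable_params(d_out: int, d_in: int) -> int:
    """Full finetuning updates every entry of W."""
    return d_out * d_in


lora = lora_trainable_params(768, 768, rank=4)   # rank=4 as listed under Adaptation
full = full_trainable_params(768, 768)
ratio = lora / full
```

With rank 4, the adapter trains roughly 1% of the entries of such a matrix, which is consistent with the summary's observation that LoRA trades plasticity for stability.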

Training Data:

41 Adaptation datasets (1.7M samples)
22 Hold-out datasets (Evaluation only)
Pretraining pools: LAION-400M, CC-12M, CC-3M, DataComp-Small

Key Hyperparameters:

optimizer: AdamW
batch_size: 512
learning_rate_schedule: Cosine decay with linear warmup (10%)
temperature_initialization: 0.01
mixing_ratios: λ_P=0.33, λ_D=0.34, λ_B=0.33 (default)
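
As an illustration of what the default mixing ratios mean for a batch of 512, the sketch below turns the ratios into per-source sample counts. It assumes the ratios are applied per batch, which the summary does not state; `split_batch` and the source labels are hypothetical, and rounding uses a largest-remainder rule so the counts always sum to the batch size.

```python
def split_batch(batch_size: int, ratios: dict) -> dict:
    """Allocate batch slots to sources in proportion to their mixing ratios."""
    total = sum(ratios.values())
    raw = {k: batch_size * v / total for k, v in ratios.items()}
    counts = {k: int(r) for k, r in raw.items()}  # round down first
    leftover = batch_size - sum(counts.values())
    # Hand the remaining slots to the sources with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


counts = split_batch(512, {"pretrain": 0.33, "task": 0.34, "buffer": 0.33})
# counts == {"pretrain": 169, "task": 174, "buffer": 169}
```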

Compute: Total budget fixed at 1.8e9 GFLOPs (the DataComp-Small budget), split evenly across the update steps
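
The arithmetic behind the budget can be sketched as follows. The summary does not give the MAF formula, so this block assumes that a method's per-step cost is its FLOP count scaled by a memory multiplier (peak memory relative to a reference method); that formula, the helper `steps_within_budget`, and the per-step cost of 30,000 GFLOPs are all illustrative assumptions. Only the total of 1.8e9 GFLOPs and the 20 update steps come from the summary.

```python
def steps_within_budget(
    total_gflops: float,
    num_tasks: int,
    gflops_per_step: float,
    memory_multiplier: float = 1.0,
) -> int:
    """Number of optimizer steps a method can afford per task."""
    per_task_budget = total_gflops / num_tasks
    # Assumed MAF cost of one step: FLOPs scaled by relative peak memory.
    maf_per_step = gflops_per_step * memory_multiplier
    return int(per_task_budget // maf_per_step)


# Reference method (multiplier 1.0) versus a hypothetical method using half the memory.
reference_steps = steps_within_budget(1.8e9, 20, gflops_per_step=30_000.0)
lighter_steps = steps_within_budget(1.8e9, 20, gflops_per_step=30_000.0, memory_multiplier=0.5)
```

Under this assumption, a method with a smaller memory footprint is allowed proportionally more steps, which is how the budget keeps comparisons fair across methods.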

Comparison to Prior Work

vs. TiC-DataComp: Focuses on 'minor' updates with high-quality, controlled data streams rather than massive 'major' updates on noisy web data
vs. Split-ImageNet: Uses multimodal (image-text) data and checks zero-shot retention, not just classification accuracy on disjoint classes
vs. Standard LoRA usage: Evaluates LoRA in a continual sequence under compute constraints, revealing plasticity limitations
vs. Model Merging (standard): Applies merging sequentially at each step of a continual stream, not just once at the end

Limitations

Focuses primarily on CLIP-style models; impact on generative models (diffusion/LLMs) not fully explored
Compute budget is fixed across all steps; dynamic allocation based on task difficulty is not explored
Relies on generated captions for classification datasets, which may introduce noise despite curation
Does not explore retrieval-augmented generation (RAG) approaches for the knowledge base

Reproducibility

Code: https://github.com/ExplainableML/fomo_influx

The benchmark is released in the same repository as the code. Datasets are either public or provided (Obscure Animals/Things), and the captions generated via BLIP-2/CapsFusion are provided as well. Compute budgets are explicitly defined via MAFs.

📊 Experiments & Results

Evaluation Setup

Sequential training on 20 adaptation tasks (T=20), evaluated after each step

Benchmarks:

FoMo-in-Flux Adaptation Set: Knowledge Accumulation, 41 datasets [New]
FoMo-in-Flux Hold-out Set: Zero-Shot Retention, 22 datasets [New]

Metrics:

Knowledge Accumulation (A_KA): Average accuracy/recall on adaptation datasets
Zero-Shot Retention (A_ZS): Average accuracy/recall on hold-out datasets
Memory-Adjusted FLOPs (MAFs): Compute cost metric
Statistical methodology: Not explicitly reported in the paper
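
The two headline metrics are plain averages over dataset-level scores, which the sketch below makes explicit. It assumes an unweighted mean in which every dataset counts equally; the dataset names and scores are made-up placeholders, not results from the paper, and `summarise` is a hypothetical helper.

```python
from statistics import mean


def summarise(adaptation_scores: dict, holdout_scores: dict) -> dict:
    """Average per-dataset accuracy/recall into the two benchmark metrics."""
    return {
        "knowledge_accumulation": mean(adaptation_scores.values()),
        "zero_shot_retention": mean(holdout_scores.values()),
    }


# Placeholder scores for illustration only.
adaptation = {"adapt_set_1": 0.5, "adapt_set_2": 0.7}
holdout = {"holdout_set_1": 0.6, "holdout_set_2": 0.8}
summary = summarise(adaptation, holdout)
```

Tracking both numbers after every update step is what exposes the trade-off between learning new tasks and retaining zero-shot ability.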

Main Takeaways

Model Merging is the superior strategy: Merging the finetuned model with the pre-update model (e.g., weight 0.9) retains zero-shot capabilities almost perfectly while allowing significant knowledge accumulation.
Parameter-Efficient methods (LoRA) prioritize stability over plasticity: They struggle to learn new tasks effectively compared to full finetuning when constrained by the same compute budget.
Learning-Rate Meta-Schedules: Resetting the learning rate for each new task (a meta-schedule) is crucial for long-term adaptation; a single schedule that decays across the whole stream causes learning to stall.
Stream Ordering Matters: 'Easy-to-Hard' and 'Concept Frequency' orderings affect the accumulation rate, but final performance often converges if the underlying data distribution is effectively covered.
IID-fying the stream is critical: Mixing current task data with a buffer of past tasks is essential to prevent catastrophic forgetting; mixing with pretraining data is secondary for adaptation but helps retention.
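
The learning-rate takeaway can be illustrated with a schedule that restarts at every task boundary. This is a sketch, assuming the per-task schedule is the cosine decay with 10% linear warmup listed under Key Hyperparameters and that the restart is a full reset; the paper's actual meta-schedules may differ, and `lr_at` and the peak value of 1.0 are hypothetical.

```python
import math


def lr_at(global_step: int, steps_per_task: int, peak_lr: float, warmup_frac: float = 0.1) -> float:
    """Cosine decay with linear warmup, restarted at the start of each task."""
    local = global_step % steps_per_task  # position inside the current task
    warmup = max(1, int(warmup_frac * steps_per_task))
    if local < warmup:
        return peak_lr * (local + 1) / warmup  # linear ramp up to the peak
    progress = (local - warmup) / max(1, steps_per_task - warmup)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


end_of_task_1 = lr_at(99, steps_per_task=100, peak_lr=1.0)
early_task_2 = lr_at(109, steps_per_task=100, peak_lr=1.0)
```

Without the modulo reset, the rate at step 109 would keep decaying from its step-99 value, which is the stalling behaviour the takeaway describes.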

📚 Prerequisite Knowledge

Prerequisites

Vision-Language Models (CLIP)
Continual Learning (Catastrophic Forgetting, Plasticity-Stability Trade-off)
Parameter-Efficient Finetuning (LoRA)
Model Merging

Key Terms

FoMo-in-Flux: Foundation-Models-in-Flux, the proposed benchmark containing 63 datasets for simulating realistic continual pretraining streams

MAFs: Memory-Adjusted FLOPs—a metric combining FLOP counts with peak device memory usage to budget compute resources fairly across methods

Model Merging: Technique of linearly combining the weights of a finetuned model and the original pretrained model to balance new knowledge and old capabilities
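
A linear merge of two checkpoints can be sketched as below. The weights are stored as plain lists keyed by parameter name to keep the example dependency-free. The summary does not say which model a given merge weight applies to, so treating `alpha` as the weight on the old model is an assumption, and `merge_weights` is a hypothetical helper.

```python
from typing import Dict, List


def merge_weights(
    old: Dict[str, List[float]],
    new: Dict[str, List[float]],
    alpha: float,
) -> Dict[str, List[float]]:
    """Return alpha * old + (1 - alpha) * new for every parameter."""
    if old.keys() != new.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: [alpha * o + (1.0 - alpha) * n for o, n in zip(old[name], new[name])]
        for name in old
    }


old_ckpt = {"proj.weight": [1.0, 2.0]}
new_ckpt = {"proj.weight": [3.0, 4.0]}
merged = merge_weights(old_ckpt, new_ckpt, alpha=0.5)
```

Setting `alpha` closer to 1 favours retention of the old model's behaviour, while values closer to 0 favour the newly finetuned weights.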

Plasticity: The ability of a model to learn new information from the current task

Stability: The ability of a model to retain previously learned information (prevent forgetting)

LoRA: Low-Rank Adaptation—a parameter-efficient finetuning method that injects trainable low-rank decomposition matrices into model layers

EWC: Elastic Weight Consolidation—a regularization-based continual learning method that penalizes changes to important parameters
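
The penalty that EWC adds to the training loss can be written in a few lines. This sketch uses the standard quadratic form, (λ/2) · Σ F_i (θ_i − θ*_i)², with a diagonal Fisher estimate as the importance weights; the numbers are toy values, and `ewc_penalty` is a hypothetical helper rather than the benchmark's implementation.

```python
from typing import Sequence


def ewc_penalty(
    theta: Sequence[float],
    theta_star: Sequence[float],
    fisher: Sequence[float],
    lam: float,
) -> float:
    """Quadratic penalty on drifting away from theta_star, weighted by importance."""
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )


# The first parameter moved by 1.0 and has importance 4.0; the second did not move.
penalty = ewc_penalty(theta=[1.0, 2.0], theta_star=[0.0, 2.0], fisher=[4.0, 10.0], lam=2.0)
```

Parameters with a large Fisher value are expensive to change, so the optimizer is pushed to absorb new tasks through the less important ones.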

Zero-Shot Retention: Performance of the continually updated model on a held-out set of datasets it was never trained on

Knowledge Accumulation: Performance of the model on the specific downstream tasks it has been adapted to sequentially