On the Comparison between Multi-modal and Single-modal Contrastive Learning

📝 Paper Summary

Contrastive Learning, Theoretical Deep Learning, Feature Learning

Multi-modal contrastive learning generalizes better than single-modal learning because high-quality signals in one modality help suppress noise memorization in the other through cooperative feature learning.

Core Problem

While multi-modal models like CLIP achieve superior robustness and transferability compared to single-modal baselines, the theoretical mechanism behind this performance gap, specifically the optimization dynamics and feature learning involved, remains unexplained.

Why it matters:

Foundation models (FMs) like CLIP rely on multi-modal pre-training, but we lack a theoretical understanding of why adding modalities improves generalization
Existing theory focuses on single-modal or linear settings, failing to explain the interplay between signal learning and noise memorization in deep non-linear networks
Understanding this mechanism is crucial for designing better pre-training objectives and data selection strategies for large-scale models

Concrete Example: In single-modal learning with image augmentation, the augmented view often preserves the noise in the original image. The model can therefore minimize the contrastive loss by memorizing these noise features (spurious correlations). In multi-modal learning (e.g., image-text), the text modality carries the shared signal while its noise is independent of the image noise, so the model can filter out image noise and focus on the shared semantic signal.

Key Novelty

Signal-Noise Cooperation Theory for Contrastive Learning

Models the training process as a competition between learning shared semantic signals (signal learning) and memorizing random data patterns (noise memorization)
Demonstrates that in multi-modal learning, a high-quality second modality acts as a guide, accelerating signal learning in the first modality while suppressing noise
Proves that single-modal learning is fundamentally limited by the signal-to-noise ratio (SNR) of augmentations, leading to unavoidable noise memorization

Architecture

Though no explicit block diagram is provided, the paper describes a dual-encoder architecture where gradients from the contrastive loss update both encoders simultaneously, allowing signal information to flow between modalities.

Evaluation Highlights

Theoretical proof: Multi-modal contrastive learning achieves vanishing downstream test error (o(1)), while single-modal learning suffers constant error (Theta(1))
Synthetic experiments: Multi-modal learning achieves near 1.0 test accuracy on OOD tasks, while single-modal stagnates near 0.5
+69.45 percentage-point accuracy gain on ColoredMNIST (82.13% vs 12.68%) by using multi-modal supervision to ignore spurious color correlations

Breakthrough Assessment

8/10

Provides the first unified theoretical framework analyzing optimization and generalization for both single- and multi-modal contrastive learning in non-linear networks, rigorously explaining the empirical success of CLIP-like models.

⚙️ Technical Details

Problem Definition

Setting: Pre-training two encoders via contrastive learning on data generated from a signal-noise model, followed by linear probing on a downstream classification task

Inputs: Pairs of multi-modal data (x, x_tilde) where x contains signal mu and noise xi, or single-modal pairs (x, x_aug) where x_aug is an augmentation

Outputs: Learned representations h(x) used for downstream linear classification
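The input pairs described above can be illustrated with a minimal sketch. It assumes an additive form x = y * mu + xi with Gaussian noise and a label y in {-1, +1}; the summary only states that x contains signal mu and noise xi, so the additive form, the function names, and the parameters `noise_std` and `aug_std` are illustrative assumptions rather than the paper's exact construction.

```python
import random

def sample_noise(d, noise_std, rng):
    """Draw a d-dimensional Gaussian noise vector xi."""
    return [rng.gauss(0.0, noise_std) for _ in range(d)]

def multimodal_pair(mu, mu_tilde, noise_std, noise_std_tilde, rng):
    """Two modalities share the label y but draw independent noise."""
    y = rng.choice([-1, 1])
    xi = sample_noise(len(mu), noise_std, rng)
    xi_tilde = sample_noise(len(mu_tilde), noise_std_tilde, rng)
    x = [y * m + e for m, e in zip(mu, xi)]
    x_tilde = [y * m + e for m, e in zip(mu_tilde, xi_tilde)]
    return x, x_tilde, y

def singlemodal_pair(mu, noise_std, aug_std, rng):
    """The augmented view keeps the original noise xi plus a small perturbation."""
    y = rng.choice([-1, 1])
    xi = sample_noise(len(mu), noise_std, rng)
    x = [y * m + e for m, e in zip(mu, xi)]
    x_aug = [v + rng.gauss(0.0, aug_std) for v in x]
    return x, x_aug, y
```

In this sketch the single-modal pair shares almost all of its noise, while the multi-modal pair shares only the signal direction, which is the contrast the summary attributes the generalization gap to.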

Pipeline Flow

Data Generation (Signal + Noise model)
Encoder f (Single-modal) OR Encoders f, g (Multi-modal)
Contrastive Loss Calculation (InfoMax)
Gradient Descent Optimization
Downstream Linear Probing

System Modules

Encoder (Single-modal) (Representation Learning)

Maps input x to embedding h(x)

Model or implementation: Two-layer ReLU network (m neurons)

Encoders (Multi-modal) (Representation Learning)

Map inputs x and x_tilde to embeddings h(x) and g(x_tilde)

Model or implementation: Two separate two-layer ReLU networks
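A hedged sketch of one such encoder follows. It assumes the embedding is simply the vector of m hidden ReLU activations with no bias term; the names `init_weights` and `encode` and the parameter `init_std` are hypothetical, and the paper's exact parameterization may differ.

```python
import random

def relu(z):
    return max(0.0, z)

def init_weights(m, d, init_std, rng):
    """m hidden neurons, each with a d-dimensional Gaussian weight vector."""
    return [[rng.gauss(0.0, init_std) for _ in range(d)] for _ in range(m)]

def encode(W, x):
    """Embedding with one ReLU activation per hidden neuron."""
    return [relu(sum(w_k * x_k for w_k, x_k in zip(w_r, x))) for w_r in W]
```

In the multi-modal setting, two independent weight matrices (one per modality) would be created with `init_weights` and applied with `encode` to x and x_tilde respectively.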

Novel Architectural Elements

Unified theoretical analysis framework for comparing single-modal vs. multi-modal optimization trajectories
Trajectory-based analysis decomposing weights into signal components (gamma) and noise components (rho)
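For orientation, a decomposition of this kind is often written as follows in the feature-learning literature. The normalization and indexing here are assumptions for illustration (r indexes neurons, i indexes training samples, t is the iteration), and the paper's exact form may differ.

```latex
w_r^{(t)} = w_r^{(0)}
  + \gamma_r^{(t)} \, \frac{\mu}{\|\mu\|_2^2}
  + \sum_{i=1}^{n} \rho_{r,i}^{(t)} \, \frac{\xi_i}{\|\xi_i\|_2^2}
```

Under this form, growth in gamma corresponds to signal learning and growth in rho corresponds to noise memorization.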

Modeling

Base Model: Two-layer ReLU neural network (theoretical model)

Training Method: Gradient Descent on Contrastive Loss

Objective Functions:

Purpose: Maximize similarity between positive pairs and minimize similarity between negative pairs.

Formally: L = -sum log( exp(Sim(pos)/tau) / [exp(Sim(pos)/tau) + sum exp(Sim(neg)/tau)] )
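The objective above can be sketched in a few lines. This sketch assumes Sim is the inner product of the two embeddings and that the negatives for anchor i are the other samples in the batch; both choices are assumptions, as are the names `dot` and `contrastive_loss`.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def contrastive_loss(H, G, tau):
    """Sum over anchors i of -log of the softmax score of the positive pair (i, i).

    H[i] and G[i] are the embeddings of a positive pair; G[j] for j != i
    serve as negatives for anchor H[i].
    """
    total = 0.0
    for i, h in enumerate(H):
        logits = [dot(h, g) / tau for g in G]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denom - logits[i]
    return total
```

For the single-modal case, H and G would both come from the same encoder applied to x and x_aug; for the multi-modal case, they would come from the two separate encoders.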

Key Hyperparameters:

learning_rate: 0.01
hidden_size_m: 50
input_dimension_d: 2000
sample_size_n: 100
epochs: 200
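These values can be collected into a small configuration object, sketched below. The class name `SyntheticConfig` and the helper method are hypothetical; only the five values come from the list above. The helper makes the ratio of input dimension to sample size explicit, which is relevant to the high-dimensional (d >> n) requirement noted under Limitations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyntheticConfig:
    learning_rate: float = 0.01
    hidden_size_m: int = 50
    input_dimension_d: int = 2000
    sample_size_n: int = 100
    epochs: int = 200

    def dimension_to_sample_ratio(self) -> float:
        """Ratio d / n; the reported values give 2000 / 100."""
        return self.input_dimension_d / self.sample_size_n
```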

Comparison to Prior Work

vs. SimCLR: Theory shows SimCLR is limited by augmentation quality/SNR; this paper proves multi-modal learning overcomes this via independent noise in the second modality.
vs. [Wenlong Ji et al. 2023]: Extends feature learning analysis from single-modal to multi-modal settings.
vs. [Zixiang Chen et al. 2023]: Analyzes optimization dynamics and noise memorization explicitly, rather than just assuming transferable representations.

Limitations

Analysis is limited to two-layer ReLU networks (shallow architectures)
Assumes a specific Signal-Noise data generation model
Requires high-dimensional setting (d >> n) for concentration results
Does not account for pre-trained language models (assumes training from scratch)

Reproducibility

Theoretical proofs are fully provided in appendices. Synthetic experiment details (dimensions, noise levels, learning rates) are specified. ColoredMNIST setup follows standard literature (Arjovsky et al.). Code URL is not provided in the paper.

📊 Experiments & Results

Evaluation Setup

Pre-training on synthetic or real data, followed by linear classification on OOD test sets

Benchmarks:

Synthetic Signal-Noise Dataset (Binary Classification (OOD)) [New]
ColoredMNIST (Image Classification with Spurious Correlations)

Metrics:

Training Loss
Test Accuracy (OOD)
Signal Learning (magnitude of weight projection on signal)
Noise Memorization (magnitude of weight projection on noise)
Statistical methodology: Not explicitly reported in the paper
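The two projection-based metrics can be sketched as below. Taking the maximum absolute inner product over neurons (and over training noise vectors) is an assumption made for illustration; the paper may aggregate differently, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def signal_learning(W, mu):
    """Largest absolute projection of any neuron's weights onto the signal mu."""
    return max(abs(dot(w_r, mu)) for w_r in W)

def noise_memorization(W, noises):
    """Largest absolute projection of any neuron's weights onto any noise vector xi_i."""
    return max(abs(dot(w_r, xi)) for w_r in W for xi in noises)
```

Tracking both quantities per epoch would give curves comparable in spirit to the signal-learning and noise-memorization trajectories the summary describes.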

Key Results

Real-world validation on ColoredMNIST shows multi-modal learning successfully ignores spurious color correlations that single-modal learning memorizes.
Synthetic experiments confirm theoretical predictions: multi-modal learning exhibits superior signal learning and suppressed noise memorization.

Benchmark	Metric	Baseline (single-modal)	This Paper (multi-modal)	Δ
ColoredMNIST	Test Accuracy	12.68%	82.13%	+69.45 pp
Synthetic Dataset	Test Accuracy	~0.5	~1.0	~+0.5

Experiment Figures

Trajectories of Training Loss, Test Accuracy, Signal Learning, and Noise Memorization over epochs for both Single-modal and Multi-modal settings.

Main Takeaways

Multi-modal contrastive learning generalizes significantly better than single-modal learning on OOD tasks (o(1) vs Theta(1) error)
The performance gap is driven by 'feature cooperation': high SNR in one modality helps the other modality learn signals and ignore noise
Single-modal learning is fundamentally constrained by the noise level of augmentations; if augmentations preserve noise, the model memorizes it
Noise memorization in single-modal learning is severe and suppresses signal learning, whereas multi-modal learning suppresses noise memorization

📚 Prerequisite Knowledge

Prerequisites

Contrastive Learning (InfoNCE/InfoMax objectives)
Gradient Descent dynamics in neural networks
Feature learning theory (Signal-vs-Noise decomposition)
High-dimensional probability (concentration of measure)

Key Terms

SNR: Signal-to-Noise Ratio—the ratio of the magnitude of the useful semantic feature (signal) to the task-irrelevant variation (noise)
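As a worked illustration, one common way to quantify this ratio for Gaussian noise is to divide the signal norm by the root-mean-square noise norm, noise_std * sqrt(d). This specific formula and the function name `snr` are assumptions, not necessarily the paper's definition.

```python
import math

def snr(mu, noise_std):
    """Signal norm divided by the root-mean-square norm of d-dimensional Gaussian noise."""
    d = len(mu)
    signal_norm = math.sqrt(sum(m * m for m in mu))
    rms_noise_norm = noise_std * math.sqrt(d)
    return signal_norm / rms_noise_norm
```

With a fixed signal norm, this ratio shrinks as the dimension d or the noise standard deviation grows.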

InfoMax: A contrastive learning objective that maximizes mutual information between views (often implemented via InfoNCE loss)

Stop-gradient: An operation that prevents gradients from flowing through one branch of the network, used to stabilize training (e.g., in SimSiam or bootstrapped methods)

ReLU network: A neural network using Rectified Linear Units (f(x) = max(0, x)) as activation functions, allowing non-linear feature interactions

OOD: Out-of-Distribution—test data that comes from a different distribution than the training data (e.g., different noise patterns or spurious correlations)

Signal Learning: The process where the network weights align with the true semantic vector (mu)

Noise Memorization: The process where the network weights align with specific random noise vectors (xi) present in the training samples

Spurious correlation: A connection between a feature (e.g., background color) and a label that holds in training data but not in general (e.g., test data)

ColoredMNIST: A variant of the MNIST digit dataset where digits are colored specifically to introduce spurious correlations (e.g., 0 is usually red in training, but green in testing)
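The coloring idea can be sketched as follows. The rule used here (even labels prefer red, odd labels prefer green, with agreement probability `p_agree`) is a hypothetical stand-in chosen for brevity and is not the benchmark's actual protocol; the function names are likewise illustrative.

```python
import random

def pick_color(label, p_agree, rng):
    """Return the label's preferred color with probability p_agree, else the other color."""
    preferred = "red" if label % 2 == 0 else "green"
    other = "green" if preferred == "red" else "red"
    return preferred if rng.random() < p_agree else other

def colorize(gray, color):
    """Place a grayscale image (list of rows) into the red or green channel of an RGB image."""
    digit = [list(row) for row in gray]
    empty = lambda: [[0.0 for _ in row] for row in gray]
    if color == "red":
        return [digit, empty(), empty()]
    return [empty(), digit, empty()]
```

A training split would use a high `p_agree` and a test split a low one, so that a model relying on color instead of digit shape fails at test time.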