MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

📝 Paper Summary

Multi-modal Contrastive Learning, Long-tail Learning

MM-TS improves multi-modal contrastive learning on long-tailed data by dynamically scheduling the temperature parameter and adjusting it per-sample based on estimated cluster density.

Core Problem

Standard multi-modal contrastive learning uses fixed temperatures that fail to balance instance discrimination (needed for rare tail classes) and group-wise discrimination (needed for common head classes).

Why it matters:

Real-world multi-modal datasets (like video-text) follow long-tail distributions where rare concepts are underrepresented.
Fixed temperatures treat all samples equally: low temperatures help rare instances but break semantic clusters, while high temperatures help clusters but merge distinct rare instances.
Prior temperature scheduling methods were limited to uni-modal settings and did not account for the specific density of multi-modal data points.

Concrete Example: In a dataset with many 'cat' images but few 'parrot' images, a high fixed temperature might cluster all animals together (losing the parrot), while a low fixed temperature might treat every cat image as a distinct instance (losing the concept of 'cat'). MM-TS assigns high temperatures to cats (grouping them) and low temperatures to parrots (isolating them).
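
The trade-off in this example can be illustrated numerically. The sketch below is not taken from the paper: the similarity values and the two temperatures are hypothetical, and `softmax_weights` is an illustrative helper name. It shows how the temperature changes the weight that a softmax over negatives assigns to a hard negative compared with easy ones.

```python
import math

def softmax_weights(similarities, tau):
    """Softmax over similarities scaled by 1 / tau, as in an InfoNCE denominator."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one anchor and three negatives:
# one hard negative (0.9) followed by two easier ones.
negatives = [0.9, 0.3, 0.1]

low_tau = softmax_weights(negatives, tau=0.05)   # sharp distribution
high_tau = softmax_weights(negatives, tau=1.0)   # flatter distribution

print("low tau :", [round(w, 3) for w in low_tau])
print("high tau:", [round(w, 3) for w in high_tau])
```

With the low temperature almost all of the weight falls on the hardest negative, which pushes individual instances apart; with the high temperature the weight is spread more evenly, which leaves semantically close samples free to stay grouped.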

Key Novelty

Multi-Modal Temperature and Margin Schedules (MM-TS)

Dynamic Cosine Schedule: Gradually varies the base temperature during training to alternate between learning fine-grained instance features (low temp) and broad semantic clusters (high temp).
Density-Aware Adjustment: Uses text clustering to estimate concept frequency; adds a per-sample temperature shift so common concepts get higher temperatures (more clustering) and rare concepts get lower temperatures (more separation).
Unified Loss Application: Applies this scheduling logic not just to InfoNCE loss (via temperature) but also to Max-Margin loss (via margin), generalizing the benefit.

Architecture

Pipeline for estimating individual temperatures using text data.

Evaluation Highlights

Achieves state-of-the-art performance on EPIC-KITCHENS-100 (video-text retrieval), outperforming baselines.
Improves zero-shot retrieval on Flickr30K and MSCOCO when pretrained on CC3M compared to fixed-temperature baselines.
Demonstrates consistent gains on YouCook2 video-language tasks.

Breakthrough Assessment

7/10

A strong methodological improvement for the specific problem of long-tail multi-modal learning. It smartly adapts uni-modal insights (temperature schedules) to multi-modal settings using text-based density estimation.

⚙️ Technical Details

Problem Definition

Setting: Multi-modal contrastive learning aligning visual inputs (images/videos) and textual descriptions

Inputs: Set of visual inputs {v_1, ..., v_N} and corresponding texts {t_1, ..., t_N}

Outputs: Learned joint embedding space maximizing similarity between matched pairs

Pipeline Flow

Distribution Estimation (Text Clustering)
Dynamic Temperature Calculation (Cosine Schedule + Shift)
Contrastive Training (InfoNCE or Max-Margin)

System Modules

Distribution Estimator

Approximate visual density using text modality

Model or implementation: BERT (for embeddings) + K-Means
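
A minimal sketch of this module, under stated assumptions: toy 2-D vectors stand in for BERT sentence embeddings, a hand-written Lloyd's algorithm with fixed initial centroids stands in for whatever K-Means implementation the authors use, and the names `kmeans` and `sample_density` are illustrative. It shows the idea of using the size of a caption's cluster as a per-sample density proxy.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = [list(c) for c in init_centroids]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Toy stand-ins for text embeddings (assumption: a real pipeline would use BERT vectors).
# Six captions describe a frequent concept, two describe a rare one.
text_embeddings = [
    (0.9, 0.1), (1.0, 0.0), (0.8, 0.2), (0.95, 0.05), (0.85, 0.1), (1.0, 0.15),
    (0.0, 1.0), (0.1, 0.9),
]

assignments = kmeans(text_embeddings, init_centroids=[(1.0, 0.0), (0.0, 1.0)])
cluster_sizes = Counter(assignments)

# Density proxy per sample: the size of the cluster its caption falls into.
sample_density = [cluster_sizes[a] for a in assignments]
print(sample_density)
```

Because the clustering runs once on the text side before training, each visual input can simply inherit the density value of its paired caption.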

Temperature Scheduler

Calculate per-sample temperature at each step

Model or implementation: Analytical function
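
The summary does not give the analytical function itself, so the following is only a sketch of one plausible form. Everything in it is an assumption: the cosine oscillation between `tau_min` and `tau_max` (standing in for the schedule range alpha), the linear interpolation of the shift between `shift_small` and `shift_large` (standing in for sh- and sh+), the period, and all numeric values.

```python
import math

def base_temperature(step, period, tau_min, tau_max):
    """Cosine oscillation: tau_max at step 0, tau_min at half a period."""
    phase = (1.0 + math.cos(2.0 * math.pi * step / period)) / 2.0
    return tau_min + (tau_max - tau_min) * phase

def density_shift(cluster_size, smallest, largest, shift_small, shift_large):
    """Linearly interpolate the shift between the smallest and largest cluster."""
    if largest == smallest:
        return 0.0  # all clusters equally sized: no density information
    ratio = (cluster_size - smallest) / (largest - smallest)
    return shift_small + (shift_large - shift_small) * ratio

def sample_temperature(step, cluster_size, *, period=100, tau_min=0.05, tau_max=0.2,
                       smallest=2, largest=6, shift_small=-0.02, shift_large=0.02):
    """Per-sample temperature = scheduled base value + density-dependent shift."""
    tau = base_temperature(step, period, tau_min, tau_max)
    tau += density_shift(cluster_size, smallest, largest, shift_small, shift_large)
    return max(tau, 1e-4)  # keep the temperature strictly positive

for step in (0, 25, 50):
    head = sample_temperature(step, cluster_size=6)  # sample from the largest cluster
    tail = sample_temperature(step, cluster_size=2)  # sample from the smallest cluster
    print(step, round(head, 4), round(tail, 4))
```

At every step the sample from the large cluster receives a higher temperature than the sample from the small cluster, while both follow the same oscillating base schedule.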

Contrastive Loss

Compute loss using dynamic temperatures

Model or implementation: InfoNCE or Max-Margin

Novel Architectural Elements

Injection of text-derived density estimates into the temperature parameter of the loss function per-sample.
Substitution of the static margin parameter in Max-Margin loss with the same dynamic, per-sample scheduled value (tau_i) that serves as the temperature in InfoNCE.

Modeling

Base Model: CLIP-style dual encoder (architecture varies by benchmark, typically ViT/ResNet for vision, Transformer for text)

Training Method: Multi-modal contrastive training with dynamic temperature

Objective Functions:

Purpose: InfoNCE with dynamic temperature.

Formally: L_InfoNCE = -log(exp(s_ii / tau_i) / sum_j exp(s_ij / tau_i)), where tau_i varies per sample.

Purpose: Max-Margin with dynamic margin.

Formally: L_MM = max(0, tau_i - s_ii + s_ij) for j != i, where the margin m is replaced by tau_i.
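
Both objectives can be sketched directly from these formulas. The code below is an illustration rather than the authors' implementation: it covers one retrieval direction only, averages over samples (the reduction is an assumption), and uses a hypothetical similarity matrix and hypothetical per-sample values.

```python
import math

def info_nce(sim, taus):
    """Mean over i of -log( exp(s_ii / tau_i) / sum_j exp(s_ij / tau_i) )."""
    losses = []
    for i, row in enumerate(sim):
        logits = [s / taus[i] for s in row]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def max_margin(sim, margins):
    """Mean hinge loss max(0, margin_i - s_ii + s_ij) over all pairs with j != i."""
    total, count = 0.0, 0
    for i, row in enumerate(sim):
        for j, s_ij in enumerate(row):
            if j != i:
                total += max(0.0, margins[i] - row[i] + s_ij)
                count += 1
    return total / count

# Hypothetical similarity matrix: rows are visual inputs, columns are texts,
# and the diagonal holds the matched pairs.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.5],
]
taus = [0.2, 0.2, 0.05]  # e.g. a lower value for a sample from a rare cluster

print(info_nce(sim, taus))
print(max_margin(sim, taus))
```

The only change relative to the fixed-parameter losses is that `taus[i]` and `margins[i]` are indexed by sample instead of being a single scalar.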

Training Data:

CC3M (image-text)
EPIC-KITCHENS-100 (video-text)
YouCook2 (video-text)

Key Hyperparameters:

alpha: Range of base temperature cosine schedule
sh+: Shift for largest clusters
sh-: Shift for smallest clusters
K: Number of clusters for K-means (distribution estimation)

Compute: Negligible overhead over standard contrastive training (clustering is one-off).

Comparison to Prior Work

vs. CLIP: MM-TS varies temperature over time AND per sample based on density, rather than a single global scalar.
vs. TempNet: MM-TS uses a simpler non-parametric heuristic (text clustering) rather than training a separate network for temperature prediction.
vs. Uni-modal Temperature Schedules [23]: Extends the concept to multi-modal data and introduces the density-based shift derived from the text modality.

Limitations

Relies on text modality to estimate visual distribution; assumes text-visual alignment is strong enough for this proxy to hold.
Requires K-Means clustering as a preprocessing step.
Hyperparameters for shifts (sh+, sh-) and schedule range need tuning.

Reproducibility

Code: https://github.com/SergShel/MM-TS

The method relies on standard datasets (CC3M, EPIC-KITCHENS-100) and standard backbones (CLIP, BERT).

📊 Experiments & Results

Evaluation Setup

Zero-shot retrieval (Text-to-Image, Image-to-Text) and Video-Text Retrieval

Benchmarks:

Flickr30K: zero-shot image-text retrieval
MSCOCO: zero-shot image-text retrieval
EPIC-KITCHENS-100: fine-grained video-text retrieval
YouCook2: video-text retrieval

Metrics:

Recall@1 (R@1)
Recall@5 (R@5)
Recall@10 (R@10)

Statistical methodology: Not explicitly reported in the paper.
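
For reference, Recall@K can be computed from a query-by-candidate similarity matrix as sketched below. The sketch assumes exactly one correct candidate per query, located at the same index as the query; the matrix values and the name `recall_at_k` are illustrative.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose matching item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        # Candidate indices sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

# Hypothetical similarities: rows are text queries, columns are images.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.6, 0.5],  # the correct image (index 2) is ranked second
]

print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

Transposing the matrix gives the opposite retrieval direction (image-to-text instead of text-to-image).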

Experiment Figures

Conceptual illustration of temperature impact on gradients/forces.

Visualization of the temperature schedule over training iterations.

Main Takeaways

Dynamic temperature scheduling improves performance across both image-text and video-text retrieval benchmarks.
The method is effective for both InfoNCE and Max-Margin loss functions, showing generalization.
Using text clusters to approximate visual density is an effective strategy for long-tail multi-modal data.
Improvements are observed in both zero-shot settings (CC3M pretraining) and fine-tuning settings (EPIC-KITCHENS-100).

📚 Prerequisite Knowledge

Prerequisites

Contrastive Learning (InfoNCE loss)
Max-Margin Loss
Long-tail distributions
Cosine similarity

Key Terms

InfoNCE: A loss function used in contrastive learning that maximizes the probability of selecting the correct positive sample from a set of negatives.

Max-Margin Loss: A loss function that requires the similarity of a matched (positive) pair to exceed the similarity of each mismatched (negative) pair by at least a margin.

Temperature (tau): A scalar parameter in InfoNCE that controls the sharpness of the probability distribution; low tau focuses on hard negatives, high tau treats negatives more uniformly.

Long-tail data: Data distributions where a few classes are very frequent (head) and many classes are rare (tail).

Instance discrimination: Learning to distinguish every single data point as unique (encouraged by low temperature).

Group-wise discrimination: Learning to group semantically similar data points together (encouraged by high temperature).

CC3M: Conceptual Captions 3M, a large-scale dataset of image-caption pairs.

EPIC-KITCHENS-100: A large-scale egocentric video dataset recording daily kitchen activities.

YouCook2: A dataset of cooking videos with recipe texts.