
MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

Siarhei Sheludzko, Dhimitrios Duka, Bernt Schiele, Hilde Kuehne, Anna Kukleva
arXiv (2026)
MM Pretraining Benchmark

📝 Paper Summary

Multi-modal Contrastive Learning, Long-tail Learning
MM-TS improves multi-modal contrastive learning on long-tailed data by dynamically scheduling the temperature parameter and adjusting it per-sample based on estimated cluster density.
Core Problem
Standard multi-modal contrastive learning uses fixed temperatures that fail to balance instance discrimination (needed for rare tail classes) and group-wise discrimination (needed for common head classes).
Why it matters:
  • Real-world multi-modal datasets (like video-text) follow long-tail distributions where rare concepts are underrepresented.
  • Fixed temperatures treat all samples equally: low temperatures help rare instances but break semantic clusters, while high temperatures help clusters but merge distinct rare instances.
  • Prior temperature scheduling methods were limited to uni-modal settings and did not account for the specific density of multi-modal data points.
Concrete Example: In a dataset with many 'cat' images but few 'parrot' images, a high fixed temperature might cluster all animals together (losing the parrot), while a low fixed temperature might treat every cat image as a distinct instance (losing the concept of 'cat'). MM-TS assigns high temperatures to cats (grouping them) and low temperatures to parrots (isolating them).
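The trade-off in this example can be checked numerically. The following is a minimal sketch, not code from the paper: the similarity values and the helper names `softmax_with_temperature` and `negative_weight_share` are hypothetical. It shows how the temperature changes the share of the softmax weight that each negative receives, which is what determines how strongly each negative is pushed away under an InfoNCE-style loss.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities divided by the temperature tau."""
    scaled = [s / tau for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def negative_weight_share(similarities, tau):
    """Share of the total negative weight on each negative.

    Index 0 is assumed to be the positive pair; the rest are negatives.
    """
    probs = softmax_with_temperature(similarities, tau)
    neg_total = sum(probs[1:])
    return [p / neg_total for p in probs[1:]]

# Hypothetical cosine similarities of one anchor to:
# its positive, a semantically close negative, an unrelated negative.
sims = [0.9, 0.8, 0.1]

low = negative_weight_share(sims, tau=0.05)   # low temperature
high = negative_weight_share(sims, tau=1.0)   # high temperature
print("low tau :", low)
print("high tau:", high)
```

With the low temperature, almost all of the negative weight falls on the semantically close negative, so even near-identical samples are separated (instance discrimination). With the high temperature, the weight is spread more evenly across negatives, so close samples are penalized less and can stay grouped.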
Key Novelty
Multi-Modal Temperature and Margin Schedules (MM-TS)
  • Dynamic Cosine Schedule: Gradually varies the base temperature during training to alternate between learning fine-grained instance features (low temp) and broad semantic clusters (high temp).
  • Density-Aware Adjustment: Uses text clustering to estimate concept frequency; adds a per-sample temperature shift so common concepts get higher temperatures (more clustering) and rare concepts get lower temperatures (more separation).
  • Unified Loss Application: Applies this scheduling logic not just to InfoNCE loss (via temperature) but also to Max-Margin loss (via margin), generalizing the benefit.
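The three components above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the exact schedule form, the bounds (`low`, `high`), the period, the clipping floor, and the way the per-sample shift is added are all assumptions, and the function names are hypothetical. The same schedule function drives either the temperature of InfoNCE or the margin of a max-margin ranking loss.

```python
import math

def cosine_schedule(step, period, low, high):
    """Assumed periodic cosine schedule oscillating between high and low."""
    phase = 2.0 * math.pi * step / period
    return low + (high - low) * (1.0 + math.cos(phase)) / 2.0

def per_sample_value(base, shift, floor=0.01):
    """Add a per-sample shift to the scheduled value, clipped from below."""
    return max(floor, base + shift)

def info_nce(sim_row, pos_index, tau):
    """InfoNCE for one anchor: -log softmax(sim / tau)[pos_index]."""
    scaled = [s / tau for s in sim_row]
    peak = max(scaled)
    log_denom = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return log_denom - scaled[pos_index]

def max_margin(sim_row, pos_index, margin):
    """Hinge ranking loss for one anchor, summed over its negatives."""
    pos = sim_row[pos_index]
    return sum(
        max(0.0, margin + s - pos)
        for j, s in enumerate(sim_row)
        if j != pos_index
    )

# Hypothetical similarities of one anchor to its positive (index 0)
# and two negatives, plus an assumed positive shift for a common concept.
sim_row = [0.9, 0.8, 0.1]
shift = 0.02
for step in (0, 250, 500):
    tau = per_sample_value(cosine_schedule(step, 1000, 0.05, 0.5), shift)
    margin = cosine_schedule(step, 1000, 0.1, 0.3)
    print(step, info_nce(sim_row, 0, tau), max_margin(sim_row, 0, margin))
```

Keeping the schedule separate from the loss is a design choice made here for clarity: either loss only needs one scalar per sample, so the scheduling logic does not have to know which loss it feeds.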
Architecture
Figure 2
Pipeline for estimating individual temperatures using text data.
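One way such a pipeline could turn text clusters into per-sample shifts is sketched below. The summary does not specify the clustering method or the mapping from cluster size to shift, so both are assumptions here: cluster ids are taken as already computed from the captions, and the linear rescaling to `[-max_shift, +max_shift]` and the function name `density_shifts` are hypothetical.

```python
from collections import Counter

def density_shifts(cluster_ids, max_shift=0.05):
    """Map each sample's text-cluster size to a temperature shift.

    Samples in the largest cluster get +max_shift (more grouping),
    samples in the smallest cluster get -max_shift (more separation).
    The linear mapping is an assumption made for illustration.
    """
    counts = Counter(cluster_ids)
    largest = max(counts.values())
    smallest = min(counts.values())
    span = largest - smallest
    if span == 0:
        # All clusters are equally frequent: no density signal to use.
        return [0.0 for _ in cluster_ids]
    shifts = []
    for cid in cluster_ids:
        rel = (counts[cid] - smallest) / span  # 0 for rarest, 1 for most common
        shifts.append(max_shift * (2.0 * rel - 1.0))
    return shifts

# Toy cluster assignment echoing the cat/parrot example: four captions
# fall in a common cluster and one caption in a rare cluster.
shifts = density_shifts(["cat", "cat", "cat", "cat", "parrot"])
print(shifts)
```

In this toy case the common-cluster samples receive a positive shift and the rare-cluster sample a negative one, matching the direction described in the summary: higher temperatures for common concepts, lower temperatures for rare ones.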
Evaluation Highlights
  • Reports state-of-the-art video-text retrieval performance on EPIC-KITCHENS-100.
  • Improves zero-shot retrieval on Flickr30K and MSCOCO when pretrained on CC3M compared to fixed-temperature baselines.
  • Demonstrates consistent gains on YouCook2 video-language tasks.
Breakthrough Assessment
7/10
A strong methodological improvement for the specific problem of long-tail multi-modal learning. It adapts a uni-modal insight (temperature schedules) to the multi-modal setting by using text-based density estimation to set per-sample temperatures.