Effective pruning of web-scale datasets based on complexity of concept clusters

📝 Paper Summary

Data Pruning / Coreset Selection; Multimodal Pretraining (CLIP)

Density-Based Pruning (DBP) reduces web-scale multimodal datasets by clustering embeddings and selecting difficult examples based on cluster density and proximity to other clusters, achieving higher performance with significantly less data.

Core Problem

Training foundation models like CLIP on massive web-scale datasets (e.g., LAION-2B) is prohibitively expensive and inefficient because much of the data is redundant or low-quality.

Why it matters:

High computational and environmental costs of training large-scale models limit research to well-funded industry labs.
Existing pruning methods like CLIP-score filtering focus on individual sample quality but ignore the marginal information gain relative to other samples in the dataset.

Concrete Example: A dataset might contain thousands of nearly identical images of 'golden retrievers'. Random sampling or simple CLIP-score filtering might keep hundreds of these redundant examples, wasting compute, while under-sampling sparser concepts like specific rare birds.

Key Novelty

Density-Based Pruning (DBP)

Clusters dataset embeddings using k-means and calculates a 'complexity' score for each cluster based on its density (intra-cluster distance) and isolation (inter-cluster distance).
Allocates a target number of samples to each cluster proportional to its complexity, then selects the 'hardest' (least prototypical) samples from each cluster to fill that quota.

Architecture

Conceptual flow of the Density-Based Pruning method.

Evaluation Highlights

+1.1 percentage points ImageNet zero-shot accuracy over OpenCLIP-ViT-B/32 baseline while using only 27.7% of the training data/compute.
Sets a new state of the art for ImageNet zero-shot accuracy on the DataComp Medium benchmark, ahead of T-MARS and other baselines.
Outperforms training on the full LAION-CAT-440M dataset on retrieval and VTAB tasks despite using only ~50% of the training compute.

Breakthrough Assessment

8/10

Significantly improves data efficiency for CLIP training and challenges the assumption that more data is always better: a carefully pruned subset can match or beat training on the full dataset.

⚙️ Technical Details

Problem Definition

Setting: Subset selection for Contrastive Language-Image Pretraining (CLIP)

Inputs: Large-scale web dataset D = {(image, caption)}

Outputs: Pruned subset S ⊂ D such that |S| << |D|

Pipeline Flow

SemDeDup (Semantic Deduplication)
CLIP-score Filtering
Density-Based Pruning (Clustering → Complexity Calculation → Sampling)

System Modules

SemDeDup

Remove semantically near-identical duplicates to prevent cluster domination by redundant concepts

Model or implementation: Not specified (uses pre-computed similarities)
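
The deduplication idea can be illustrated with a minimal sketch. It assumes each sample already has an embedding vector; the function name `semantic_dedup`, the threshold `eps`, and the greedy keep-first rule are hypothetical choices for illustration and are not details reported in this summary. This toy version compares all pairs, which would not scale to web-sized data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_dedup(embeddings, eps=0.05):
    """Greedy dedup: keep a sample only if no already-kept sample has
    cosine similarity above 1 - eps. `eps` is a hypothetical threshold."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= 1.0 - eps for j in kept):
            kept.append(i)
    return kept

# Toy data: samples 0 and 1 are near-identical, sample 2 is distinct.
embs = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept_indices = semantic_dedup(embs)
```

Here the second sample is dropped because it is almost parallel to the first, while the third is kept.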

CLIP-score Filter

Remove samples where image and text do not align (low cosine similarity)

Model or implementation: OpenAI CLIP-B/32 (LAION) or CLIP-L/14 (DataComp)
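
As a rough sketch of this filter, assuming image and caption embeddings from the same CLIP model are already available: the function names and the `threshold` value below are hypothetical, and the summary does not report the cutoff that was actually used.

```python
import math

def clip_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and its caption embedding."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return dot / norm if norm else 0.0

def clip_score_filter(pairs, threshold):
    """Return indices of (image_emb, text_emb) pairs whose score reaches `threshold`.
    `threshold` is a hypothetical cutoff, not a value from the paper."""
    return [i for i, (img, txt) in enumerate(pairs) if clip_score(img, txt) >= threshold]

# Toy pairs: well aligned, unrelated, partially aligned.
pairs = [([1.0, 0.0], [1.0, 0.0]),
         ([1.0, 0.0], [0.0, 1.0]),
         ([1.0, 1.0], [1.0, 0.0])]
kept_pairs = clip_score_filter(pairs, threshold=0.5)
```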

K-means Clustering (Density-Based Pruning)

Group data into semantically similar concepts to analyze density

Model or implementation: Distilled DINOv2-L/14 (Image Encoder)

Complexity Scorer (Density-Based Pruning)

Determine how many samples to keep from each cluster based on density and isolation

Model or implementation: Analytical Formula (Eq. 1 & 2)
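
The summary cites Eq. 1 and 2 without reproducing them, so the following is only a sketch under stated assumptions: intra-cluster distance is taken as the mean distance of members to their centroid, inter-cluster distance as the mean distance from a centroid to its `l` nearest other centroids, complexity as the product of the two, and the per-cluster sampling probability as a softmax over complexity with temperature `tau`. Euclidean distance and the function names are illustrative choices.

```python
import math

def cluster_complexity(clusters, centroids, l=20):
    """Assumed complexity score: d_intra * d_inter for each cluster.
    clusters[j] is the list of member vectors of cluster j."""
    scores = []
    for j, members in enumerate(clusters):
        # Assumed density term: mean member-to-centroid distance.
        d_intra = sum(math.dist(x, centroids[j]) for x in members) / len(members)
        # Assumed isolation term: mean distance to the l nearest other centroids.
        others = sorted(math.dist(centroids[j], c)
                        for i, c in enumerate(centroids) if i != j)
        nearest = others[:l]
        d_inter = sum(nearest) / len(nearest) if nearest else 0.0
        scores.append(d_intra * d_inter)
    return scores

def softmax(scores, tau=0.1):
    """Temperature softmax turning complexity scores into sampling probabilities."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: cluster 0 is spread out and isolated; clusters 1 and 2 are tight and close together.
centroids = [[0.0, 0.0], [1.0, 0.0], [1.1, 0.0]]
clusters = [[[-0.1, 0.0], [0.1, 0.0]],
            [[1.0, 0.01], [1.0, -0.01]],
            [[1.1, 0.01], [1.1, -0.01]]]
complexity = cluster_complexity(clusters, centroids, l=1)
probs = softmax(complexity, tau=0.1)
```

Under these assumptions the sparse, isolated cluster receives the largest share of the sampling budget, and the two dense neighbouring clusters split the rest equally.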

Hardness Sampler (Density-Based Pruning)

Select specific samples from each cluster to meet target count

Model or implementation: Distance Ranking
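
A minimal sketch of distance ranking, assuming "hardest" means farthest from the cluster centroid (the least prototypical samples) and that each cluster's quota has already been computed. The function name and the use of Euclidean distance are illustrative assumptions.

```python
import math

def select_hardest(members, centroid, quota):
    """Return indices of the `quota` members farthest from the centroid."""
    ranked = sorted(range(len(members)),
                    key=lambda i: math.dist(members[i], centroid),
                    reverse=True)
    return ranked[:quota]

# Toy cluster: distances to the centroid are 0.1, 0.9, 0.5 and 0.05.
members = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [0.0, 0.05]]
hardest = select_hardest(members, centroid=[0.0, 0.0], quota=2)
```

The two samples closest to the centroid are discarded as easy prototypes.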

Novel Architectural Elements

Concept-specific pruning rate adapted to cluster complexity (unlike fixed rates in SSP)
Combination of intra-cluster density and inter-cluster isolation to define complexity
Quadratic programming solver to optimize sampling probabilities under dataset constraints
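
The summary does not spell out the quadratic program, so the sketch below is not a QP solver. It is a simple greedy stand-in that respects the constraints such a solver would plausibly enforce: quotas follow the sampling probabilities, no cluster is asked for more samples than it contains, and the quotas sum to the target size. The function name and the redistribution rule are hypothetical, and the probabilities are assumed to be strictly positive (as softmax outputs are).

```python
def allocate_quotas(probs, cluster_sizes, target):
    """Split `target` samples across clusters in proportion to `probs`,
    capping each quota at the cluster size and re-sharing the excess
    among the clusters that still have room. Returns fractional quotas."""
    if target > sum(cluster_sizes):
        raise ValueError("target exceeds the number of available samples")
    quotas = [0.0] * len(probs)
    active = set(range(len(probs)))
    remaining = float(target)
    while active:
        mass = sum(probs[j] for j in active)
        shares = {j: remaining * probs[j] / mass for j in active}
        capped = {j for j in active if shares[j] >= cluster_sizes[j]}
        if not capped:
            # Every remaining cluster can absorb its proportional share.
            for j in active:
                quotas[j] = shares[j]
            break
        for j in capped:
            # Cluster is exhausted: take everything it has.
            quotas[j] = float(cluster_sizes[j])
            remaining -= cluster_sizes[j]
        active -= capped
    return quotas

# Cluster 0 would get 60 samples by probability alone but only holds 40.
quotas = allocate_quotas([0.6, 0.3, 0.1], [40, 100, 100], target=100)
```

In the toy call, the 20 samples that cluster 0 cannot supply are redistributed to clusters 1 and 2 in a 3:1 ratio. Rounding the fractional quotas to integers is left out.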

Modeling

Base Model: CLIP-ViT-B/32

Training Method: Contrastive Language-Image Pretraining (CLIP)

Objective Functions:

Purpose: Maximize similarity between matched image-text pairs and minimize it for unmatched pairs.

Formally: InfoNCE loss / Contrastive loss
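
For readers unfamiliar with the objective, here is a small sketch of the standard symmetric contrastive loss over a batch, assuming a precomputed image-caption similarity matrix with matched pairs on the diagonal. The function name and the fixed `temperature=0.07` are illustrative assumptions; in CLIP-style training the temperature is typically learned.

```python
import math

def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss. sim[i][j] is the similarity between
    image i and caption j; pair (i, i) is the matched pair."""
    n = len(sim)

    def mean_cross_entropy(matrix):
        # Cross-entropy of each row against its diagonal entry.
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [s / temperature for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))

aligned_loss = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = clip_contrastive_loss([[0.0, 1.0], [1.0, 0.0]])
uniform_loss = clip_contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
```

The loss is near zero when matched pairs are the most similar, large when they are the least similar, and equal to the log of the batch size when all similarities are identical.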

Training Data:

LAION-CAT-440M (filtered to subsets)
DataComp Medium (128M raw, filtered to ~19M)

Key Hyperparameters:

epochs: 32 (LAION experiments)
samples_seen: 128 million (DataComp protocol)
batch_size: Not explicitly reported in the paper
learning_rate: Not explicitly reported in the paper
k_clusters: 500 (LAION), 100 (DataComp)
softmax_temperature_tau: 0.1
nearest_neighbors_l: 20

Compute: Not reported in the paper

Comparison to Prior Work

vs. SSP-Pruning: Adapts pruning rate per concept based on complexity rather than fixed balancing; scales to web-scale data via deduplication.
vs. CLIP-score / T-MARS: Considers dataset distribution (density) rather than just individual sample quality.
vs. Random Sampling: Significantly improves data efficiency by keeping informative, hard examples.
vs. Coreset Selection [not cited in paper]: Focuses on unsupervised multimodal density rather than supervised loss gradients or uncertainty.

Limitations

Requires pre-computation of embeddings and clustering, adding overhead before training starts.
Performance gains on retrieval tasks are sensitive to training duration (shorter training hurts retrieval more).
Relies on the quality of the pre-trained embedding model (DINOv2) for clustering.

Reproducibility

Code: https://github.com/ExplainableML/Density-Based-Pruning

Publicly available code linked (ExplainableML/Density-Based-Pruning). Uses standard OpenCLIP training hyperparameters. Dataset filtering relies on SemDeDup and specific pre-trained embeddings (DINOv2).

📊 Experiments & Results

Evaluation Setup

Zero-shot transfer learning on image classification and retrieval tasks
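
Zero-shot classification with a CLIP-style model can be sketched as follows, assuming image embeddings and one text embedding per class name (for example from a prompt built around the class name) are already computed. All names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def zero_shot_predict(image_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the image."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))

def top1_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class equals the label."""
    correct = sum(zero_shot_predict(emb, class_text_embs) == label
                  for emb, label in zip(image_embs, labels))
    return correct / len(labels)

class_text_embs = {"dog": [1.0, 0.0], "cat": [0.0, 1.0]}
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = ["dog", "cat", "cat"]
accuracy = top1_accuracy(image_embs, labels, class_text_embs)
```

No classifier is trained on the target classes; the third toy image is misclassified because its embedding lies closer to the "dog" text embedding.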

Benchmarks:

ImageNet (IN-1K) (Zero-shot Image Classification)
ImageNet Distribution Shifts (Robustness Evaluation: ImageNet-A, ImageNet-R, ImageNet-V2, ImageNet-Sketch)
VTAB (Visual Task Adaptation: 19 diverse tasks)
Retrieval Tasks (Image-Text Retrieval: Flickr30k, MSCOCO)
DataComp Medium (Benchmark Suite: 38 tasks)

Metrics:

Top-1 Zero-shot Accuracy
Average Zero-shot Accuracy (DataComp)
Recall@1 (Retrieval)
Statistical methodology: Not explicitly reported in the paper
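
The retrieval metric can be illustrated with a short sketch, assuming a query-by-candidate similarity matrix in which the correct candidate for query `i` sits at index `i`. The function name and the toy matrix are illustrative.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose ground-truth candidate (same index)
    appears among the k most similar candidates."""
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)

# Query 1 ranks candidate 2 above its true match, so it misses at k=1.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.3, 0.8],
       [0.1, 0.2, 0.7]]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```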

Key Results

Performance on LAION subsets showing efficiency gains. Training on 27.7% of data outperforms the full OpenCLIP baseline.

Benchmark	Metric	Baseline	This Paper	Δ (pp)
ImageNet	Zero-shot Accuracy (%)	62.92	65.44	+2.52

DataComp Medium Benchmark results compared to state-of-the-art filtering methods.

Benchmark	Metric	Baseline	This Paper	Δ (pp)
DataComp Medium (ImageNet)	Zero-shot Accuracy (%)	36.9	39.2	+2.3
DataComp Medium (VTAB)	Average Accuracy (%)	46.1	48.4	+2.3
DataComp Medium (Retrieval)	Average Recall (%)	26.9	30.4	+3.5

Experiment Figures

Performance vs. Training Cost (Billions of Samples Seen) on LAION-CAT-440M subsets.

Ablation of different embedding models for clustering (CLIP vs DINOv2 vs BLIP).

Main Takeaways

Pruning massive datasets to smaller, high-quality subsets can yield better models than training on the full noisy dataset.
Adapting pruning rates to concept complexity (DBP) is superior to fixed balancing (SSP) or random sampling.
Keeping hard examples (removing easy, prototypical samples close to cluster centroids) is effective for contrastive pretraining at web scale.
Distilled DINOv2 embeddings provide better clustering signals for pruning than CLIP embeddings.
Retrieval and distribution-shift tasks benefit more from longer training than standard ImageNet classification does.

📚 Prerequisite Knowledge

Prerequisites

Contrastive Learning (CLIP)
K-means clustering
Embedding spaces
Zero-shot classification

Key Terms

CLIP: Contrastive Language-Image Pretraining—a model trained to match images with their corresponding text captions

LAION: Large-scale Artificial Intelligence Open Network, a non-profit organization that releases massive open datasets of image-text pairs (e.g., LAION-2B) used for training multimodal models

Zero-shot accuracy: The ability of a model to classify images into categories it has not explicitly seen during training, using only class names/descriptions

DINOv2: A self-supervised vision model used here to generate high-quality image embeddings for clustering

SemDeDup: Semantic Deduplication—a method to remove semantically redundant image pairs based on embedding similarity

SSP-Pruning: Self-Supervised-Prototypes Pruning—a prior method that prunes data by removing 'prototypical' (easy) samples close to cluster centroids

DataComp: A benchmark competition focusing on dataset curation for multimodal model training

ITM: Image-Text Matching—a score indicating how well an image matches its caption

VTAB: Visual Task Adaptation Benchmark—a suite of diverse vision tasks used to evaluate transfer learning