Uncovering Intrinsic Capabilities: A Paradigm for Data Curation in Vision-Language Models

📝 Paper Summary

Instruction Tuning, Data Curation, Curriculum Learning

CADC optimizes vision-language model instruction tuning by discovering latent capabilities from training dynamics and selecting data that balances and sequences these capabilities.

Core Problem

Reducing instruction tuning data often degrades model performance because heuristic selection methods treat models as black boxes, ignoring the specific latent capabilities required for complex tasks.

Why it matters:

Standard data pruning methods (like heuristic filtering) often inadvertently remove data critical for specific skills (e.g., grounding or recognition), causing performance regression.
Real-world tasks require multiple complementary capabilities; optimizing for one (like reasoning) while neglecting others leads to imbalanced models.

Concrete Example: Analyzing a chemical reaction diagram requires recognition (identifying atoms) and reasoning (deducing pathways). If data selection disproportionately targets reasoning, the model fails because it loses the recognition capability needed to identify the atoms in the first place.

Key Novelty

Capability-Attributed Data Curation (CADC)

Discovers intrinsic capabilities (latent skills) by clustering validation tasks based on the similarity of their gradient update trajectories during training.
Attributes training data to these discovered capabilities by measuring how much a training sample's gradient aligns with the gradients of capability-specific validation sets.
Curates a curriculum that balances data across capabilities and sequences them from fundamental to complex based on self-influence learning curves.

Architecture

High-level overview of the CADC framework.

Evaluation Highlights

Surpasses full-data training (100% budget) using only 5% of the original data on SmolVLM-256M (107.1% relative performance).
Outperforms state-of-the-art pruning methods (TIVE, COINCIDE, ICONS) on LLaVA-7B benchmarks while using significantly less data (5% vs 15-20%).
Achieves best or second-best performance across 11 diverse multimodal benchmarks, including LLaVA-Wild, MMBench, and HallusionBench.

Breakthrough Assessment

9/10

Strong conceptual novelty in replacing heuristic data selection with unsupervised capability discovery. The efficiency gain (beating 100% data with 5%) is exceptionally high compared to typical pruning results.

⚙️ Technical Details

Problem Definition

Setting: Instruction tuning of Large Vision-Language Models (VLMs)

Inputs: A large pool of multimodal instruction data D_train and a diverse validation set D_target

Outputs: A curated subset of training data D_subset and a curriculum sequence for fine-tuning

Pipeline Flow

Capability Discovery: Trajectory Tracking → Task Similarity Graph → Community Detection
Data Attribution: Influence Estimation → Soft Assignment
Curation: Budget Allocation → Pool Sampling → Curriculum Sequencing

System Modules

Trajectory Tracker (Capability Discovery)

Record AdamW update vectors for validation subtasks during a warm-up training phase

Model or implementation: VLM (e.g., LLaVA-1.5 or SmolVLM)
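
A minimal sketch of what recording an AdamW update vector could look like. It assumes the update signal is the bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) computed from a subtask's flattened gradient, with learning rate and weight decay left out. For simplicity each subtask keeps its own moment estimates here; an actual implementation would more likely reuse the optimizer state of the warm-up run. The function names adam_update_vector and track_trajectory are hypothetical.

```python
import math

def adam_update_vector(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (update, new_m, new_v) for one Adam-style step on a flat gradient.

    Assumption: the update is m_hat / (sqrt(v_hat) + eps); learning rate and
    decoupled weight decay are omitted.
    """
    new_m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    new_v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero-initialised moments.
    m_hat = [mi / (1 - beta1 ** step) for mi in new_m]
    v_hat = [vi / (1 - beta2 ** step) for vi in new_v]
    update = [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]
    return update, new_m, new_v

def track_trajectory(subtask_grads):
    """Concatenate the update vectors from successive checkpoints into one trajectory."""
    dim = len(subtask_grads[0])
    m, v = [0.0] * dim, [0.0] * dim
    trajectory = []
    for step, grad in enumerate(subtask_grads, start=1):
        update, m, v = adam_update_vector(grad, m, v, step)
        trajectory.extend(update)
    return trajectory

# Hypothetical gradients of one validation subtask at two warm-up checkpoints.
trajectory = track_trajectory([[2.0, -0.5], [2.0, -0.5]])
```

Because the Adam direction normalises each coordinate by its running magnitude, the resulting trajectory depends more on the sign and consistency of the gradient than on its raw scale.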

Community Detector (Capability Discovery)

Cluster subtasks into intrinsic capabilities based on trajectory similarity

Model or implementation: Leiden Algorithm
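
The graph-then-cluster step can be illustrated with a small sketch. It builds a task similarity graph from cosine similarity between trajectory vectors and then groups subtasks by connected components. The connected-components step is a deliberately simple stand-in for the Leiden algorithm, not a reimplementation of it; the threshold value, the subtask names, and the toy trajectories are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def build_similarity_graph(trajectories, threshold=0.5):
    """Link two subtasks when their trajectory similarity exceeds the threshold."""
    names = list(trajectories)
    edges = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(trajectories[a], trajectories[b]) > threshold:
                edges[a].add(b)
                edges[b].add(a)
    return edges

def connected_components(edges):
    """Stand-in for Leiden (assumption): group subtasks by graph connectivity."""
    seen, groups = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups

# Hypothetical toy trajectories for four validation subtasks.
toy = {
    "ocr": [1.0, 0.1, 0.0],
    "object_counting": [0.9, 0.2, 0.1],
    "chart_reasoning": [0.0, 0.1, 1.0],
    "math_reasoning": [0.1, 0.0, 0.9],
}
capabilities = connected_components(build_similarity_graph(toy))
```

A modularity-based method such as Leiden would operate on the weighted graph directly and would not need a hard threshold; the threshold here only keeps the sketch dependency-free.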

Attributor

Assign training samples to capabilities based on gradient alignment

Model or implementation: Cosine Similarity / Influence Function
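
As an illustration of gradient-alignment attribution with soft assignment, the sketch below scores a training sample against one representative gradient per capability and turns the scores into weights. The softmax with a temperature is an assumption (the summary only says "soft assignment"), as are the function name soft_assign and the toy vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def soft_assign(sample_grad, capability_grads, temperature=0.1):
    """Softmax over the sample's cosine alignment with each capability gradient.

    capability_grads maps a capability name to a representative (e.g. mean)
    gradient of its validation subtasks. The temperature is a hypothetical knob.
    """
    scores = {c: cosine(sample_grad, g) for c, g in capability_grads.items()}
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp((s - peak) / temperature) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

# Hypothetical two-dimensional gradients.
weights = soft_assign(
    [0.9, 0.1],
    {"recognition": [1.0, 0.0], "reasoning": [0.0, 1.0]},
)
```

A soft assignment lets one sample count toward several capabilities, which matters when a single instruction exercises both recognition and reasoning.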

Curator

Select and sequence data subsets

Model or implementation: Heuristic Algorithm
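
One possible shape for the budget allocation and pool sampling steps, offered as a sketch only. It assumes an even split of the budget across capabilities and a greedy top-weight pick per capability; the summary does not specify either rule. All names and toy weights are hypothetical.

```python
def allocate_budget(total_budget, capabilities):
    """Split the budget evenly; earlier capabilities absorb any remainder."""
    base, extra = divmod(total_budget, len(capabilities))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(capabilities)}

def sample_pools(assignments, budgets):
    """Pick the highest-weight unused samples for each capability.

    assignments: {sample_id: {capability: weight}}
    budgets:     {capability: number of samples to select}
    """
    chosen = {c: [] for c in budgets}
    used = set()
    for cap, quota in budgets.items():
        ranked = sorted(
            assignments,
            key=lambda s: assignments[s].get(cap, 0.0),
            reverse=True,
        )
        for sample_id in ranked:
            if len(chosen[cap]) == quota:
                break
            if sample_id not in used:  # never select the same sample twice
                chosen[cap].append(sample_id)
                used.add(sample_id)
    return chosen

# Hypothetical soft-assignment weights for four training samples.
assignments = {
    "a": {"recognition": 0.9, "reasoning": 0.1},
    "b": {"recognition": 0.8, "reasoning": 0.2},
    "c": {"recognition": 0.2, "reasoning": 0.8},
    "d": {"recognition": 0.1, "reasoning": 0.9},
}
budgets = allocate_budget(2, ["recognition", "reasoning"])
subset = sample_pools(assignments, budgets)
```

Fixing a per-capability quota is what prevents the selection from collapsing onto whichever capability has the largest raw influence scores.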

Novel Architectural Elements

Unsupervised capability discovery module using trajectory clustering on validation tasks
Curriculum sequencing mechanism derived from self-influence temporal profiles
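
To make the idea of sequencing from self-influence temporal profiles concrete, here is a small sketch. It assumes each capability has a curve of mean self-influence per checkpoint and orders capabilities by the checkpoint at which half of the cumulative self-influence has accrued, treating earlier as more fundamental. The half-mass criterion and the toy curves are assumptions for illustration; the paper's actual ordering rule is not given in this summary.

```python
def half_mass_step(curve):
    """First checkpoint index where cumulative self-influence reaches 50% of its total."""
    total = sum(curve)
    running = 0.0
    for step, value in enumerate(curve):
        running += value
        if running >= 0.5 * total:
            return step
    return len(curve) - 1

def order_capabilities(curves):
    """Order capabilities from earliest-learned to latest-learned."""
    return sorted(curves, key=lambda c: half_mass_step(curves[c]))

# Hypothetical self-influence curves over four checkpoints.
curves = {
    "symbolic_reasoning": [0.1, 0.2, 0.5, 1.0],
    "structural_grounding": [0.9, 0.5, 0.2, 0.1],
    "perceptual_recognition": [0.3, 0.8, 0.4, 0.2],
}
curriculum = order_capabilities(curves)
```

Training data attributed to each capability would then be presented in this order rather than shuffled together.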

Modeling

Base Model: LLaVA-v1.5-7B and SmolVLM (256M, 500M, 2.2B)

Training Method: Instruction Tuning (Supervised Fine-Tuning)

Objective Functions:

Purpose: Minimize prediction error on next token.

Formally: Standard Cross-Entropy Loss over response tokens
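
For reference, the standard response-only cross-entropy can be written as below. The symbols are introduced here for illustration: R is the set of response token positions, x_img and x_instr are the image and instruction inputs, and the 1/|R| normalisation is an assumption (some implementations average over the batch instead).

```latex
\mathcal{L}(\theta) = -\frac{1}{|R|} \sum_{t \in R} \log p_\theta\left(y_t \mid x_{\text{img}},\, x_{\text{instr}},\, y_{<t}\right)
```

Instruction and image tokens condition the prediction but contribute no loss terms of their own.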

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (for SmolVLM experiments) or Projector+LLM (standard LLaVA tuning)

Training Data:

Source: LLaVA-1.5 Mix665K
Validation for discovery: MMT-Bench (162 subtasks)

Key Hyperparameters:

optimizer: AdamW
batch_size: Not reported in the paper
learning_rate: Not reported in the paper

Compute: Not reported in the paper

Comparison to Prior Work

vs. LESS: CADC attributes data to discovered latent capabilities rather than explicit tasks, and sequences them.
vs. TIVE/COINCIDE: CADC uses learning dynamics to guide selection rather than static data features or simple diversity metrics.
vs. DeepSpeed Data Efficiency [not cited in paper]: DeepSpeed uses curriculum learning based on difficulty/loss; CADC uses curriculum based on capability type derived from trajectory clustering.

Limitations

Relies on a diverse validation set (MMT-Bench) to discover capabilities; limited by the validation set's coverage.
Computationally intensive to track gradients for all training and validation data.
Requires re-computing trajectories if the base model changes significantly (though transferability is shown).

Reproducibility

Code availability is not stated in the text. The method computes gradients for all training samples, which can be computationally expensive; how often these gradients are computed during training is not specified. MMT-Bench is used as the validation probe.

📊 Experiments & Results

Evaluation Setup

Multimodal instruction tuning followed by zero-shot evaluation on diverse benchmarks.

Benchmarks:

LLaVA-Wild Bench (in-the-wild chat)
MMT-Bench (Multimodal multi-task)
HallusionBench (Hallucination evaluation)
SEED-Bench (General multimodal evaluation)
VQAv2 (Visual Question Answering)

Metrics:

Accuracy
Score (custom per benchmark)
Statistical methodology: Not explicitly reported in the paper

Key Results

Results demonstrating CADC's efficiency on SmolVLM-256M, surpassing full dataset performance with small subsets:
Benchmark	Metric	Baseline	This Paper	Δ
Average across benchmarks	Relative Performance (%)	100.0	107.1	+7.1

Comparison against SOTA pruning methods on LLaVA-v1.5-7B showing superiority with smaller budgets:
Benchmark	Metric	Baseline	This Paper	Δ
LLaVA-Wild	Score	Not reported in the paper	Not reported in the paper	Not reported in the paper
HallusionBench	Score	Not reported in the paper	Not reported in the paper	Not reported in the paper
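
A relative performance figure such as 107.1% is commonly computed in the data-pruning literature as the mean of per-benchmark ratios against the full-data model. The sketch below shows that convention; whether the paper uses exactly this definition is an assumption, and the scores in the example are hypothetical placeholders, not results from the paper.

```python
def relative_performance(subset_scores, full_scores):
    """Mean over benchmarks of (subset score / full-data score), in percent."""
    ratios = [subset_scores[b] / full_scores[b] for b in full_scores]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical scores on two made-up benchmarks.
rel = relative_performance(
    {"bench_a": 55.0, "bench_b": 40.0},
    {"bench_a": 50.0, "bench_b": 40.0},
)
```

Under this definition a value above 100 means the subset-trained model beats the full-data model on average, even if it trails on some individual benchmarks.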

Experiment Figures

Performance comparison of CADC vs baselines on SmolVLM-256M and visualization of capability misalignment.

Data-capability alignment and self-influence curves.

Main Takeaways

CADC with 5% data consistently outperforms full-data training (100% budget) across multiple models (SmolVLM, LLaVA-7B).
Discovered capabilities often diverge from human-defined task labels (e.g., 'hallucination' tasks split into recognition vs reasoning capabilities).
Sequencing training data (Structural Grounding -> Perceptual Recognition -> Symbolic Reasoning) improves performance over random ordering.
The method transfers well: subsets selected for a small model (256M) work effectively for larger models (2.2B).

📚 Prerequisite Knowledge

Prerequisites

Gradient-based optimization (AdamW)
Influence functions for data attribution
Community detection (clustering algorithms)
Curriculum learning

Key Terms

intrinsic capabilities: Latent skills (e.g., perceptual recognition, symbolic reasoning) inferred from clustering tasks that induce similar model parameter updates

gradient trajectory: The sequence of parameter updates (or gradients) a model undergoes while learning a specific task or data point

AdamW: An adaptive optimizer (Adam with decoupled weight decay) that rescales each gradient using running estimates of its first and second moments; its update vector is used here to represent the 'update signal'

influence estimation: A technique to quantify how much a specific training data point contributes to reducing the loss on a validation data point

community detection: An algorithm (Leiden) used to find clusters in a graph; here applied to the task-similarity graph to group tasks into capabilities

self-influence: The cumulative magnitude of a data point's own gradient over training; used to measure learning difficulty

curriculum sequencing: Ordering training data so that fundamental skills are learned before complex ones