Evaluation Setup
Models are instruction-tuned on multimodal data, then evaluated zero-shot on a diverse set of benchmarks.
Benchmarks:
- LLaVA-Wild Bench (in-the-wild chat)
- MMT-Bench (Multimodal multi-task)
- HallusionBench (Hallucination evaluation)
- SEED-Bench (General multimodal evaluation)
- VQAv2 (Visual Question Answering)
Metrics:
- Accuracy
- Score (custom per benchmark)
- Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| Results demonstrating CADC's efficiency on SmolVLM-256M, surpassing full dataset performance with small subsets. |
| Average across benchmarks | Relative Performance (%) | 100.0 | 107.1 | +7.1 |
| Comparison against SOTA pruning methods on LLaVA-v1.5-7B showing superiority with smaller budgets. |
| LLaVA-Wild | Score | Not reported in the paper | Not reported in the paper | Not reported in the paper |
| HallusionBench | Score | Not reported in the paper | Not reported in the paper | Not reported in the paper |
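The "Relative Performance (%)" row sets full-data training to 100.0 and reports the subset-trained model against it. This summary does not give the exact formula, so the sketch below assumes one common convention: the mean of per-benchmark score ratios, expressed as a percentage. The function name `relative_performance`, the benchmark keys, and all scores are hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_performance(subset_scores, full_scores):
    """Mean of per-benchmark ratios (subset / full), as a percentage.

    Assumed convention, not confirmed by the paper: full-data training
    scores 100.0 by construction.
    """
    if subset_scores.keys() != full_scores.keys():
        raise ValueError("both runs must cover the same benchmarks")
    ratios = [subset_scores[name] / full_scores[name] for name in full_scores]
    return 100.0 * mean(ratios)


# Hypothetical scores, for illustration only (not results from the paper).
full = {"bench_a": 50.0, "bench_b": 40.0}
subset = {"bench_a": 55.0, "bench_b": 42.0}

rel = relative_performance(subset, full)
delta = rel - 100.0  # corresponds to the Δ column when the baseline is 100.0
print(f"relative performance: {rel:.1f}%, delta: {delta:+.1f}")
```

Averaging ratios rather than raw scores keeps benchmarks with different scales (accuracy versus a custom score) from dominating the aggregate, which may be why a relative metric is reported here.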
Main Takeaways
- Training on a 5% subset selected by CADC consistently outperforms full-data training (100% budget) across multiple models (SmolVLM, LLaVA-v1.5-7B).
- Discovered capabilities often diverge from human-defined task labels (e.g., 'hallucination' tasks split into recognition vs reasoning capabilities).
- Sequencing training data (Structural Grounding -> Perceptual Recognition -> Symbolic Reasoning) improves performance over random ordering.
- The method transfers well: subsets selected for a small model (256M) work effectively for larger models (2.2B).
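The sequencing takeaway can be illustrated with a minimal sketch, assuming each training sample already carries a discovered capability label. The stage names come from the list above; the sample format, the `capability` field, and both function names are hypothetical, and the paper's actual scheduling may differ (for example, it might mix stages rather than strictly separate them).

```python
import random

# Stage order as stated in the takeaways; everything else is illustrative.
STAGE_ORDER = [
    "Structural Grounding",
    "Perceptual Recognition",
    "Symbolic Reasoning",
]


def curriculum_order(samples, seed=0):
    """Group samples by stage in STAGE_ORDER, shuffling within each stage."""
    unknown = {s["capability"] for s in samples} - set(STAGE_ORDER)
    if unknown:
        raise ValueError(f"unknown capability labels: {sorted(unknown)}")
    rng = random.Random(seed)
    ordered = []
    for stage in STAGE_ORDER:
        group = [s for s in samples if s["capability"] == stage]
        rng.shuffle(group)  # order inside a stage stays random
        ordered.extend(group)
    return ordered


def random_order(samples, seed=0):
    """Baseline: one global shuffle that ignores capability labels."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical samples, for illustration only.
samples = [
    {"id": 0, "capability": "Symbolic Reasoning"},
    {"id": 1, "capability": "Structural Grounding"},
    {"id": 2, "capability": "Perceptual Recognition"},
    {"id": 3, "capability": "Structural Grounding"},
    {"id": 4, "capability": "Symbolic Reasoning"},
]

for sample in curriculum_order(samples):
    print(sample["id"], sample["capability"])
```

Both orderings contain exactly the same samples, so any performance gap between them would be attributable to ordering alone, which is the comparison the takeaway describes.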