
Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

Peng Sun, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li
arXiv (2026)
MM QA Reasoning

📝 Paper Summary

Data Selection / Data Pruning, Visual Instruction Tuning (VIT)
Conditional Verdict Shift (CVS) selects high-quality visual instruction data by measuring how much the question changes a frozen model's judgment of the answer's validity, prioritizing samples that genuinely require visual reasoning.
Core Problem
Many multimodal instruction samples can be solved using linguistic shortcuts or common sense without looking at the image, providing weak supervision that degrades the model's visual reasoning capabilities.
Why it matters:
  • Datasets are polluted with samples where questions are irrelevant or answers are obvious from text alone, wasting compute and encouraging hallucination.
  • Existing selection methods either train costly proxy models or measure diversity; neither captures whether a specific question actually requires visual evidence.
Concrete Example: A model might correctly answer 'Yes' to 'Is there a dog?' not because it sees a dog, but because the text prior makes 'Yes' the most likely completion. CVS identifies this by checking if removing the question 'Is there a dog?' changes the probability of the answer 'Yes'. If the probability doesn't change, the question didn't matter.
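The check in this example can be written as a single quantity. The form below is a sketch that assumes a plain difference of probabilities; the symbols $I$ (image), $Q$ (question), $A$ (answer), $\theta$ (frozen evaluator) and $\Delta$ are notation introduced here for illustration, and the paper's exact scoring function may differ.

```latex
\Delta(I, Q, A) \;=\; P_\theta(\text{Yes} \mid I, Q, A) \;-\; P_\theta(\text{Yes} \mid I, A)
```

Under this reading, $\Delta \approx 0$ corresponds to the case where removing the question leaves the probability of 'Yes' unchanged, so the question did not matter.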
Key Novelty
Conditional Verdict Shift (CVS)
  • Uses a frozen VLLM as an evaluator to check if the question provides information gain regarding the answer's validity.
  • Compares the probability of the answer being valid (outputting 'Yes') given the full context (Image + Question) versus the reduced context (Image only).
  • Selects samples where the question increases confidence in the answer ('Visual Necessity') and filters out samples where the question increases rejection ('Semantic Conflict').
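The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the two 'Yes' probabilities have already been obtained from the frozen evaluator, that the shift is their simple difference, and that samples with a non-positive shift are dropped before ranking. The names `Sample`, `verdict_shift`, and `select_top_fraction` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    # Assumed precomputed by a frozen evaluator (not shown here):
    p_yes_full: float        # P('Yes' | image, question, answer)
    p_yes_image_only: float  # P('Yes' | image, answer), question removed


def verdict_shift(sample: Sample) -> float:
    """Assumed score: how much the question raises the 'Yes' probability."""
    return sample.p_yes_full - sample.p_yes_image_only


def select_top_fraction(samples: List[Sample], fraction: float) -> List[Sample]:
    """Keep the highest-shift samples, up to `fraction` of the dataset.

    Samples with a shift <= 0 are discarded first (an assumption made here):
    a negative shift means the question increased rejection, and a zero
    shift means the question changed nothing.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    kept = [s for s in samples if verdict_shift(s) > 0.0]
    kept.sort(key=verdict_shift, reverse=True)
    budget = max(1, round(fraction * len(samples)))
    return kept[:budget]
```

Calling `select_top_fraction(samples, 0.10)` would correspond to a 10% data budget; because non-positive shifts are removed first, the result can contain fewer samples than the budget allows.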
Architecture
Figure 1(b): The CVS pipeline for data selection.
Evaluation Highlights
  • Outperforms full-data training on Vision-Flan by 3.5% using only 10% of the data.
  • Surpasses full-data training on Vision-Flan by 4.8% using only 15% of the data.
  • Reduces computational cost by 44.4% compared to the XMAS data selection method on The Cauldron dataset.
Breakthrough Assessment
7/10
Strong efficiency gains (training-free) and clear performance improvements with very small data subsets. The 'conditional shift' intuition is elegant and seemingly effective against linguistic shortcuts.