
Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Yehonatan Elisha, Oren Barkan, Noam Koenigstein
Tel Aviv University, The Open University
arXiv (2026)
MM Benchmark Factuality

📝 Paper Summary

Computer Vision Robustness Model Interpretability Vision Transformers (ViT)
Concept-Guided Fine-Tuning (CFT) improves Vision Transformer robustness by fine-tuning models so that their internal relevance maps align with fine-grained semantic concept masks generated by LLMs and VLMs, rather than with coarse foreground masks or spurious backgrounds.
Core Problem
Vision Transformers (ViTs) often rely on spurious correlations (like background textures) instead of semantic features, leading to failure on out-of-distribution data.
Why it matters:
  • Models that rely on shortcuts (e.g., 'blue background' = 'bird') fail catastrophically in real-world deployments where contexts vary (e.g., natural adversarial examples).
  • Existing regularization methods rely on binary foreground-background masks, which are too coarse to capture an object's internal semantic structure (e.g., 'beak' vs 'wing').
  • Prior approaches often require expensive ground-truth segmentation masks or full model retraining, limiting scalability.
Concrete Example: A baseline model misclassifies a 'common newt' as a 'scorpion' because it attends to the textured background. After CFT, the model focuses on the newt's body and predicts the correct class (see Figure 2 in paper).
Key Novelty
Concept-Guided Fine-Tuning (CFT)
  • Uses an LLM to generate text concepts (e.g., 'long beak') and a VLM to ground them visually, creating semantic masks without manual annotation.
  • Optimizes the model's internal relevance maps (computed via AttnLRP) to align with these semantic masks while suppressing relevance assigned to the background.
  • Introduces a lightweight fine-tuning stage requiring only a few examples (3 images per class) to steer pre-trained models.
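The alignment idea in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`union_concept_masks`, `concept_alignment_loss`), the loss form (a foreground term plus a background penalty weighted by `lambda_bg`), and the normalization are all hypothetical, and the relevance map is a plain array standing in for real AttnLRP output.

```python
import numpy as np

def union_concept_masks(concept_masks):
    """Merge per-concept binary masks (e.g. 'beak', 'wing') into one semantic mask."""
    stacked = np.stack(concept_masks).astype(bool)
    return stacked.any(axis=0).astype(np.float64)

def concept_alignment_loss(relevance, concept_mask, lambda_bg=1.0, eps=1e-8):
    """Hypothetical loss: reward relevance inside the concept mask, penalize it outside.

    relevance    : (H, W) array of relevance scores (stand-in for an AttnLRP map)
    concept_mask : (H, W) array in {0, 1}, 1 where a grounded concept is present
    lambda_bg    : assumed weight on the background-suppression term
    """
    # Keep positive relevance only and normalize so the map sums to (about) 1.
    rel = np.clip(relevance, 0.0, None)
    rel = rel / (rel.sum() + eps)
    fg_mass = (rel * concept_mask).sum()          # share of relevance on concepts
    bg_mass = (rel * (1.0 - concept_mask)).sum()  # share of relevance on background
    # Assumed form: push concept mass toward 1 and background mass toward 0.
    return (1.0 - fg_mass) + lambda_bg * bg_mass

# Toy 4x4 example with two made-up concept masks.
beak = np.zeros((4, 4)); beak[0, 0:2] = 1
wing = np.zeros((4, 4)); wing[1:3, 1:3] = 1
mask = union_concept_masks([beak, wing])

print(concept_alignment_loss(mask, mask))        # close to 0: relevance on concepts
print(concept_alignment_loss(1.0 - mask, mask))  # close to 2: relevance on background
```

In an actual fine-tuning loop, a term like this would need to be written in a differentiable framework and combined with the classification loss; the NumPy version here only evaluates the quantity on fixed arrays.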
Evaluation Highlights
  • Improves robustness on natural adversarial examples (ImageNet-A) and viewpoint variations (ObjectNet) across multiple ViT architectures.
  • Requires only 1,500 training images (3 per class for 500 classes, i.e., half of ImageNet's 1,000 classes) to achieve robustness gains.
  • Resulting relevance maps show stronger alignment with ground-truth object segmentation masks compared to baselines.
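One common way to quantify how well a relevance map matches a segmentation mask is intersection-over-union (IoU) after thresholding. The summary does not say which metric the paper uses, so the sketch below is only an illustrative assumption: the function name `relevance_mask_iou` and the choice of the mean relevance as the default threshold are hypothetical.

```python
import numpy as np

def relevance_mask_iou(relevance, gt_mask, threshold=None):
    """IoU between a thresholded relevance map and a ground-truth binary mask.

    relevance : (H, W) array of relevance scores
    gt_mask   : (H, W) array, nonzero where the object is
    threshold : cut-off for "relevant" pixels; defaults to the mean relevance
                (an assumed heuristic, not taken from the paper)
    """
    rel = np.clip(relevance, 0.0, None)
    if threshold is None:
        threshold = rel.mean()
    pred = rel > threshold
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```

A map concentrated on the object scores near 1, while a map concentrated on the background scores near 0, which matches the intuition behind the alignment claim above.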
Breakthrough Assessment
7/10
Offers a practical, automated path to robustness that bridges interpretability and performance. High data efficiency is a significant plus, though reliance on external oracle models (LLM/VLM) for supervision is a constraint.