Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

📝 Paper Summary

Computer Vision Robustness, Model Interpretability, Vision Transformers (ViT)

CFT improves Vision Transformer robustness by fine-tuning models to align their internal relevance maps with fine-grained semantic concept masks generated by LLMs and VLMs, rather than relying on coarse foregrounds or spurious backgrounds.

Core Problem

Vision Transformers (ViTs) often rely on spurious correlations (like background textures) instead of semantic features, leading to failure on out-of-distribution data.

Why it matters:

Models that rely on shortcuts (e.g., 'blue background' = 'bird') fail catastrophically in real-world deployments where contexts vary (e.g., natural adversarial examples).
Existing regularization methods rely on binary foreground-background masks which are too coarse to capture internal semantic structure (e.g., 'beak' vs 'wing').
Prior approaches often require expensive ground-truth segmentation masks or full model retraining, limiting scalability.

Concrete Example: A baseline model misclassifies a 'common newt' as a 'scorpion' because it attends to the textured background. After CFT, the model correctly focuses on the newt's body and corrects the prediction (see Figure 2 in paper).

Key Novelty

Concept-Guided Fine-Tuning (CFT)

Uses an LLM to generate text concepts (e.g., 'long beak') and a VLM to ground them visually, creating semantic masks without manual annotation.
Optimizes the model's relevance maps (computed with AttnLRP) to align with these semantic masks while suppressing background relevance.
Introduces a lightweight fine-tuning stage requiring only a few examples (3 images per class) to steer pre-trained models.

Evaluation Highlights

Improvement on natural adversarial examples (ImageNet-A) and viewpoint variations (ObjectNet) across multiple ViT architectures.
Requires only 1,500 training images (3 per class for half of ImageNet classes) to achieve robustness gains.
Resulting relevance maps show stronger alignment with ground-truth object segmentation masks compared to baselines.

Breakthrough Assessment

7/10

Offers a practical, automated path to robustness that bridges interpretability and performance. High data efficiency is a significant plus, though reliance on external oracle models (LLM/VLM) for supervision is a constraint.

⚙️ Technical Details

Problem Definition

Setting: Post-hoc fine-tuning of a pretrained Vision Transformer f_theta to improve OOD robustness using a small dataset D.

Inputs: Input image I, Pretrained ViT parameters theta

Outputs: Optimized parameters theta* that produce concept-aligned relevance maps and correct class predictions.

Pipeline Flow

Group 1: Guidance Generation (Offline): LLM → Concepts → VLM → Semantic Masks
Group 2: Model Optimization (Online): Image → ViT → Relevance Map → Alignment Loss
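The offline guidance stage can be sketched as below. This is a minimal illustration, not the authors' code: `propose_concepts` and `ground_concept` are hypothetical callables standing in for the LLM and the VLM, and merging the per-concept masks with a pixel-wise union is an assumption the summary does not confirm.

```python
from typing import Callable, Dict, List, Sequence

Mask = List[List[int]]  # binary H x W mask, 1 = pixel covered by a concept


def union_masks(masks: Sequence[Mask]) -> Mask:
    """Pixel-wise OR of equally sized binary masks."""
    if not masks:
        raise ValueError("need at least one mask")
    height, width = len(masks[0]), len(masks[0][0])
    return [
        [int(any(m[r][c] for m in masks)) for c in range(width)]
        for r in range(height)
    ]


def build_guidance(
    class_names: Sequence[str],
    images_by_class: Dict[str, Sequence[str]],
    propose_concepts: Callable[[str], Sequence[str]],  # hypothetical LLM wrapper
    ground_concept: Callable[[str, str], Mask],        # hypothetical VLM wrapper
) -> Dict[str, Mask]:
    """Offline stage: class name -> text concepts -> one semantic mask per image."""
    guidance: Dict[str, Mask] = {}
    for name in class_names:
        concepts = propose_concepts(name)
        for image_id in images_by_class[name]:
            per_concept = [ground_concept(image_id, c) for c in concepts]
            # Assumption: concept masks are merged by union into one region.
            guidance[image_id] = union_masks(per_concept)
    return guidance
```

Because this stage is offline, the masks can be computed once and cached; the online stage then only needs the image, its mask, and the model.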

System Modules

Concept Generator (Guidance Generation)

Propose textual attributes that define a class (e.g., 'wings', 'beak' for bird)

Model or implementation: GPT-4o-mini

Concept Grounder (Guidance Generation)

Spatially localize the text concepts in training images to create masks

Model or implementation: GroundedSAM (GroundingDINO + SAM)

Relevance Extractor (Model Optimization)

Compute the model's internal attribution map for the predicted class

Model or implementation: AttnLRP (Attention-aware Layer-wise Relevance Propagation)

Alignment Optimizer (Model Optimization)

Update model weights to align relevance with concept masks and maintain accuracy

Model or implementation: Gradient-based optimization with AdamW

Novel Architectural Elements

Integration of AttnLRP relevance maps directly into the loss function as a differentiable term, so that the alignment between relevance and concept masks can be optimized during fine-tuning

Modeling

Base Model: Evaluated on ViT-B, DINOv2, DeiT-III, and ConvNeXt-V2

Training Method: Fine-tuning with Relevance Alignment

Objective Functions:

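Notation (inferred from context): p indexes spatial positions, S is the set of positions covered by the concept masks, and Phi_p(I) is the relevance assigned to position p for image I.

Purpose: Maximize relevance inside concept regions.

Formally: L_concept = - sum_{p in S} log(Phi_p(I))

Purpose: Suppress relevance in background regions.

Formally: L_non-concept = sum_{p not in S} Phi_p(I)

Purpose: Maintain classification accuracy/confidence.

Formally: L_cls = CrossEntropy(f_theta(I), OneHot(argmax f_theta(I)))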
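The three objectives can be written out on plain Python lists as a rough sketch. It assumes the relevance map has been flattened and normalized to non-negative values (raw LRP relevance can be negative, and the summary does not say how this is handled); the `eps` clamp is likewise an assumption added to keep the logarithm finite.

```python
import math
from typing import Sequence, Set


def concept_loss(relevance: Sequence[float], concept_idx: Set[int],
                 eps: float = 1e-8) -> float:
    """L_concept = -sum_{p in S} log(Phi_p); eps clamp is an assumption."""
    return -sum(math.log(max(relevance[p], eps)) for p in concept_idx)


def non_concept_loss(relevance: Sequence[float], concept_idx: Set[int]) -> float:
    """L_non-concept = sum_{p not in S} Phi_p."""
    return sum(r for p, r in enumerate(relevance) if p not in concept_idx)


def self_label_cross_entropy(logits: Sequence[float]) -> float:
    """L_cls: cross-entropy against the one-hot of the model's own argmax.

    Equals -log softmax(logits)[argmax] = logsumexp(logits) - max(logits).
    """
    top = max(logits)
    return math.log(sum(math.exp(z - top) for z in logits))
```

Note that L_cls uses the model's own prediction as the label, so it penalizes loss of confidence rather than disagreement with a ground-truth class.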

Adaptation: Full fine-tuning of pretrained weights

Training Data:

Sampled 3 images per class for half of ImageNet-1K classes (500 classes)
Total finetuning dataset size: 1,500 images
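A subset of this shape (500 classes x 3 images = 1,500 images) could be drawn as in the sketch below. The summary does not say how classes or images were selected, so the uniform random sampling, the seed, and the function name `sample_finetune_subset` are all assumptions for illustration.

```python
import random
from typing import Dict, List, Sequence


def sample_finetune_subset(
    images_by_class: Dict[str, Sequence[str]],
    num_classes: int = 500,
    per_class: int = 3,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Pick `num_classes` classes, then `per_class` images from each."""
    rng = random.Random(seed)
    # Sort class names first so the draw is reproducible for a given seed.
    chosen = rng.sample(sorted(images_by_class), num_classes)
    return {c: rng.sample(list(images_by_class[c]), per_class) for c in chosen}
```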

Key Hyperparameters:

learning_rate: Grid search in [5e-7, 5e-6]
batch_size: 8
epochs: 50
lambda_non_concept: 1.2
lambda_concept: 0.5
lambda_align: 0.8
lambda_cls: 0.2
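The listed values can be collected in a small configuration object. The summary does not state how the four lambda weights combine, so the nested form in `total_loss` (an alignment term weighted by `lambda_align`, plus a classification term weighted by `lambda_cls`) is an assumption, as are the names `CFTConfig`, `total_loss`, and `optimizer_steps`. The step count also assumes the final partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CFTConfig:
    lr_min: float = 5e-7   # lower bound of the reported learning-rate search range
    lr_max: float = 5e-6   # upper bound of the reported learning-rate search range
    batch_size: int = 8
    epochs: int = 50
    lambda_non_concept: float = 1.2
    lambda_concept: float = 0.5
    lambda_align: float = 0.8
    lambda_cls: float = 0.2


def total_loss(cfg: CFTConfig, l_concept: float, l_non_concept: float,
               l_cls: float) -> float:
    """Assumed combination of the loss terms; not confirmed by the source."""
    align = cfg.lambda_concept * l_concept + cfg.lambda_non_concept * l_non_concept
    return cfg.lambda_align * align + cfg.lambda_cls * l_cls


def optimizer_steps(num_images: int, cfg: CFTConfig) -> int:
    """Total optimizer steps, assuming the last partial batch is kept."""
    return math.ceil(num_images / cfg.batch_size) * cfg.epochs
```

Under these assumptions, 1,500 images at batch size 8 give 188 steps per epoch and 9,400 steps over 50 epochs.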

Compute: Trained on NVIDIA A100 GPUs

Comparison to Prior Work

vs. GradMask/RRR: CFT uses fine-grained semantic concepts (via VLM) rather than binary foreground masks, and AttnLRP instead of unstable input gradients.
vs. Standard Fine-tuning: CFT needs far fewer training images (1,500 total) to achieve robustness because it supervises where the model's relevance falls, not only which class it predicts.
vs. CAST [not cited in paper]: CAST uses saliency guidance for measurably better grounding, but CFT explicitly integrates linguistic concepts via VLMs.

Limitations

Relies on the quality of external concept generation (LLM) and grounding (VLM) models; errors in masks propagate to the fine-tuning.
Less effective on artistic/abstract datasets (ImageNet-R/Sketch) where background bias is less prevalent.
Requires AttnLRP computation during training, which increases memory and computational cost compared to standard backpropagation.

Reproducibility

Code: https://github.com/yonisGit/cft

Pretrained weights for the baselines come from the timm library. Concept sets were generated with GPT-4o-mini, which is a proprietary dependency.

📊 Experiments & Results

Evaluation Setup

Classification robustness evaluation on Out-of-Distribution (OOD) benchmarks after minimal fine-tuning.

Benchmarks:

ImageNet-A (Natural Adversarial Examples)
ObjectNet (Viewpoint/Background Variation)
SI-Score (Synthetic Invariance: Scale/Rotation)
ImageNet-R (Art/Renditions)

Metrics:

Top-1 Accuracy
Top-5 Accuracy
mIoU (for relevance map alignment)
Statistical methodology: Not explicitly reported in the paper
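For reference, the metrics above can be computed as in the following sketch. Top-k accuracy is standard; for mIoU, the summary does not describe the exact protocol, so binarizing each relevance map at a fixed `threshold` and averaging the per-image IoU is an assumption.

```python
from typing import Sequence


def topk_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int],
                  k: int = 1) -> float:
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


def binary_iou(pred: Sequence[int], target: Sequence[int]) -> float:
    """Intersection over Union of two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # two empty masks count as a match


def mean_iou(relevance_maps: Sequence[Sequence[float]],
             masks: Sequence[Sequence[int]], threshold: float = 0.5) -> float:
    """Mean per-image IoU between thresholded relevance and ground-truth masks."""
    ious = [
        binary_iou([int(r >= threshold) for r in rel], mask)
        for rel, mask in zip(relevance_maps, masks)
    ]
    return sum(ious) / len(ious)
```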

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
ImageNet-1K Subset	Samples per Class	1300	3	-1297
Concept Validation	Total Concepts	0	1852	1852

Experiment Figures

Qualitative comparison of relevance maps and predictions on ImageNet-A and ObjectNet.

Main Takeaways

CFT improves robustness on real-world OOD benchmarks (ImageNet-A, ObjectNet) where background cues are misleading, correcting specific failure modes like misclassifying objects due to texture.
The method is highly data-efficient, showing gains with only 3 images per class (1,500 images total), making it suitable for adapting large pretrained models.
Relevance maps fine-tuned with CFT align significantly better with ground-truth object segmentation masks compared to baselines (GradMask, RRR), validating that the model is learning to look at the object.
Gains are less pronounced on abstract datasets (ImageNet-Sketch, ImageNet-R) where background correlations are naturally minimized.
Robustness improvements generalize to held-out classes not seen during the fine-tuning process, suggesting the model learns a generalizable reasoning mechanism.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Vision Transformers (ViT) and Attention mechanisms
Familiarity with Layer-wise Relevance Propagation (LRP)
Knowledge of Vision-Language Models (VLMs) for segmentation

Key Terms

ViT: Vision Transformer—a neural network architecture for image processing based on the Transformer mechanism originally designed for NLP.

OOD: Out-of-Distribution—data that differs significantly from the training distribution (e.g., sketches vs. photos).

AttnLRP: Attention-aware Layer-wise Relevance Propagation—an interpretability method specifically designed to trace relevance through Transformer attention layers faithfully.

Spurious Correlations: Patterns in data (like background grass for a cow) that are predictive in the training set but are not intrinsic to the class.

GroundedSAM: A model combining Grounding DINO (text-to-box) and SAM (Segment Anything Model) to generate segmentation masks from text prompts.

VLM: Vision-Language Model—a model capable of processing and relating both image and text inputs.

IoU: Intersection over Union—a metric for evaluating segmentation overlap.

LRP: Layer-wise Relevance Propagation—a technique for determining which pixels contributed most to a neural network's decision.