
Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder

Alaa Yasser, Kittipat Phunjanna, Marcos Escudero Viñolo, Catarina Barata, Jenny Benois-Pineau
Affiliations not provided in the extracted text segment
arXiv (2026)
MM Benchmark Factuality

📝 Paper Summary

Mechanistic Interpretability · Fairness and Bias Auditing
This paper proposes a diagnostic pipeline that locates specific attention heads responsible for demographic bias in CLIP by combining residual-stream decomposition with zero-shot concept projections.
Core Problem
Standard fairness audits quantify *that* a model is biased (e.g., prediction disparities) but cannot explain *where* inside the neural network's architecture this bias originates.
Why it matters:
  • Foundation models like CLIP systematically replicate societal stereotypes (e.g., misclassifying female doctors as nurses), and these stereotypes propagate into downstream multimodal systems
  • Existing interpretability methods locate features like color or texture but haven't been adapted to locate demographic bias in discriminative encoders
  • Without knowing the internal source of bias, mitigation strategies rely on global retraining or output calibration rather than precise architectural surgery
Concrete Example: A CLIP-based occupation classifier misclassifies female doctors as nurses at nearly double the rate of male doctors. Current audits report this percentage but cannot identify which specific attention heads in the transformer are routing the 'female' signal to the 'nurse' prediction.
Key Novelty
Bias-Augmented Projected Residual Stream Analysis
  • Injects demographic prototypes into the TextSpan dictionary, forcing bias-related concepts to compete on equal footing with visual concepts for explaining an attention head's variance
  • Adapts Concept Activation Vectors (CAV) to a zero-shot multimodal setting, deriving concept directions from text embeddings rather than training linear probes on labeled image sets
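The zero-shot concept-direction idea in the bullets above can be illustrated with a minimal sketch. It is not the paper's implementation: the toy 3-dimensional vectors stand in for CLIP text embeddings and per-head residual-stream contributions, the head names (`L23.H4`, `L23.H9`) are hypothetical, and scoring heads by the variance of their projection onto the concept direction is an assumed, simplified criterion.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def concept_direction(pos_text_embs, neg_text_embs):
    """Zero-shot concept direction: difference of mean text embeddings.

    No linear probe is trained; the direction comes only from text prompts.
    """
    pos = mean_vector(pos_text_embs)
    neg = mean_vector(neg_text_embs)
    return normalize([p - q for p, q in zip(pos, neg)])

def head_scores(head_contribs, direction):
    """Variance of each head's contribution projected onto the direction.

    head_contribs maps a head name to a list of per-image contribution
    vectors (the head's additive term in the residual stream).
    """
    scores = {}
    for head, vecs in head_contribs.items():
        projs = [sum(a * b for a, b in zip(v, direction)) for v in vecs]
        mu = sum(projs) / len(projs)
        scores[head] = sum((p - mu) ** 2 for p in projs) / len(projs)
    return scores

# Toy stand-ins for text embeddings of gendered prompts (hypothetical values).
female_prompts = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.2]]
male_prompts = [[-1.0, 0.0, 0.2], [-0.9, 0.1, 0.2]]
direction = concept_direction(female_prompts, male_prompts)

# Toy per-image contributions for two hypothetical heads.
head_contribs = {
    "L23.H4": [[0.8, 0.1, 0.0], [-0.7, 0.2, 0.1], [0.9, 0.0, 0.0], [-0.8, 0.1, 0.0]],
    "L23.H9": [[0.05, 0.9, 0.1], [0.0, -0.8, 0.2], [-0.05, 0.7, 0.0], [0.02, -0.9, 0.1]],
}
scores = head_scores(head_contribs, direction)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setup the head whose output varies along the gender direction ranks first, which is the kind of signal a localization pipeline would use to shortlist candidate heads for ablation.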
Evaluation Highlights
  • Ablating 4 identified heads in CLIP ViT-L-14 reduces global gender bias (Cramér’s V) from 0.381 to 0.362, a relative reduction of about 5%, while marginally improving accuracy (+0.42%)
  • Layer-matched random control confirms specificity: random ablation of heads in the same layers does not yield the same bias reduction
  • Class-level analysis shows that a single head in the final layer drives the majority of bias reduction for the most stereotyped profession classes
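For readers unfamiliar with the bias metric reported above, the following is a minimal sketch of Cramér's V computed from a contingency table of demographic group versus predicted class. The summary does not state which variant the paper uses, so this assumes the standard uncorrected form, V = sqrt(χ² / (n · (min(rows, cols) − 1))), and the counts are made-up toy values rather than results from the paper.

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows.

    Uses the plain chi-squared statistic with no bias or continuity
    correction. Returns a value between 0 (independence) and 1.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))

# Rows: demographic groups; columns: predicted classes (toy counts).
biased_table = [[30, 10],
                [10, 30]]
independent_table = [[20, 20],
                     [20, 20]]

v_biased = cramers_v(biased_table)
v_independent = cramers_v(independent_table)
print(v_biased, v_independent)
```

A value of 0 means predictions are statistically independent of the demographic attribute, so an ablation that lowers V while preserving accuracy weakens the association between group and predicted class.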
Breakthrough Assessment
7/10
A strong feasibility study proving demographic bias can be localized to specific heads in discriminative models. While the bias reduction is modest, the methodology for finding 'where' bias lives is novel and rigorous.