Locating Demographic Bias at the Attention-Head Level in CLIP's Vision Encoder

📝 Paper Summary

Mechanistic Interpretability
Fairness and Bias Auditing

This paper proposes a diagnostic pipeline that locates specific attention heads responsible for demographic bias in CLIP by combining residual-stream decomposition with zero-shot concept projections.

Core Problem

Standard fairness audits show *that* a model is biased (e.g., through prediction disparities) but cannot explain *where* inside the network's architecture the bias originates.

Why it matters:

Foundation models like CLIP systematically replicate societal stereotypes (e.g., misclassifying female doctors as nurses), and these stereotypes propagate into downstream multimodal systems
Existing interpretability methods locate features such as color or texture but have not been adapted to locate demographic bias in discriminative encoders
Without knowing the internal source of bias, mitigation strategies rely on global retraining or output calibration rather than precise architectural surgery

Concrete Example: A CLIP-based occupation classifier misclassifies female doctors as nurses at nearly double the rate of male doctors. Current audits report this percentage but cannot identify which specific attention heads in the transformer are routing the 'female' signal to the 'nurse' prediction.

Key Novelty

Bias-Augmented Projected Residual Stream Analysis

Injects demographic prototypes into the TextSpan dictionary, forcing bias-related concepts to compete on equal footing with visual concepts for explaining an attention head's variance
Adapts Concept Activation Vectors (CAV) to a zero-shot multimodal setting, deriving concept directions from text embeddings rather than training linear probes on labeled image sets

Evaluation Highlights

Ablating 4 identified heads in CLIP ViT-L-14 reduces global gender bias (Cramér’s V) from 0.381 to 0.362 while marginally improving accuracy (+0.42%)
Layer-matched random control confirms specificity: random ablation of heads in the same layers does not yield the same bias reduction
Class-level analysis shows that a single head in the final layer drives the majority of bias reduction for the most stereotyped profession classes
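
The layer-matched random control mentioned above can be sketched as follows. This is a minimal illustration, not the paper's code: the head indices are hypothetical, 16 heads per layer is assumed, and excluding the candidate heads from the sampling pool is an assumption the summary does not confirm.

```python
import random

def layer_matched_random_heads(candidate_heads, n_heads_per_layer, seed):
    """For each candidate (layer, head), draw a different head from the same layer."""
    rng = random.Random(seed)
    candidates = set(candidate_heads)
    control = []
    for layer, _ in candidate_heads:
        # Pool: heads in the same layer that are neither candidates nor already drawn.
        pool = [(layer, h) for h in range(n_heads_per_layer)
                if (layer, h) not in candidates and (layer, h) not in control]
        control.append(rng.choice(pool))
    return control

# Hypothetical candidate heads; the real indices are not given in this summary.
candidate_heads = [(23, 4), (23, 9), (22, 1), (21, 7)]
# One control set per seed, matching the 10-seed protocol.
controls = [layer_matched_random_heads(candidate_heads, 16, seed) for seed in range(10)]
```

Each control set is then ablated in the same way as the candidate set, and the resulting bias reductions are compared.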

Breakthrough Assessment

7/10

A strong feasibility study showing that demographic bias can be localized to specific heads in discriminative models. Although the bias reduction is modest (Cramér’s V drops from 0.381 to 0.362), the method for finding where bias lives is novel and rigorous.

⚙️ Technical Details

Problem Definition

Setting: Auditing a pre-trained Vision Transformer (ViT) encoder for demographic bias localization without retraining

Inputs: Images I from the FACET benchmark

Outputs: Ranked list of attention heads (l, h) contributing to demographic bias and bias-mitigated classification predictions

Pipeline Flow

Head Projection (Decompose output into head contributions)
Zero-shot CAV Ranking (Measure alignment with demographic vs. occupation texts)
Threshold Selection (Filter heads by directional specificity and task relevance)
Bias-Augmented TextSpan (Qualitative validation of head semantics)
Validation (Mean ablation of candidate heads)

System Modules

Projected Residual-Stream Decomposition

Decompose the final image representation into additive contributions from individual attention heads

Model or implementation: CLIP ViT-L-14 (Frozen)
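
The additive structure behind this module can be illustrated with synthetic tensors. The sketch below uses toy sizes rather than CLIP's real dimensions, and assumes that normalization and the output projection have already been folded into each per-head contribution. The function name and array layout are illustrative only.

```python
import numpy as np

def decompose_representation(head_contribs, residual_rest):
    """Recover the final image representation as a sum of parts.

    head_contribs: (n_images, n_layers, n_heads, d), per-head contributions
                   already projected into the joint image-text space.
    residual_rest: (n_images, d), everything else (MLPs, embeddings).
    """
    return head_contribs.sum(axis=(1, 2)) + residual_rest

rng = np.random.default_rng(0)
n_images, n_layers, n_heads, d = 5, 4, 3, 8  # toy sizes, not CLIP's real dimensions
head_contribs = rng.normal(size=(n_images, n_layers, n_heads, d))
residual_rest = rng.normal(size=(n_images, d))
final_rep = decompose_representation(head_contribs, residual_rest)

# Because the decomposition is additive, one head's effect can be isolated
# by subtracting its contribution from the final representation.
without_head = final_rep - head_contribs[:, 3, 1, :]
```

This additivity is what makes head-level attribution and ablation possible without retraining the encoder.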

Zero-shot CAV Ranker

Compute alignment of each head's output centroid with demographic text prototypes versus occupation prototypes

Model or implementation: CLIP Text Encoder (for prototype embeddings)
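
A minimal sketch of the ranking idea, assuming cosine similarity is the alignment measure and that heads are ranked by their strongest demographic alignment (both are assumptions). The prototype vectors are hand-made stand-ins for text embeddings, and the prompts in the comments are hypothetical.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def head_alignment(head_contribs, prototypes):
    """Cosine similarity between each head's centroid and each text prototype.

    head_contribs: (n_images, n_heads, d); prototypes: (n_concepts, d).
    Returns an (n_heads, n_concepts) matrix.
    """
    centroids = head_contribs.mean(axis=0)
    return unit(centroids) @ unit(prototypes).T

prototypes = np.array([[1.0, 0.0, 0.0, 0.0],   # stand-in for "a photo of a woman"
                       [0.0, 1.0, 0.0, 0.0],   # stand-in for "a photo of a man"
                       [0.0, 0.0, 1.0, 0.0]])  # stand-in for "a photo of a doctor"
head_contribs = np.zeros((2, 3, 4))            # 2 images, 3 heads, d = 4
head_contribs[:, 0] = [0.0, 0.0, 2.0, 0.0]     # head 0: occupation-aligned
head_contribs[:, 1] = [3.0, 0.0, 0.5, 0.0]     # head 1: mostly demographic
head_contribs[:, 2] = [0.0, 0.0, 0.0, 1.0]     # head 2: unrelated direction

scores = head_alignment(head_contribs, prototypes)
demographic_score = scores[:, :2].max(axis=1)  # best alignment with any demographic
ranking = np.argsort(-demographic_score)       # head 1 ranks first here
```

No labeled image set or trained probe is needed: the concept directions come entirely from the text side.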

Threshold Filter

Select candidate bias heads based on Directional Specificity (gap between top-2 demographics) and Task Relevance

Model or implementation: Grid search (40x60 grid)
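
The filter can be sketched as two thresholds swept over a grid. Here directional specificity is the gap between a head's two highest demographic alignments, as stated above. Treating task relevance as a single per-head score, the threshold ranges, and the assignment of 40 and 60 points to the two axes are all assumptions made for illustration.

```python
import numpy as np

def select_heads(demo_scores, task_scores, spec_thresh, task_thresh):
    """Return indices of heads passing both thresholds.

    demo_scores: (n_heads, n_demographics) alignment scores.
    task_scores: (n_heads,) task-relevance scores.
    """
    top2 = np.sort(demo_scores, axis=1)[:, -2:]   # two largest per head, ascending
    specificity = top2[:, 1] - top2[:, 0]         # gap between top-2 demographics
    keep = (specificity >= spec_thresh) & (task_scores >= task_thresh)
    return np.flatnonzero(keep)

demo_scores = np.array([[0.30, 0.05],   # specific and task-relevant
                        [0.20, 0.18],   # not specific
                        [0.25, 0.02]])  # specific but not task-relevant
task_scores = np.array([0.15, 0.20, 0.01])
selected = select_heads(demo_scores, task_scores, spec_thresh=0.1, task_thresh=0.1)

# Sweep a 40x60 grid of thresholds (hypothetical ranges) and count survivors per cell.
spec_grid = np.linspace(0.0, 0.2, 40)
task_grid = np.linspace(0.0, 0.3, 60)
counts = np.array([[len(select_heads(demo_scores, task_scores, s, t))
                    for t in task_grid] for s in spec_grid])
```

With these toy scores only head 0 survives the (0.1, 0.1) thresholds. The `counts` grid shows how quickly the candidate set shrinks as either threshold tightens.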

Augmented TextSpan

Generate semantic labels for heads to qualitatively verify if they encode demographic concepts

Model or implementation: SVD-based variance explanation
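
The competition between bias terms and visual terms can be illustrated with a greedy variance-explanation loop. This sketch follows the commonly described greedy form of TextSpan and is not necessarily the SVD-based implementation used in the paper. The dictionary entries are toy one-hot vectors rather than real text embeddings, and the two visual labels are invented for the example.

```python
import numpy as np

def textspan(head_out, text_emb, labels, k=2):
    """Greedily pick the k text directions explaining the most head-output variance."""
    C = head_out - head_out.mean(axis=0)                # center the head outputs
    R = text_emb.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=1)
        safe = np.where(norms > 1e-8, norms, np.inf)    # exhausted directions score 0
        explained = ((C @ R.T) / safe) ** 2             # squared projections, (n, m)
        j = int(explained.sum(axis=0).argmax())
        chosen.append(labels[j])
        r = R[j] / norms[j]
        C = C - np.outer(C @ r, r)                      # remove the explained component
        R = R - np.outer(R @ r, r)                      # keep the dictionary orthogonal to r
    return chosen

# Dictionary: two visual terms plus one injected demographic term.
labels = ["striped texture", "red color", "gender_female"]
text_emb = np.eye(3)
# Toy head outputs that vary mostly along the third (demographic) direction.
head_out = np.array([[0.1, 0.0, 2.0],
                     [-0.1, 0.0, -2.0],
                     [0.3, 0.0, 1.0],
                     [-0.3, 0.0, -1.0]])
top = textspan(head_out, text_emb, labels, k=2)
# top == ['gender_female', 'striped texture']
```

If the demographic term were absent from the dictionary, the same head would be labeled with the best available visual term, which is why augmenting the dictionary matters for this audit.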

Novel Architectural Elements

Zero-shot multimodal CAVs derived from text embeddings instead of trained linear probes
Bias-augmented TextSpan dictionary (injecting bias terms to compete with visual terms for variance explanation)

Modeling

Base Model: CLIP ViT-L-14 (pretrained on LAION-2B)

Comparison to Prior Work

vs. DiffLens: Operates on discriminative encoders (CLIP) rather than generative U-Nets; addresses routing via attention heads rather than bottleneck features
vs. TCAV: Zero-shot (uses text embeddings instead of image probes) and operates at head-level granularity rather than layer-level
vs. Standard Fairness Audits: Explains *where* bias is located architecturally rather than just quantifying output disparities

Limitations

Ablation is a diagnostic tool, not a full debiasing strategy (removing one bias direction can displace predictions to another biased state)
Gender analysis excluded non-binary individuals because FACET contains too few samples for this group (<20 images per class)
Age bias was found to be more diffuse and less localizable than gender bias
Threshold selection relies on grid search over the evaluation set, introducing potential circularity (mitigated by random controls)

📊 Experiments & Results

Evaluation Setup

Classification of 42 profession classes using CLIP zero-shot inference, analyzed for fairness across demographic groups

Benchmarks:

FACET (fairness in computer vision; classification task)

Metrics:

Cramér’s V (Bias magnitude)
Classification Accuracy
Chi-squared statistic
Statistical methodology: Chi-squared tests with Benjamini–Hochberg correction (FDR alpha=0.05). Comparison against 10-seed layer-matched random control.
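
For readers unfamiliar with the Benjamini–Hochberg step-up procedure, a minimal NumPy sketch follows. The p-values are made up for illustration and are not results from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * rank / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.flatnonzero(below).max()            # largest rank meeting its threshold
        reject[order[:k + 1]] = True               # reject all hypotheses up to that rank
    return reject

# Illustrative per-class p-values (e.g., one chi-squared test per profession class).
rejected = benjamini_hochberg([0.001, 0.030, 0.020, 0.20], alpha=0.05)
# rejected == [True, True, True, False]
```

In practice a library routine such as `statsmodels.stats.multitest.multipletests` with `method="fdr_bh"` performs the same correction.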

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
FACET (42 classes)	Cramér’s V (Global)	0.381	0.362	-0.019
FACET (42 classes)	Accuracy	Not reported in the paper	Not reported in the paper	+0.42%

Main Takeaways

Gender bias is localizable to a small set of terminal-layer attention heads; ablating them reduces bias without harming accuracy.
The feasibility of localization varies by attribute: age bias appears more diffuse and harder to localize than gender bias.
Bias-augmented TextSpan confirms that the identified heads encode demographic concepts (e.g., 'gender_female') rather than only task-relevant features.
A single head in the final layer is responsible for the majority of bias reduction in the most stereotyped classes.

📚 Prerequisite Knowledge

Prerequisites

Transformer architecture (Self-Attention, Residual Streams)
Mechanistic Interpretability concepts
CLIP (Contrastive Language-Image Pre-training) fundamentals

Key Terms

Residual Stream: The main vector pathway in a Transformer, to which each layer adds its output; because these updates are additive, the final representation can be decomposed into per-component contributions

TextSpan: An algorithm that projects attention head outputs into a shared text-image space to assign human-readable labels to what the head encodes

CAV: Concept Activation Vector, a direction in activation space that corresponds to a specific concept (e.g., 'gender'), usually obtained by training a linear classifier on labeled examples

Mean Ablation: Replacing the output of a specific attention head with its average output across the dataset, which removes the head's input-specific signal while preserving its mean contribution to the representation
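
A minimal sketch of mean ablation on precomputed per-head contributions. The array layout `(n_images, n_layers, n_heads, d)` and the toy sizes are assumptions for illustration.

```python
import numpy as np

def mean_ablate(head_contribs, heads):
    """Replace the selected heads' contributions with their dataset mean.

    head_contribs: (n_images, n_layers, n_heads, d)
    heads: iterable of (layer, head) index pairs to ablate.
    """
    ablated = head_contribs.copy()
    for layer, head in heads:
        # The mean over images is broadcast back to every image.
        ablated[:, layer, head, :] = head_contribs[:, layer, head, :].mean(axis=0)
    return ablated

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2, 3, 4))       # toy sizes
ablated = mean_ablate(x, [(1, 2)])
```

After ablation the chosen head contributes the same vector for every image, so it can no longer carry image-specific demographic information, while the dataset-level mean of the representation is unchanged.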

Cramér’s V: A statistical measure of association between two nominal variables (here, demographic group and predicted class), used to quantify bias magnitude
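
Cramér's V is computed from the chi-squared statistic of a contingency table as V = sqrt(χ² / (n · min(r − 1, c − 1))), where n is the total count and r, c are the numbers of rows and columns. The sketch below uses this uncorrected form and assumes that every row and column has a nonzero total. The tables are toy examples, not FACET counts.

```python
import numpy as np

def cramers_v(table):
    """Uncorrected Cramér's V for an r x c contingency table of counts."""
    obs = np.asarray(table, dtype=float)
    n = obs.sum()
    # Expected counts under independence of rows and columns.
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n
    chi2 = ((obs - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * (min(obs.shape) - 1)))

# Rows: demographic group; columns: predicted class.
v_independent = cramers_v([[10, 20], [20, 40]])  # 0.0: predictions ignore the group
v_perfect = cramers_v([[30, 0], [0, 30]])        # 1.0: group fully determines prediction
```

Values lie between 0 and 1, so the reported drop from 0.381 to 0.362 is a small move toward independence between demographic group and predicted class.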

FACET: A benchmark dataset for fairness in computer vision containing images labeled with occupations and demographic attributes

Zero-shot: Performing a task (here, concept definition) without using explicit training examples, relying instead on pre-trained text embeddings