Towards Understanding the Effect of Pretraining Label Granularity

📝 Paper Summary

Transfer Learning, Supervised Pre-training, Label Space Design

Fine-grained pre-training improves downstream performance on coarse tasks by forcing models to learn rare features that are otherwise ignored due to simplicity bias during coarse-grained training.

Core Problem

Neural networks tend to learn only the most common, simple features ('simplicity bias') when trained on coarse labels, failing to learn fine-grained features necessary for recognizing hard samples.

Why it matters:

Large-scale pre-training (e.g., ImageNet21k) is standard, but the specific impact of label hierarchy choice is under-explored
Training on massive datasets with overly coarse labels may waste data potential by not incentivizing the model to learn rich feature representations
Understanding this helps optimize data labeling strategies for foundation models

Concrete Example: In a 'cat vs dog' task, a model might only learn common ear shapes (shortcut), failing on a rare 'Persian cat' where those shapes are obscured; fine-grained 'breed' labels force the model to learn breed-specific textures, fixing this.

Key Novelty

Granularity-Generalization Correspondence

Theoretically proves that coarse training only learns 'common' features, while fine-grained training forces the learning of 'rare' features
Demonstrates that pre-training label granularity must align with the target task's underlying feature hierarchy to be effective
Identifies a 'U-shaped' validation-error curve: both extremely fine labels (e.g., a unique ID per sample) and overly coarse labels hurt downstream performance

Architecture

Theoretical data model illustrating 'common' vs 'fine-grained' features in image patches

Evaluation Highlights

Pre-training on ImageNet21k leaf labels (level 0) achieves 82.51% accuracy on ImageNet1k, 4.60 points above the 77.91% baseline
Pre-training on coarser ImageNet21k levels (e.g., level 9) degrades accuracy to 72.75%, 5.16 points below the 77.91% from-scratch baseline
Manual hierarchies consistently outperform clustering-based or random label hierarchies for transfer learning on iNaturalist 2021

Breakthrough Assessment

7/10

Provides strong empirical and theoretical backing for the intuition that 'finer labels are better,' while adding nuance about alignment and the limits of granularity.

⚙️ Technical Details

Problem Definition

Setting: Fine-to-coarse transfer learning. Pre-train on a source dataset with fine-grained labels, then transfer to a target dataset with coarse-grained labels.

Inputs: Source dataset (images, fine labels), Target dataset (images, coarse labels)

Outputs: Trained classifier for the target coarse label set

Pipeline Flow

Pre-training Phase: Train backbone on Source Data with Fine Labels
Transfer Phase: Discard classification head, keep backbone
Fine-tuning Phase: Train new head (and optionally backbone) on Target Data with Coarse Labels
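
The three phases above can be sketched structurally as follows. This is a minimal, framework-free illustration rather than the paper's code: `Head`, `Classifier`, `pretrain`, `transfer`, and `finetune` are hypothetical names, the "training" steps are placeholders that only record what would be updated, and the class counts (about 21,000 source classes, 1,000 target classes) are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Head:
    """Linear classification head, described only by its shape."""
    in_features: int
    num_classes: int


@dataclass
class Classifier:
    """Backbone plus an optional head (hypothetical structure)."""
    feature_dim: int
    head: Optional[Head] = None
    log: List[str] = field(default_factory=list)


def pretrain(model: Classifier, num_fine_classes: int) -> Classifier:
    # Phase 1: attach a head sized for the fine-grained source labels
    # and (conceptually) train backbone + head with cross-entropy.
    model.head = Head(model.feature_dim, num_fine_classes)
    model.log.append(f"pretrain: backbone + head({num_fine_classes})")
    return model


def transfer(model: Classifier) -> Classifier:
    # Phase 2: discard the pre-training head, keep the backbone.
    model.head = None
    model.log.append("transfer: head discarded")
    return model


def finetune(model: Classifier, num_coarse_classes: int,
             train_backbone: bool = True) -> Classifier:
    # Phase 3: attach a fresh head for the coarse target labels;
    # the backbone is optionally updated as well.
    model.head = Head(model.feature_dim, num_coarse_classes)
    scope = "backbone + head" if train_backbone else "head only"
    model.log.append(f"finetune: {scope}({num_coarse_classes})")
    return model


# Illustrative sizes: 768-dim features, ~21k source classes, 1k target classes.
model = Classifier(feature_dim=768)
model = finetune(transfer(pretrain(model, 21000)), 1000)
```

After the call, `model.head` is sized for the coarse target classes and `model.log` lists the three phases in order. In a real setup each placeholder would wrap an actual training loop.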

System Modules

Backbone

Learn visual features from images

Model or implementation: ViT-B/16 (ImageNet), ResNet34/50 (iNaturalist)

Pre-training Head (Classification)

Classify images into fine-grained source categories

Model or implementation: Linear Layer

Target Head (Classification)

Classify images into coarse-grained target categories

Model or implementation: Linear Layer

Modeling

Base Model: ViT-B/16 (Vision Transformer) for ImageNet experiments; ResNet34/50 for iNaturalist experiments

Training Method: Supervised Pre-training followed by Fine-tuning

Objective Functions:

Purpose: Minimize classification error.

Formally: Cross-Entropy Loss
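
For reference, the standard cross-entropy for one sample with logits z and true class y is -log softmax(z)_y. The sketch below is a plain-Python illustration of that textbook formula; the function name `cross_entropy` is illustrative and the code is not taken from the paper.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# With uniform logits over K classes the loss equals log(K).
uniform_loss_1k = cross_entropy([0.0] * 1000, target=0)
```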

Adaptation: Fine-tuning on target dataset

Key Hyperparameters:

pretraining_epochs: 90 (iNaturalist)
batch_size: Large batch (exact value not specified for ViT; standard for ResNet)
optimizer: Not specified

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Pre-training: Uses significantly finer label granularity (21k vs 1k classes)
vs. Clustering-based labels (Yan et al. 2020): Shows manual hierarchies align better with target tasks than automatic clustering [cited in paper]

Limitations

Theoretical analysis assumes a simplified data distribution with disjoint patches for features
Requires a source dataset with a meaningful label hierarchy that aligns with the target task
Extremely fine granularity (e.g., one label per sample) degrades performance
Focuses primarily on the 'fine-to-coarse' transfer direction

Reproducibility

Code availability is not explicitly mentioned. Experimental setup for ImageNet21k follows Dosovitskiy et al. (2021). Dataset generation for manual hierarchies is described.

📊 Experiments & Results

Evaluation Setup

Pre-train on fine-grained source dataset, fine-tune on coarse-grained target dataset.

Benchmarks:

ImageNet21k to ImageNet1k (Image Classification Transfer)
iNaturalist 2021 (Intra-dataset Transfer, Fine to Superclass)

Metrics:

Top-1 Validation Accuracy
Validation Error
Statistical methodology: Not explicitly reported in the paper
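
As a small illustration of how the two metrics relate, the sketch below computes top-1 accuracy from predicted and true labels and derives the error as its complement. It assumes the reported validation error is the top-1 error; the function name `top1_accuracy` and the sample labels are made up for the example.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of samples whose top prediction equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


acc = top1_accuracy([0, 2, 1, 1], [0, 2, 2, 1])  # 3 of 4 predictions match
err = 1.0 - acc  # error as the complement of accuracy
```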

Key Results

Results on ImageNet21k to ImageNet1k transfer using ViT-B/16 demonstrate that pre-training on finer labels (leaf nodes) yields the best downstream accuracy.
Benchmark	Metric	Baseline	This Paper	Δ
ImageNet1k	Top-1 Validation Accuracy	77.91	82.51	+4.60
ImageNet1k	Top-1 Validation Accuracy	77.91	81.28	+3.37
ImageNet1k	Top-1 Validation Accuracy	77.91	80.26	+2.35
ImageNet1k	Top-1 Validation Accuracy	77.91	72.75	-5.16
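
The Δ column is the reported accuracy minus the 77.91 baseline. A quick check of that arithmetic, using only the figures in the table (variable names are illustrative):

```python
baseline = 77.91
reported = [82.51, 81.28, 80.26, 72.75]  # "This Paper" column, top to bottom

# Round to two decimals to avoid floating-point noise in the differences.
deltas = [round(acc - baseline, 2) for acc in reported]
```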

Experiment Figures

Validation accuracy on ImageNet1k plotted against the number of pre-training classes (granularity) in ImageNet21k

Validation error on iNaturalist 2021 superclasses for different pre-training hierarchies (Manual vs Clustering vs Random)

Main Takeaways

Label granularity matters: Pre-training on leaf labels (the finest level of the hierarchy) outperforms both coarser levels and the baseline.
Coarse pre-training can be harmful: Using very coarse pre-training labels (e.g., 38 classes for ImageNet21k) performs worse than training from scratch (baseline).
Hierarchy alignment is critical: On iNaturalist, manual hierarchies outperform clustering-based labels, showing that semantic alignment between source and target labels is necessary for effective transfer.
Granularity has a 'sweet spot': Validation error follows a U-shaped curve; unique-label-per-sample (extreme fine-grained) pre-training fails, as does extreme coarse-grained pre-training.

📚 Prerequisite Knowledge

Prerequisites

Transfer learning (Pre-training and Fine-tuning)
Image Classification basics
Feature learning theory (Simplicity Bias)

Key Terms

Fine-to-coarse transfer: A transfer learning setting where the pre-training task has more specific classes (e.g., breeds) than the downstream target task (e.g., species)

Simplicity bias: The tendency of neural networks to rely on the simplest/most common features to solve a task, often ignoring complex but useful features

ImageNet21k: A large-scale image database with approximately 21,000 categories, commonly used for pre-training vision models

Leaf labels: The most specific labels at the bottom of a hierarchy tree (e.g., specific dog breeds in WordNet)

WordNet: A lexical database of English used to structure the class hierarchy in ImageNet

Label granularity: The level of detail in class labels; high granularity means many specific classes, low granularity means fewer broad classes
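
To make granularity concrete, coarser label sets can be derived from leaf labels by walking up a class hierarchy. The sketch below assumes a hypothetical toy hierarchy stored as a child-to-parent map; the class names and the `coarsen` helper are illustrative and are not taken from the paper or from WordNet. Level 0 denotes leaf labels, matching the numbering used for ImageNet21k above.

```python
from typing import Dict, List

# Hypothetical toy hierarchy (child -> parent); the root has no entry.
PARENT: Dict[str, str] = {
    "persian": "cat", "siamese": "cat",
    "beagle": "dog", "poodle": "dog",
    "cat": "animal", "dog": "animal",
}


def coarsen(label: str, levels_up: int) -> str:
    """Replace a label by its ancestor `levels_up` steps up, stopping at the root."""
    for _ in range(levels_up):
        if label not in PARENT:
            break
        label = PARENT[label]
    return label


leaves: List[str] = ["persian", "siamese", "beagle", "poodle"]
# Number of distinct classes at each granularity level (0 = leaf labels).
classes_per_level = [len({coarsen(leaf, k) for leaf in leaves}) for k in range(3)]
```

In this toy case the label set shrinks from four breeds to two species to a single root class, which mirrors how moving up the hierarchy reduces the number of pre-training classes.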